
Use LSTM tutorial code to predict next word in a sentence?


Main Question

Loading words

Load custom data instead of using the test set:

reader.py@ptb_raw_data

    test_path = os.path.join(data_path, "ptb.test.txt")
    test_data = _file_to_word_ids(test_path, word_to_id)  # change this line

test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...
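
If you want to feed an arbitrary sentence instead of a file, a minimal sketch (not part of the tutorial) of the id conversion could look like the following; it assumes word_to_id was built by reader._build_vocab and that the PTB vocabulary contains an "<unk>" entry for out-of-vocabulary words:

    # Hypothetical helper: convert a raw sentence into a list of PTB word ids.
    # Assumes `word_to_id` comes from reader._build_vocab and maps "<unk>"
    # to the id used for out-of-vocabulary words.
    def sentence_to_ids(sentence, word_to_id):
        unk_id = word_to_id["<unk>"]
        return [word_to_id.get(word, unk_id) for word in sentence.lower().split()]

    ids = sentence_to_ids("the stock market fell", word_to_id)  # -> a list of ids like the example above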

Displaying predictions

We need to return the output of the FC layer (logits) in the call to sess.run

ptb_word_lm.py@PTBModel.__init__

    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

    fetches = {
        "cost": model.cost,
        "final_state": model.final_state,
        "top_word_id": model.top_word_id,  # add this line
    }

Later in the function, vals['top_word_id'] will contain an array of integers with the id of the predicted (top) word at each step. Since word_to_id maps words to ids, you need the inverted mapping to decode these ids back into words (see the sketch below). I did this a while ago with the small model, and the top-1 accuracy was pretty low (20-30% IIRC), even though the perplexity matched what was predicted in the header.
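
A minimal sketch of that decoding step (variable names vals and word_to_id as above):

    # Invert the word -> id mapping and decode the predicted ids.
    id_to_word = {v: k for k, v in word_to_id.items()}
    predicted_words = [id_to_word[i] for i in vals["top_word_id"].flatten()]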

Subquestions

Why use a random (uninitialized, untrained) word-embedding?

You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating embedding as a black box, it shows how it works.

Why use softmax?

The final prediction is not determined by the cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that maps the hidden state to a vector of scores (logits) over the vocabulary, from which the final word is chosen.

Here's a sketch of the operations and dimensions in the neural net:

word -> one hot code (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)
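
In code, the same pipeline looks roughly like the sketch below (TF1-style; all variable names and sizes here are illustrative, not the tutorial's exact code):

    # Shape sketch of the pipeline above for a single word.
    import tensorflow as tf

    vocab_size, hidden_size = 10000, 200
    word_id = tf.placeholder(tf.int32, shape=[1])                       # word -> id
    embedding = tf.get_variable("embedding", [vocab_size, hidden_size])
    emb = tf.nn.embedding_lookup(embedding, word_id)                    # (1, hidden_size)
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    state = cell.zero_state(1, tf.float32)
    output, state = cell(emb, state)                                    # (1, hidden_size)
    softmax_w = tf.get_variable("softmax_w", [hidden_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])
    logits = tf.matmul(output, softmax_w) + softmax_b                   # (1, vocab_size)
    probs = tf.nn.softmax(logits)                                       # (1, vocab_size)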

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.

LSTM equations
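
For reference, one standard formulation of those update equations (notation may differ slightly from the linked source) is:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $x_t$ can have any dimension $d$; only the shapes of the $W_{x\cdot}$ matrices (hidden_size x $d$) depend on it.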

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

I don't know, sorry.


My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

I.e. I'm trying to write a function with the signature: getNextWord(model, sentencePrefix)

Before I explain my answer, first a remark about your suggestion to # Call static_rnn(cell) once for each word in prefix to initialize state: keep in mind that static_rnn does not return a value like a numpy array, but a tensor. You can evaluate a tensor to a value when it is run (1) in a session (a session keeps the state of your computational graph, including the values of your model parameters) and (2) with the input that is necessary to calculate the tensor's value. Input can be supplied using input readers (the approach in the tutorial) or using placeholders (what I will use below).
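
A toy illustration of that point (TF1-style; not part of the tutorial): the tensor y below has no value until it is evaluated in a session with its placeholder input fed in.

    import tensorflow as tf

    x = tf.placeholder(tf.int32, shape=[])     # input placeholder
    y = x * 2                                  # a tensor, not a number
    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: 21}))  # evaluating y yields 42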

Now for the actual answer: the model in the tutorial was designed to read input data from a file. The answer of @user3080953 already showed how to work with your own text file, but as I understand it you need more control over how the data is fed to the model. To do this you will need to define your own placeholders and feed the data to these placeholders in the call to session.run().

In the code below I subclassed PTBModel and made it responsible for explicitly feeding data to the model. I introduced a special PTBInteractiveInput that has an interface similar to PTBInput so you can reuse the functionality in PTBModel. To train your model you still need PTBModel.

    import numpy as np
    import tensorflow as tf


    class PTBInteractiveInput(object):

      def __init__(self, config):
        self.batch_size = 1
        self.num_steps = config.num_steps
        self.input_data = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
        self.sequence_len = tf.placeholder(dtype=tf.int32, shape=[])
        self.targets = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])


    class InteractivePTBModel(PTBModel):

      def __init__(self, config):
        input_ = PTBInteractiveInput(config)
        PTBModel.__init__(self, is_training=False, config=config, input_=input_)
        # logits has shape [batch_size, num_steps, vocab_size]; pick the step
        # corresponding to the last word of the prefix.
        output = self.logits[:, self._input.sequence_len - 1, :]
        self.top_word_id = tf.argmax(output, axis=1)  # output is 2-D here, so axis=1

      def get_next(self, session, prefix):
        prefix_array, sequence_len = self._preprocess(prefix)
        feeds = {
          self._input.sequence_len: sequence_len,
          self._input.input_data: prefix_array,
        }
        fetches = [self.top_word_id]
        result = session.run(fetches, feeds)
        return self._postprocess(result)

      def _preprocess(self, prefix):
        num_steps = self._input.num_steps
        seq_len = len(prefix)
        if seq_len > num_steps:
          raise ValueError("Prefix too large for model.")
        prefix_ids = self._prefix_to_ids(prefix)
        num_items_to_pad = num_steps - seq_len
        prefix_ids.extend([0] * num_items_to_pad)
        prefix_array = np.array([prefix_ids], dtype=np.int32)  # input_data is int32
        return prefix_array, seq_len

      def _prefix_to_ids(self, prefix):
        # should convert your prefix to a list of ids
        pass

      def _postprocess(self, result):
        # convert ids back to strings
        pass

In the __init__ function of PTBModel you need to add this line:

    self.logits = logits
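
A hedged usage sketch (not from the tutorial; get_config, the "Model" variable scope, and save_path mirror the tutorial setup but are assumptions here, and _prefix_to_ids/_postprocess must be filled in first): build the interactive model, restore the trained weights, and ask for the next word.

    # Hypothetical usage: restore trained weights and query the model.
    config = get_config()                  # same config used for training
    with tf.Graph().as_default():
      with tf.variable_scope("Model", reuse=None):
        model = InteractivePTBModel(config)
      saver = tf.train.Saver()
      with tf.Session() as session:
        saver.restore(session, save_path)  # path where the trained model was saved
        print(model.get_next(session, ["the", "stock", "market"]))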

Why use a random (uninitialized, untrained) word-embedding?

First note that, although the embeddings are random in the beginning, they will be trained with the rest of the network. The embeddings you obtain after training will have similar properties to the embeddings you obtain with word2vec models, e.g., the ability to answer analogy questions with vector operations (king - man + woman = queen, etc.). In tasks where you have a considerable amount of training data, like language modelling (which does not need annotated training data) or neural machine translation, it is more common to train embeddings from scratch.

Why use softmax?

Softmax is a function that normalizes a vector of similarity scores (the logits) into a probability distribution. You need a probability distribution to train your model with cross-entropy loss and to be able to sample from the model. Note that if you are only interested in the most likely words of a trained model, you don't need the softmax and you can use the logits directly.
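
A small numpy sketch of both points (the logits values are purely illustrative): softmax turns logits into probabilities, and the argmax is the same either way because softmax is monotonic.

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])               # illustrative scores for 3 words
    probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax: [0.659, 0.242, 0.099]
    assert np.argmax(probs) == np.argmax(logits)     # same most-likely word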

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

No, in principle it can be any value. Using a hidden state with a lower dimension than your embedding dimension does not make much sense, however.

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

Here is a self-contained example of initializing an embedding with a given numpy array. If you want the embedding to remain fixed/constant during training, set trainable to False.

    import tensorflow as tf
    import numpy as np

    vocab_size = 10000
    size = 200
    trainable = True
    embedding_matrix = np.zeros([vocab_size, size])  # replace this with code to load your pretrained embedding
    embedding = tf.get_variable("embedding",
                                initializer=tf.constant_initializer(embedding_matrix),
                                shape=[vocab_size, size],
                                dtype=tf.float32,
                                trainable=trainable)
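
For the "load your pretrained embedding" part, one common approach (an assumption here, not part of the tutorial) is to fill embedding_matrix row by row from a gensim word2vec model, leaving zeros (or random vectors) for words that are not in the pretrained vocabulary:

    # Hypothetical sketch: fill embedding_matrix from gensim word2vec vectors.
    # Assumes `word_to_id` is the PTB vocabulary and the pretrained vector
    # dimension matches `size` above.
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    for word, idx in word_to_id.items():
        if word in kv:
            embedding_matrix[idx] = kv[word]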


There are many questions here; I will try to clarify some of them.

how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

The key point here is that next-word generation is really word classification over the vocabulary. So you need a classifier; that is why there is a softmax in the output.

The principle is that, at each time step, the model outputs the next word based on the last word's embedding and its internal memory of the previous words. tf.contrib.rnn.static_rnn automatically combines the inputs into the memory, but we still need to provide the last word's embedding and classify the next word.
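
A minimal sketch of that idea (TF1-style; names like embedding, cell, softmax_w, and softmax_b mirror the tutorial but are assumptions here): run the prefix through static_rnn, then classify the next word from the last output.

    # Hypothetical sketch: predict the next word id from a prefix of word ids.
    # prefix_ids: int32 tensor of shape (1, prefix_len)
    inputs = tf.nn.embedding_lookup(embedding, prefix_ids)     # (1, prefix_len, size)
    inputs = tf.unstack(inputs, axis=1)                        # list of (1, size) tensors
    outputs, state = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)
    logits = tf.matmul(outputs[-1], softmax_w) + softmax_b     # (1, vocab_size)
    next_word_id = tf.argmax(logits, axis=1)                   # id of the predicted next word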

We can use a pre-trained word2vec model: just initialize the embedding matrix with the pre-trained one. I think the tutorial uses a random matrix for the sake of simplicity. The memory size is not tied to the embedding size; you can use a larger memory size to retain more information.

These tutorials are high-level. If you want to understand the details in depth, I would suggest looking at the source code in plain Python/NumPy.