Use LSTM tutorial code to predict next word in a sentence?
Main Question
Loading words
Load custom data instead of using the test set:
reader.py@ptb_raw_data

    test_path = os.path.join(data_path, "ptb.test.txt")
    test_data = _file_to_word_ids(test_path, word_to_id)  # change this line
test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...
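A minimal sketch of building such a list from your own sentence. It assumes word_to_id is the dictionary built by reader.py, that the vocabulary contains an `<unk>` token, and that the text is lowercased like the PTB data. The helper name sentence_to_ids and the toy vocabulary are hypothetical and only there for illustration.

```python
def sentence_to_ids(sentence, word_to_id, unk_token="<unk>"):
    """Map a whitespace-tokenized sentence to a list of word ids."""
    unk_id = word_to_id[unk_token]
    # Unknown words fall back to the id of the <unk> token.
    return [word_to_id.get(word, unk_id) for word in sentence.lower().split()]


# Toy vocabulary standing in for the real word_to_id from reader.py.
word_to_id = {"<unk>": 0, "the": 1, "stock": 2, "market": 3, "fell": 4}

test_data = sentence_to_ids("The stock market crashed", word_to_id)
print(test_data)  # [1, 2, 3, 0]
```

The resulting list can be assigned to test_data in place of the ids read from ptb.test.txt.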
Displaying predictions
We need to return the output of the FC layer (logits) in the call to sess.run.

ptb_word_lm.py@PTBModel.__init__

    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

    fetches = {
        "cost": model.cost,
        "final_state": model.final_state,
        "top_word_id": model.top_word_id  # add this line
    }
Later in the function, vals['top_word_id'] will hold an array of integers with the ID of the top word at each position. Look these up in the inverse of word_to_id (an id-to-word mapping) to determine the predicted words. I did this a while ago with the small model, and the top-1 accuracy was pretty low (20-30% iirc), even though the perplexity matched what was predicted in the header.
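Since word_to_id maps words to ids, the lookup needs the mapping in the other direction. A small sketch, with a toy vocabulary and a made-up array standing in for vals['top_word_id'] (the name id_to_word is my own, not from the tutorial):

```python
# Toy vocabulary standing in for the real word_to_id from reader.py.
word_to_id = {"<unk>": 0, "the": 1, "stock": 2, "market": 3, "fell": 4}

# Invert the mapping once so ids can be looked up directly.
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Stand-in for vals["top_word_id"], shape [batch_size, num_steps].
top_word_id = [[2, 3, 4], [1, 2, 0]]

predicted_words = [[id_to_word[i] for i in row] for row in top_word_id]
print(predicted_words)
```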
Subquestions
Why use a random (uninitialized, untrained) word-embedding?
You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating embedding as a black box, it shows how it works.
Why use softmax?
The final prediction is not determined by the cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that converts the hidden state to a vector of scores with one entry per vocabulary word, i.e. the same size as a one-hot encoding of the final word.
Here's a sketch of the operations and dimensions in the neural net:
word -> one hot code (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)
Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)
Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.
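To make the dimensions concrete, here is a rough NumPy sketch of one forward step through the pipeline sketched above, with an embedding size that deliberately differs from the hidden size. All sizes, weights, and names are made up for illustration; the gate layout (one weight matrix applied to the concatenated input and hidden state) is one common formulation and is not taken from the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size, hidden_size = 50, 8, 16  # embed_size != hidden_size

embedding = rng.normal(size=(vocab_size, embed_size))
# One weight matrix for all four gates, applied to [x, h] concatenated.
# Its first dimension is what adapts to the input size.
W = rng.normal(size=(embed_size + hidden_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)
W_fc = rng.normal(size=(hidden_size, vocab_size)) * 0.1
b_fc = np.zeros(vocab_size)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h, c):
    """One LSTM step; x is (1, embed_size), h and c are (1, hidden_size)."""
    gates = np.concatenate([x, h], axis=1) @ W + b
    i, j, f, o = np.split(gates, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(j)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


word_id = 7
x = embedding[[word_id]]            # (1, embed_size), same as one-hot @ embedding
h = np.zeros((1, hidden_size))
c = np.zeros((1, hidden_size))
h, c = lstm_step(x, h, c)           # (1, hidden_size)
logits = h @ W_fc + b_fc            # (1, vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax, (1, vocab_size)

print(x.shape, h.shape, logits.shape, probs.shape)
```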
How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?
I don't know, sorry.
My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?
I.e. I'm trying to write a function with the signature: getNextWord(model, sentencePrefix)
Before I explain my answer, first a remark about your suggestion to # Call static_rnn(cell) once for each word in prefix to initialize state: keep in mind that static_rnn does not return a value like a numpy array, but a tensor. You can evaluate a tensor to a value when it is run (1) in a session (a session keeps the state of your computational graph, including the values of your model parameters) and (2) with the input that is necessary to calculate the tensor value. Input can be supplied using input readers (the approach in the tutorial), or using placeholders (what I will use below).
Now follows the actual answer: The model in the tutorial was designed to read input data from a file. The answer of @user3080953 already showed how to work with your own text file, but as I understand it you need more control over how the data is fed to the model. To do this you will need to define your own placeholders and feed the data to these placeholders when calling session.run().
In the code below I subclassed PTBModel and made it responsible for explicitly feeding data to the model. I introduced a special PTBInteractiveInput that has an interface similar to PTBInput so you can reuse the functionality in PTBModel. To train your model you still need PTBModel.
    import numpy as np
    import tensorflow as tf


    class PTBInteractiveInput(object):
        def __init__(self, config):
            self.batch_size = 1
            self.num_steps = config.num_steps
            self.input_data = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
            self.sequence_len = tf.placeholder(dtype=tf.int32, shape=[])
            self.targets = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])


    class InteractivePTBModel(PTBModel):
        def __init__(self, config):
            input_ = PTBInteractiveInput(config)
            PTBModel.__init__(self, is_training=False, config=config, input_=input_)
            # Logits at the last real (unpadded) position: [batch_size, vocab_size].
            output = self.logits[:, self._input.sequence_len - 1, :]
            # The time axis is gone after slicing, so the vocabulary is axis 1.
            self.top_word_id = tf.argmax(output, axis=1)

        def get_next(self, session, prefix):
            prefix_array, sequence_len = self._preprocess(prefix)
            feeds = {
                self._input.sequence_len: sequence_len,
                self._input.input_data: prefix_array,
            }
            fetches = [self.top_word_id]
            result = session.run(fetches, feeds)
            return self._postprocess(result)

        def _preprocess(self, prefix):
            num_steps = self._input.num_steps
            seq_len = len(prefix)
            if seq_len > num_steps:
                raise ValueError("Prefix too large for model.")
            prefix_ids = self._prefix_to_ids(prefix)
            # Pad with zeros up to the fixed length of the placeholder.
            num_items_to_pad = num_steps - seq_len
            prefix_ids.extend([0] * num_items_to_pad)
            # The placeholder is int32, so the array must be int32 as well.
            prefix_array = np.array([prefix_ids], dtype=np.int32)
            return prefix_array, seq_len

        def _prefix_to_ids(self, prefix):
            # should convert your prefix to a list of ids
            pass

        def _postprocess(self, result):
            # convert ids back to strings
            pass
In the __init__ function of PTBModel you need to add this line:

    self.logits = logits
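The padding and indexing logic can be checked in isolation with NumPy. This is only a sketch: the random array stands in for the logits the model would return, and the ids are made up. Note that once a single time step is sliced out, the result is two-dimensional, so the argmax runs over axis 1.

```python
import numpy as np

num_steps, vocab_size = 5, 10
prefix_ids = [3, 7, 2]                  # hypothetical ids of the prefix words
seq_len = len(prefix_ids)

# Pad to the fixed length expected by the placeholder.
padded = prefix_ids + [0] * (num_steps - seq_len)
prefix_array = np.array([padded], dtype=np.int32)   # shape (1, num_steps)

# Stand-in for the model's logits: (batch_size, num_steps, vocab_size).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, num_steps, vocab_size))

# Prediction after the last real word; the padded positions are ignored.
output = logits[:, seq_len - 1, :]      # shape (1, vocab_size)
top_word_id = np.argmax(output, axis=1) # shape (1,)
print(prefix_array, top_word_id)
```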
Why use a random (uninitialized, untrained) word-embedding?
First note that, although the embeddings are random in the beginning, they will be trained with the rest of the network. The embeddings you obtain after training will have properties similar to the embeddings you obtain with word2vec models, e.g., the ability to answer analogy questions with vector operations (king - man + woman = queen, etc.). In tasks where you have a considerable amount of training data, like language modelling (which does not need annotated training data) or neural machine translation, it is more common to train embeddings from scratch.
Why use softmax?
Softmax is a function that normalizes a vector of similarity scores (the logits) to a probability distribution. You need a probability distribution to train your model with cross-entropy loss and to be able to sample from the model. Note that if you are only interested in the most likely words of a trained model, you don't need the softmax and you can use the logits directly.
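A short NumPy illustration of both points, using made-up logits for a four-word vocabulary: softmax does not change which word ranks highest, but sampling requires the normalized probabilities.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


logits = np.array([2.0, -1.0, 0.5, 3.0])  # made-up scores, 4-word vocabulary
probs = softmax(logits)

print(probs.sum())                            # sums to 1 up to rounding
print(np.argmax(logits) == np.argmax(probs))  # same top word either way

# Sampling needs the probabilities, not the raw logits.
rng = np.random.default_rng(0)
sampled_id = rng.choice(len(probs), p=probs)
```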
Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)
No, in principle it can be any value. Using a hidden state with a lower dimension than your embedding dimension does not make much sense, however.
How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?
Here is a self-contained example of initializing an embedding with a given numpy array. If you want the embedding to remain fixed/constant during training, set trainable to False.
    import tensorflow as tf
    import numpy as np

    vocab_size = 10000
    size = 200
    trainable = True

    # Replace this with code to load your pretrained embedding.
    embedding_matrix = np.zeros([vocab_size, size])
    embedding = tf.get_variable("embedding",
                                initializer=tf.constant_initializer(embedding_matrix),
                                shape=[vocab_size, size],
                                dtype=tf.float32,
                                trainable=trainable)
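One way to fill embedding_matrix is sketched below. It assumes the pretrained vectors are already available as a plain dictionary from word to vector (how you load them depends on your word2vec file format) and that the ids in word_to_id are contiguous from 0. The function name build_embedding_matrix and the toy data are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_to_id, pretrained, size, seed=0):
    """Return a matrix whose row i holds the vector of the word with id i.

    Words missing from `pretrained` keep a small random vector.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.uniform(-0.1, 0.1, size=(len(word_to_id), size)).astype(np.float32)
    for word, idx in word_to_id.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Toy stand-ins: a real setup would use the word_to_id from reader.py and
# vectors loaded from a word2vec file.
word_to_id = {"<unk>": 0, "king": 1, "queen": 2}
pretrained = {"king": np.ones(4), "queen": np.full(4, 2.0)}

embedding_matrix = build_embedding_matrix(word_to_id, pretrained, size=4)
print(embedding_matrix.shape)
```

The important detail is that row order must follow word_to_id, otherwise the model looks up the wrong vector for each word.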
There are many questions; I will try to clarify some of them.
how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?
The key point here is that next-word generation is actually classification over the vocabulary. So you need a classifier, which is why there is a softmax at the output.
The principle is that, at each time step, the model outputs the next word based on the last word embedding and its internal memory of the previous words. tf.contrib.rnn.static_rnn automatically combines the input into the memory, but we need to provide the last word embedding and classify the next word.
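The loop this describes can be sketched independently of TensorFlow. Below, step_fn is a hypothetical stand-in for one model step (it takes a word id and a state, and returns vocabulary scores and the new state); toy_step is a dummy implementation so the loop can be run. This is greedy decoding, which is only one possible choice; sampling from the softmax output is another.

```python
def generate(step_fn, initial_state, prefix_ids, num_words):
    """Greedy decoding: feed the prefix, then feed each prediction back in."""
    if not prefix_ids:
        raise ValueError("prefix_ids must contain at least one word id")
    state = initial_state
    scores = None
    for word_id in prefix_ids:          # warm up the state on the prefix
        scores, state = step_fn(word_id, state)
    generated = []
    for _ in range(num_words):
        # Classify the next word: pick the id with the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        scores, state = step_fn(next_id, state)
    return generated


def toy_step(word_id, state):
    """Dummy model step over a 5-word vocabulary.

    It always scores (word_id + 1) % 5 highest and counts steps as its state.
    """
    scores = [0.0] * 5
    scores[(word_id + 1) % 5] = 1.0
    return scores, state + 1


print(generate(toy_step, 0, prefix_ids=[0, 1], num_words=4))  # [2, 3, 4, 0]
```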
We can use a pre-trained word2vec model: just initialize the embedding matrix with the pre-trained one. I think the tutorial uses a random matrix for the sake of simplicity. Memory size is not related to embedding size; you can use a larger memory size to retain more information.
These tutorials are high-level. If you want to deeply understand the details, I would suggest looking at the source code in plain python/numpy.