
Load Pre-Computed Vectors with Gensim


The GloVe dump from the Stanford site is in a format that is slightly different from the word2vec format: a GloVe file lacks the header line that states the vocabulary size and vector dimensionality. You can convert the GloVe file into word2vec format using:

python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
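The same conversion can also be done from inside Python. A minimal sketch, assuming a gensim 3.x install where the glove2word2vec helper is available:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Prepends the missing 'vocab_size dimensions' header; the vectors themselves
# are copied unchanged.
glove2word2vec('glove.840B.300d.txt', 'glove.840B.300d.w2vformat.txt')

# The converted file is plain text, so the default binary=False applies.
glove_model = KeyedVectors.load_word2vec_format('glove.840B.300d.w2vformat.txt')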


You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'): word2vec

Extract the file, and then you can load it in Python like this:

import os
import gensim

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')

EDIT (May 2017): As the above code is now deprecated, this is how you'd load the vectors now:

model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
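Once loaded, the model supports the usual KeyedVectors queries. A quick sketch (the printed values are illustrative):

print(model['dog'].shape)                 # the raw 300-dimensional vector for a word
print(model.similarity('dog', 'cat'))     # cosine similarity between two words
print(model.most_similar('dog', topn=3))  # nearest neighbours in the vector space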


As far as I know, Gensim can load two binary formats, word2vec and fastText, as well as a generic plain text format which can be created by most word embedding tools. The generic plain text format looks like this (in this example 20000 is the size of the vocabulary and 100 is the dimensionality of the vectors):

20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers...]
[19998 more lines...]
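If you train a model yourself, Gensim can write exactly this format. A minimal sketch, assuming gensim 3.x (where the vector size argument is called size; in 4.x it is vector_size):

from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['jumps', 'over', 'the', 'lazy', 'dog']]
toy_model = Word2Vec(sentences, size=100, min_count=1)

# Writes the 'vocab_size dimensions' header, then one 'word v1 v2 ...' line per word.
toy_model.wv.save_word2vec_format('vectors.txt', binary=False)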

Chaitanya Shivade has explained in his answer here how to use a script provided by Gensim to convert the GloVe format (each line: word + vector) into the generic format.

Loading the different formats is easy, but it is also easy to get them mixed up (see the sketch after the three cases below):

import gensim

model_file = 'path/to/model/file'

1) Loading binary word2vec

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)

2) Loading binary fastText

model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)

3) Loading the generic plain text format (which was introduced by word2vec)

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
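A hedged illustration of such a mix-up: pointing the plain text loader at a binary file usually fails loudly while decoding the binary payload (the exact exception depends on the gensim version):

# binary=False is the default, so a binary word2vec file cannot be parsed as text.
try:
    model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
except (UnicodeDecodeError, ValueError) as err:
    print('Looks like a binary file; retry with binary=True:', err)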

If you only plan to use the word embeddings and not to continue to train them in Gensim, you may want to use the KeyedVectors class. This will considerably reduce the amount of memory you need to load the vectors (detailed explanation).

The following will load the binary word2vec format as KeyedVectors:

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)
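A short sketch of what you gain and lose with KeyedVectors (assuming the model loaded above):

# Similarity queries work exactly as before...
print(model.most_similar('dog', topn=3))
print(model.similarity('dog', 'cat'))

# ...but there is no train() method: the hidden-layer weights and vocabulary
# statistics needed to continue training are simply not loaded.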