
Load Pre-Computed Vectors with Gensim


The GloVe dump from the Stanford site is in a format that is slightly different from the word2vec format: a GloVe file lacks the header line that states the vocabulary size and vector dimensionality. You can convert the GloVe file into word2vec format using:

python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
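The same conversion can also be done from inside Python. A minimal sketch, assuming a gensim 3.x install where the glove2word2vec helper is available:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Prepends the missing 'vocab_size dimensions' header; the vectors themselves
# are copied unchanged.
glove2word2vec('glove.840B.300d.txt', 'glove.840B.300d.w2vformat.txt')

# The converted file is plain text, so the default binary=False applies.
glove_model = KeyedVectors.load_word2vec_format('glove.840B.300d.w2vformat.txt')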


You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'): word2vec

Extract the file, and then you can load it in Python like this:

import os
import gensim

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')

EDIT (May 2017): As the above code is now deprecated, this is how you'd load the vectors now:

model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
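Once loaded, the model supports the usual KeyedVectors queries. A quick sketch (the printed values are illustrative):

print(model['dog'].shape)                 # the raw 300-dimensional vector for a word
print(model.similarity('dog', 'cat'))     # cosine similarity between two words
print(model.most_similar('dog', topn=3))  # nearest neighbours in the vector space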


As far as I know, Gensim can load two binary formats, word2vec and fastText, as well as a generic plain text format which can be created by most word embedding tools. The generic plain text format looks like this (in this example 20000 is the size of the vocabulary and 100 is the dimensionality of the vectors):

20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers...]
[19998 more lines...]
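If you train a model yourself, Gensim can write exactly this format. A minimal sketch, assuming gensim 3.x (where the vector size argument is called size; in 4.x it is vector_size):

from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['jumps', 'over', 'the', 'lazy', 'dog']]
toy_model = Word2Vec(sentences, size=100, min_count=1)

# Writes the 'vocab_size dimensions' header, then one 'word v1 v2 ...' line per word.
toy_model.wv.save_word2vec_format('vectors.txt', binary=False)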

Chaitanya Shivade has explained in his answer here how to use a script provided by Gensim to convert the GloVe format (each line: word + vector) into the generic format.

Loading the different formats is easy, but it is also easy to get them mixed up (see the sketch after the three cases below):

import gensim

model_file = 'path/to/model/file'

1) Loading binary word2vec

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)

2) Loading binary fastText

model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)

3) Loading the generic plain text format (which was introduced by word2vec)

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
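A hedged illustration of such a mix-up: pointing the plain text loader at a binary file usually fails loudly while decoding the binary payload (the exact exception depends on the gensim version):

# binary=False is the default, so a binary word2vec file cannot be parsed as text.
try:
    model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
except (UnicodeDecodeError, ValueError) as err:
    print('Looks like a binary file; retry with binary=True:', err)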

If you only plan to use the word embeddings and not to continue to train them in Gensim, you may want to use the KeyedVectors class. This will considerably reduce the amount of memory you need to load the vectors (detailed explanation).

The following will load the binary word2vec format as KeyedVectors:

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)
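A short sketch of what you gain and lose with KeyedVectors (assuming the model loaded above):

# Similarity queries work exactly as before...
print(model.most_similar('dog', topn=3))
print(model.similarity('dog', 'cat'))

# ...but there is no train() method: the hidden-layer weights and vocabulary
# statistics needed to continue training are simply not loaded.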