PyTorch / Gensim - How to load pre-trained word embeddings
I just wanted to report my findings about loading a gensim embedding with PyTorch.
Solution for PyTorch 0.4.0 and newer:
Since v0.4.0 there is a new class method, nn.Embedding.from_pretrained(), which makes loading an embedding very convenient. Here is an example from the documentation.
import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
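One detail worth knowing about from_pretrained() is its freeze argument, which defaults to True, so the loaded vectors are not updated during training unless you ask for it. A minimal sketch, reusing the toy weights from the example above:

```python
import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])

# Default: freeze=True, the pretrained vectors stay fixed during training
frozen = nn.Embedding.from_pretrained(weight)

# freeze=False lets the optimizer fine-tune the vectors
trainable = nn.Embedding.from_pretrained(weight, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```

Whether to fine-tune depends on your task; with a small training set, keeping the vectors frozen is often the safer starting point.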
The weights from gensim can easily be obtained by:
import gensim
import torch

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)  # formerly syn0, which is deprecated
As noted by @Guglie: if you have a full Word2Vec model rather than KeyedVectors, newer gensim versions keep the vectors under model.wv:
weights = torch.FloatTensor(model.wv.vectors)
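Once the weights are loaded, the embedding layer expects row indices, not words, so you also need the word-to-row mapping from the gensim model (in gensim 4.x this is the key_to_index dict on the KeyedVectors object). The sketch below uses a toy NumPy array and a hand-written dict as stand-ins for the real model, so both names are assumptions for illustration only:

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-ins (assumptions): in practice these come from the loaded gensim model,
# e.g. model.vectors and model.key_to_index in gensim 4.x
vectors = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9]], dtype=np.float32)
word_to_index = {"the": 0, "cat": 1, "sat": 2}

embedding = nn.Embedding.from_pretrained(torch.from_numpy(vectors))

# Convert tokens to row indices, then look up their vectors
sentence = ["cat", "sat"]
indices = torch.LongTensor([word_to_index[w] for w in sentence])
embedded = embedding(indices)
print(embedded.shape)  # torch.Size([2, 3])
```

Words missing from the mapping raise a KeyError here; a real pipeline would need some policy for unknown tokens.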
Solution for PyTorch version 0.3.1 and older:
I'm using version 0.3.1, and from_pretrained() isn't available in this version. Therefore I created my own from_pretrained so I can also use it with 0.3.1.
Code for from_pretrained for PyTorch versions 0.3.1 or lower:
def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding
The embedding can then be loaded like this:
embedding = from_pretrained(weights)
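As a quick sanity check, the hand-written helper can be exercised end to end. This is only a sketch: it repeats the function so it runs on its own, uses toy weights instead of real gensim vectors, and assumes a recent PyTorch, where the same code should still run unchanged.

```python
import torch

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

# Toy weights standing in for the gensim vectors
weights = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = from_pretrained(weights)

# Looking up index 1 should return the second row of the weights
lookup = embedding(torch.LongTensor([1]))
print(torch.equal(lookup, weights[1:2]))  # True
print(embedding.weight.requires_grad)     # False
```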
I hope this is helpful for someone.
I think it is easy: just copy the embedding weights from gensim into the weight of the PyTorch embedding layer.
You need to make sure two things are correct: first, the weight tensor has to have the right shape (vocabulary size by embedding dimension); second, the weights have to be converted to the PyTorch FloatTensor type.
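from gensim.models import Word2Vec
import torch
import torch.nn as nn

# Train a gensim model; `reviews` is an iterable of tokenized sentences.
# In gensim 4.0+ the `size` argument is named `vector_size`.
model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)

# Copy the trained vectors into a PyTorch embedding layer
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)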
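The two checks mentioned above (shape and tensor type) can be made explicit before building the layer. The following is a minimal sketch that uses a random NumPy array as a hypothetical stand-in for model.wv.vectors, deliberately in float64 to show the conversion:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for model.wv.vectors: 5 words, 100 dimensions, float64
vectors = np.random.rand(5, 100)

# Convert to a float32 tensor (the FloatTensor type)
weights = torch.from_numpy(vectors).float()

# Check 1: the weights must be 2-dimensional (vocab_size x embedding_dim)
assert weights.dim() == 2
# Check 2: the weights must be float32
assert weights.dtype == torch.float32

embedding = nn.Embedding.from_pretrained(weights)
print(embedding.num_embeddings, embedding.embedding_dim)  # 5 100
```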