Using scikit-learn vectorizers and vocabularies with gensim


Gensim doesn't require Dictionary objects. You can use your plain dict as input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact anything dict-like will do (including dict, Dictionary, SqliteDict...).

(Btw, gensim's Dictionary is a simple Python dict underneath. Not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)


Just to provide a final example: the sparse matrix produced by a scikit-learn vectorizer can be transformed into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping keys and values:

    import gensim

    # assumes vect is a fitted scikit-learn vectorizer and corpus_vect is its
    # sparse output (documents as rows)
    # transform sparse matrix into gensim corpus
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

    # transform scikit vocabulary into gensim dictionary
    vocabulary_gensim = {}
    for key, val in vect.vocabulary_.items():
        vocabulary_gensim[val] = key


I am also running some code experiments using these two. Apparently there's now a way to construct the dictionary from a corpus:

    from gensim.corpora.dictionary import Dictionary

    dictionary = Dictionary.from_corpus(
        corpus_vect_gensim,
        id2word={idx: word for word, idx in vect.vocabulary_.items()},
    )

Then you can use this dictionary for tfidf, LSI or LDA models.