Using scikit-learn vectorizers and vocabularies with gensim

python scikit-learn topic-modeling gensim

Gensim doesn't require Dictionary objects. You can use your plain dict as input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact anything dict-like will do (including dict, Dictionary, SqliteDict...).

(Btw, gensim's Dictionary is a simple Python dict underneath. Not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)

As a final example, the output of scikit-learn's vectorizers can be transformed into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be reused by simply swapping keys and values:

    import gensim

    # transform sparse matrix into gensim corpus
    # (corpus_vect has one row per document, hence documents_columns=False)
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

    # transform scikit vocabulary (word -> id) into gensim dictionary (id -> word)
    vocabulary_gensim = {}
    for key, val in vect.vocabulary_.items():
        vocabulary_gensim[val] = key

I am also running some code experiments using these two. Apparently there's now a way to construct the dictionary from a corpus:

    from gensim.corpora.dictionary import Dictionary

    dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                        id2word=dict((id, word) for word, id in vect.vocabulary_.items()))

Then you can use this dictionary for tfidf, LSI or LDA models.
