
Python TfidfVectorizer throwing "empty vocabulary; perhaps the documents only contain stop words"


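I guess it's because you are passing just one string. The vectorizer expects an iterable of documents, so a bare string is most likely being iterated character by character, and single characters are discarded by the default tokenizer (it only keeps tokens of two or more characters). Try splitting it into a list of strings, e.g.: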

In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'

In [52]: tf = TfidfVectorizer()

In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]:
<6x28 sparse matrix of type '<type 'numpy.float64'>'
    with 31 stored elements in Compressed Sparse Row format>
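For anyone who wants to run this outside IPython, here is a minimal standalone sketch of the same idea. It assumes a reasonably recent scikit-learn; passing `min_df=1` explicitly is my addition so the result does not depend on the version's default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

smallcorp = (
    "Ah! Now I have done Philosophy,\n"
    "I have finished Law and Medicine,\n"
    "And sadly even Theology:\n"
    "Taken fierce pains, from end to end.\n"
    "Now here I am, a fool for sure!\n"
    "No wiser than I was before:"
)

# Treat each line as its own document instead of passing one big string.
docs = smallcorp.split("\n")

# min_df=1 is spelled out here (an assumption for portability across versions).
tf = TfidfVectorizer(min_df=1)
matrix = tf.fit_transform(docs)

print(matrix.shape)  # (6, 28): six documents, 28 distinct terms
print(matrix.nnz)    # 31 stored elements, matching the session above
```

The shape matches the `6x28` matrix from the session: one row per line of the poem, one column per term of two or more characters.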


In version 0.12, we set the default minimum document frequency (min_df) to 2, which means that only words that appear in at least two documents will be considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting. So I guess you are using 0.12, right?
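To see what `min_df` does in practice, here is a small sketch with a made-up three-document corpus (the documents and variable names are mine, purely for illustration). It assumes a recent scikit-learn, where pruning every term raises a `ValueError`; the exact message may differ between versions, so the code does not rely on it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "banana" and "cherry" each occur in two documents,
# "apple" and "durian" in only one.
docs = ["apple banana", "banana cherry", "cherry durian"]

# min_df=1 keeps every term that occurs in at least one document.
vocab_all = sorted(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)

# min_df=2 keeps only terms that occur in at least two documents.
vocab_shared = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print(vocab_all)     # ['apple', 'banana', 'cherry', 'durian']
print(vocab_shared)  # ['banana', 'cherry']

# If no term is shared between documents, min_df=2 prunes everything
# and fitting fails with a ValueError.
error_message = None
try:
    TfidfVectorizer(min_df=2).fit(["apple banana", "cherry durian"])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

So with a small corpus where few words repeat across documents, setting `min_df=1` explicitly is the safe choice regardless of version.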


Alternatively, if you insist on having only one string, you can wrap it in a tuple. Instead of:

smallcorp = "your text"

put it in a one-element tuple (note the trailing comma):

In [22]: smallcorp = ("your text",)

In [23]: tf.fit_transform(smallcorp)
Out[23]:
<1x2 sparse matrix of type '<type 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
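If the input may be either a single string or a collection of strings, a small guard can do the wrapping for you. This is only a sketch: `as_corpus` is a hypothetical helper of my own, not part of scikit-learn, and `min_df=1` is passed explicitly as an assumption so the default does not matter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def as_corpus(text_or_docs):
    """Hypothetical helper: make sure the vectorizer gets a list of documents."""
    if isinstance(text_or_docs, str):
        # A bare string becomes a corpus containing exactly one document.
        return [text_or_docs]
    return list(text_or_docs)


smallcorp = "your text"

tf = TfidfVectorizer(min_df=1)
matrix = tf.fit_transform(as_corpus(smallcorp))

print(matrix.shape)            # (1, 2): one document, two terms
print(sorted(tf.vocabulary_))  # ['text', 'your']
```

A list works just as well as a tuple here; the only thing that matters is that the vectorizer receives an iterable of documents rather than the string itself.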