
adding words to stop_words list in TfidfVectorizer in sklearn


This is how you can do it:

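    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # ENGLISH_STOP_WORDS is a frozenset; union() returns a new frozenset with "book" added.
    # Newer scikit-learn releases require stop_words to be a list, hence list(...).
    my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
    X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # printing the tfidf vectors
    print(X)

    # printing the vocabulary
    print(vectorizer.vocabulary_)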

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

    (0, 1)  0.707106781187
    (0, 0)  0.707106781187
    (1, 3)  0.707106781187
    (1, 2)  0.707106781187
    {'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, the word book is also missing from the list of features because we listed it as a stop word. In other words, TfidfVectorizer accepted the manually added word as a stop word and ignored it when creating the vectors.


This is answered here: https://stackoverflow.com/a/24386751/732396

Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and then pass that copy to the stop_words argument as a list.
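
A minimal sketch of that copy-and-extend approach, assuming a recent scikit-learn release. The extra words and the two sample documents are only illustrative:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# ENGLISH_STOP_WORDS is a frozenset, so it has no add() or update().
# Copying it into a regular set gives a mutable collection.
custom_stop_words = set(ENGLISH_STOP_WORDS)
custom_stop_words.update(["book", "apple"])  # illustrative extra words

# Convert to a list before passing it in; sorted() also makes the order deterministic.
vectorizer = TfidfVectorizer(stop_words=sorted(custom_stop_words))
vectorizer.fit(["This is a green apple.", "This is a machine learning book."])

# The added words should no longer appear among the learned features.
print(sorted(vectorizer.vocabulary_))
```

The original frozenset is left untouched, so other code that relies on the built-in list is not affected.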


For use with scikit-learn you can always use a list as well:

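    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    stop = list(stopwords.words('english'))
    stop.extend('myword1 myword2 myword3'.split())

    # set() drops duplicates; newer scikit-learn releases require stop_words to be a list
    vectorizer = TfidfVectorizer(analyzer='word', stop_words=list(set(stop)))
    vectors = vectorizer.fit_transform(corpus)  # corpus: your own list of documents
    ...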

The only downside of a list compared with a set is that it may end up containing duplicates, which is why I deduplicate it with set() before passing it to TfidfVectorizer.
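
To illustrate the duplicate issue, here is a small pure-Python sketch. The short base list is a hypothetical stand-in for stopwords.words('english'), and dict.fromkeys is just one possible way to deduplicate while keeping the original order:

```python
# Hypothetical base list standing in for stopwords.words('english').
base_stop_words = ["the", "a", "is", "this"]

stop = list(base_stop_words)
# "this" is already in the base list and "myword1" is added twice,
# so the extended list now contains duplicates.
stop.extend("myword1 this myword2 myword1".split())

# dict.fromkeys keeps the first occurrence of each word and preserves order.
unique_stop = list(dict.fromkeys(stop))

print(len(stop), len(unique_stop))
```

Either set(stop) or the order-preserving version above removes the duplicates; the result can then be passed to stop_words as a list.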