
Stopword removal with NLTK


There is an in-built stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al); see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
['foo', 'bar', 'sentence']
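
Note that the stopword list and the tokenizer models ship as separate NLTK data packages; if the snippet above raises a LookupError about a missing resource, a one-time download fixes it. A minimal sketch (the exact package names can vary slightly between NLTK versions):

import nltk

# One-time downloads; safe to re-run (no-op if the data is already present).
nltk.download('stopwords')  # the stopword lists used by stopwords.words(...)
nltk.download('punkt')      # the tokenizer models used by word_tokenize(...)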

I recommend looking at using tf-idf to remove stopwords; see Effects of Stemming on the term frequency?
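
The idea behind this is that words with a very low idf (they appear in nearly every document) behave like stopwords and can be dropped without a hand-made list. A minimal sketch of one way to do it, using scikit-learn's TfidfVectorizer rather than NLTK, purely as an illustration; the documents list and the max_df=0.9 cutoff are made-up values you would tune:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["this is a foo bar sentence",
             "this is another foo sentence",
             "and yet another bar sentence"]

# max_df=0.9 drops any term that occurs in more than 90% of the documents,
# i.e. the corpus-specific "stopwords", without a manual list.
vectorizer = TfidfVectorizer(max_df=0.9)
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # terms that survived the cutoff
print(vectorizer.stop_words_)              # terms removed by max_df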


I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
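
Putting the two pieces together, a small sketch of how the subtraction plays out on a query (the query string here is made up for illustration):

from nltk.corpus import stopwords

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

query = "show me the cats and not the dogs"
kept = [word for word in query.split() if word.lower() not in stop]
print(kept)  # 'and'/'not' survive, ordinary stopwords like 'the' do not
# ['show', 'cats', 'and', 'not', 'dogs']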


@alvas's answer does the job, but it can be done much faster. Assume that you have documents: a list of strings.

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])  # remove it if you need punctuation

for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

Notice that because you are searching in a set (not in a list), each lookup is O(1) on average instead of O(n); in theory that is roughly len(stop_words)/2 times faster, which is significant when you need to process many documents.

For 5,000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
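
If you want to verify the set-versus-list difference yourself, a quick micro-benchmark sketch with the standard library's timeit (the numbers will of course differ per machine):

import timeit

setup = """
from nltk.corpus import stopwords
stop_list = stopwords.words('english')   # plain list, O(n) lookup
stop_set = set(stop_list)                # set, O(1) average lookup
"""

# 'walk' is not a stopword, so the list version scans the whole list.
print(timeit.timeit("'walk' not in stop_list", setup=setup, number=100_000))
print(timeit.timeit("'walk' not in stop_set", setup=setup, number=100_000))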

P.S. In most cases you split the text into words in order to perform some other task, such as classification with tf-idf, so it would most probably be better to use a stemmer as well:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

and then use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.
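
For completeness, a sketch of the full loop with tokenization, stopword filtering, and stemming combined (documents is assumed to be the same list of strings as above):

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

processed = []
for doc in documents:
    # tokenize, drop stopwords, then reduce each remaining word to its stem
    stems = [porter.stem(i.lower())
             for i in wordpunct_tokenize(doc)
             if i.lower() not in stop_words]
    processed.append(stems)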