
Removing non-English words from text using Python


You can use the words corpus from NLTK:

import nltk

words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
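If deciding word by word is too unreliable, one alternative (my own suggestion, not part of the answer above) is to detect the language of whole sentences or phrases instead, for example with the third-party langdetect package (pip install langdetect):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed

print(detect("Andiamo alla spiaggia con il mio amico."))    # expected: 'it'
print(detect("We are going to the beach with my friend."))  # expected: 'en'

This only works on spans long enough to carry a language signal; a single word like "Io" will still be ambiguous.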


On macOS, this code can still raise an exception, so make sure you download the words corpus manually. Importing the nltk library does not fetch the corpus automatically, so you have to download it yourself or you will hit a LookupError.

import nltk

nltk.download('words')
words = set(nltk.corpus.words.words())

Now you can run the same code as in the previous answer:

sent = "Io andiamo to the beach with my amico."sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())

The NLTK documentation doesn't mention this explicitly, but I ran into the issue on GitHub, solved it this way, and it really works. If you don't pass the 'words' argument to nltk.download, the exception can keep coming back on macOS again and again.
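If you would rather not download unconditionally on every run, a common pattern (a sketch of my own, not something the NLTK docs prescribe) is to try loading the corpus and download only when it is missing:

import nltk

try:
    words = set(nltk.corpus.words.words())
except LookupError:
    # corpus not installed yet, e.g. on a fresh macOS setup
    nltk.download('words')
    words = set(nltk.corpus.words.words())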


from nltk.stem.snowball import SnowballStemmer

snow_stemmer = SnowballStemmer(language='english')

# list of words
words = ['cared', 'caring', 'careful']

# stem of each word
stem_words = []
for w in words:
    x = snow_stemmer.stem(w)
    stem_words.append(x)

# stemming results
for w1, s1 in zip(words, stem_words):
    print(w1 + ' ----> ' + s1)
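The stemmer can also be combined with the words-corpus filter from the earlier answers, so that inflected English tokens are kept when their stem is a known dictionary word. This is only a heuristic of my own (stems are not always real words), but as a sketch:

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
words = set(nltk.corpus.words.words())  # assumes the words corpus is downloaded (see above)

def keep(token):
    # keep punctuation/digits, exact dictionary matches,
    # or tokens whose stem is a known word
    return (not token.isalpha()
            or token.lower() in words
            or stemmer.stem(token.lower()) in words)

sent = "Io andiamo caring for the beaches with my amico."
print(" ".join(w for w in nltk.wordpunct_tokenize(sent) if keep(w)))
# expected: 'Io caring for the beaches with my .'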