n-grams in python, four, five, six grams?

python string nltk n-gram

Other users have already given great native Python answers, but here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. That's not because n-grams are hard to read, but because training a model based on n-grams where n > 3 results in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    n = 6
    sixgrams = ngrams(sentence.split(), n)

    # Each item is a tuple of n consecutive tokens
    for grams in sixgrams:
        print(grams)
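To make the sparsity point concrete, here is a minimal sketch using only the standard library. The toy corpus and the helper names `ngram_counts` and `repeated_fraction` are made up for illustration and are not part of nltk. It measures what share of n-gram occurrences belong to an n-gram seen more than once; on real text this share tends to drop quickly as n grows, which is the sparsity problem mentioned above.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count each n-gram (stored as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeated_fraction(tokens, n):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Toy corpus, invented purely for illustration.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in range(1, 7):
    print(n, round(repeated_fraction(tokens, n), 2))
```

Even on this tiny corpus, no 6-gram repeats, so a model trained on it would have a count of one for every 6-gram it has seen.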

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']
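The same sliding-window idea can be wrapped in a reusable function. This is one possible sketch, not the only way to do it: the name `ngrams_zip` is hypothetical, and it uses `zip` over shifted copies of the token list instead of index slicing, so each n-gram comes back as a tuple rather than a list.

```python
def ngrams_zip(tokens, n):
    """Return all n-grams of `tokens` as a list of tuples.

    Zipping n copies of the list, each shifted one position further,
    lines up every run of n consecutive tokens. zip stops at the
    shortest copy, so no explicit length arithmetic is needed.
    """
    return list(zip(*(tokens[i:] for i in range(n))))


sentence = "I really like python, it's pretty awesome.".split()

for gram in ngrams_zip(sentence, 4):
    print(gram)
```

If n is larger than the number of tokens, the result is simply an empty list.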

Using only nltk tools

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output

    get_ngrams('This is the simplest text i could think of', 3)

    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep each n-gram as a sequence of tokens (a tuple) instead of a single joined string, just remove the ' '.join call.
