Fast n-gram calculation

python nlp nltk n-gram

Since you didn't indicate whether you want word-level or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, each represented by a string. The simplest option is to write the n-gram extraction yourself:

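def ngrams(tokens, MIN_N, MAX_N):
    # Yield every slice of tokens whose length is between MIN_N and MAX_N.
    n_tokens = len(tokens)
    for i in range(n_tokens):  # use xrange instead on Python 2
        for j in range(i+MIN_N, min(n_tokens, i+MAX_N)+1):
            yield tokens[i:j]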
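To see what the generator produces, here is a small usage sketch. The token list and the bounds 1 and 2 are made-up illustration values, and the function is repeated (with Python 3's range) so the snippet runs on its own.

```python
def ngrams(tokens, MIN_N, MAX_N):
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

tokens = ["a", "b", "c"]  # hypothetical input
result = list(ngrams(tokens, 1, 2))
print(result)
# [['a'], ['a', 'b'], ['b'], ['b', 'c'], ['c']]
```

Each n-gram comes back as a list slice, grouped by start position rather than by length.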

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
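As a rough pure-Python sketch of that idea, the version below keeps the same two loops but counts each n-gram in a dict instead of yielding it. The name count_ngrams, the choice of tuple keys, and the sample sentence are my assumptions for illustration only.

```python
from collections import defaultdict

def count_ngrams(tokens, MIN_N, MAX_N):
    # Same loops as the generator, but the "action" (counting) is inlined.
    count = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            # Lists are unhashable, so use a tuple as the dict key.
            count[tuple(tokens[i:j])] += 1
    return count

counts = count_ngrams("to be or not to be".split(), 1, 2)
print(counts[("to", "be")])  # 2
print(counts[("be",)])       # 2
```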

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    # Count n-grams, keyed by their space-joined text.
    cdef Py_ssize_t i, j, n_tokens
    count = defaultdict(int)
    join_spaces = " ".join
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i+MIN_N, min(n_tokens, i+MAX_N)+1):
            count[join_spaces(tokens[i:j])] += 1
    return count


A Pythonic, elegant and fast n-gram generation function can be written with zip and the splat (*) operator:

def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])
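This works because zip stops at the shortest of the n shifted copies of the list, so each output tuple is one window of n consecutive items. A short usage sketch follows. The sample sentence is made up, and list() is added because zip returns a lazy iterator in Python 3.

```python
def find_ngrams(input_list, n):
    # Zip n copies of the list, the i-th copy shifted left by i positions.
    return zip(*[input_list[i:] for i in range(n)])

words = "the quick brown fox".split()  # hypothetical input
bigrams = list(find_ngrams(words, 2))
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Note that this yields tuples for a single fixed n, whereas the loop-based version above covers a range of lengths in one pass.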


For character-level n-grams, you could use the following function:

def ngrams(text, n):
    n -= 1
    return [text[i-n:i+1] for i, char in enumerate(text)][n:]
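As a sanity check, the function can be compared against a more direct slicing form. char_ngrams is a hypothetical name for that alternative and is not part of the original answer. The sketch assumes n is at least 1.

```python
def ngrams(text, n):
    n -= 1
    return [text[i-n:i+1] for i, char in enumerate(text)][n:]

def char_ngrams(text, n):
    # Hypothetical alternative: slice a window of width n at each start index.
    return [text[i:i+n] for i in range(len(text) - n + 1)]

print(ngrams("hello", 3))       # ['hel', 'ell', 'llo']
print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```

The direct form avoids building and then discarding the first n-1 partial slices, though for short strings the difference is unlikely to matter.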
