
Fast n-gram calculation


Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented as strings. You can easily write the n-gram extraction yourself:

    def ngrams(tokens, MIN_N, MAX_N):
        n_tokens = len(tokens)
        for i in range(n_tokens):
            # j is the exclusive end index, so tokens[i:j] holds MIN_N to MAX_N tokens
            for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
                yield tokens[i:j]

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
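As a rough illustration of that replacement, here is a pure-Python sketch that counts each n-gram inside the loop instead of yielding it. The name `count_ngrams` and the choice of `collections.Counter` with tuple keys are my own assumptions for the example, not something the question asked for.

```python
from collections import Counter

def count_ngrams(tokens, min_n, max_n):
    """Count every n-gram with min_n <= n <= max_n (hypothetical helper)."""
    counts = Counter()
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            # Slices of a list are lists, which are unhashable,
            # so convert each n-gram to a tuple before using it as a key.
            counts[tuple(tokens[i:j])] += 1
    return counts

counts = count_ngrams("a b a b".split(), 1, 2)
# counts[("a", "b")] == 2, counts[("b", "a")] == 1
```

Tuples keep the tokens separate; if you would rather have string keys, join the slice with a space instead.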

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

    from collections import defaultdict

    def ngrams(tokens, int MIN_N, int MAX_N):
        cdef Py_ssize_t i, j, n_tokens

        count = defaultdict(int)
        join_spaces = " ".join  # look up the bound method once, outside the loops

        n_tokens = len(tokens)
        for i in range(n_tokens):
            for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
                count[join_spaces(tokens[i:j])] += 1

        return count


A pythonic, elegant and fast way to generate n-grams of a fixed size n uses zip and the splat (*) operator:

    def find_ngrams(input_list, n):
        # list() keeps the result a list on Python 3, where zip returns an iterator
        return list(zip(*[input_list[i:] for i in range(n)]))
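To see why this works, it may help to look at the intermediate lists. The sketch below repeats the function so it runs on its own; the sample sentence and the variable names `shifted` and `bigrams` are only illustrative.

```python
def find_ngrams(input_list, n):
    # Build n copies of the list, each starting one item later,
    # then zip them so the k-th tuple holds n consecutive items.
    return list(zip(*[input_list[i:] for i in range(n)]))

tokens = ["the", "quick", "brown", "fox"]

# The two shifted copies that zip walks through in parallel for n = 2:
shifted = [tokens[i:] for i in range(2)]
# [['the', 'quick', 'brown', 'fox'], ['quick', 'brown', 'fox']]

bigrams = find_ngrams(tokens, 2)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

zip stops at the shortest input, which is what trims the incomplete n-grams at the end of the list. Note that this builds n sliced copies of the input, so it trades some memory for brevity.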


For character-level n-grams you could use the following function:

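    def ngrams(text, n):
        n -= 1
        # each slice ends at position i; the first n slices (n already decremented)
        # would start before the string, so drop them
        return [text[i-n:i+1] for i, char in enumerate(text)][n:]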
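For comparison, the same sliding window can be written by slicing forward from each start position, which some readers may find easier to follow. This is a sketch under my own naming (`char_ngrams`), with a small frequency count added as a usage example.

```python
from collections import Counter

def char_ngrams(text, n):
    # Slide a window of width n across the string; the last valid
    # start index is len(text) - n, hence the + 1 in the range.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("banana", 3)
# ['ban', 'ana', 'nan', 'ana']

freq = Counter(trigrams)
# freq['ana'] == 2
```

If the text is shorter than n, the range is empty and the function returns an empty list.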