Computing N Grams using Python


A short Pythonesque solution from this blog:

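def find_ngrams(input_list, n):
    # Zip together n staggered slices of the list.
    # list() is needed on Python 3, where zip returns an iterator.
    return list(zip(*[input_list[i:] for i in range(n)]))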

Usage:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
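The zip approach above builds n sliced copies of the list, so it needs a sequence that supports slicing. As a rough sketch of an alternative, assuming you want to feed it any iterable (a generator of tokens, a file read word by word), a sliding window over a deque yields the same tuples lazily. The name iter_ngrams is made up for this illustration and does not come from the blog post.

```python
from collections import deque
from itertools import islice


def iter_ngrams(iterable, n):
    """Yield n-grams as tuples from any iterable, one at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    # Prime a fixed-size window with the first n items.
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    # Each new item pushes the oldest one out of the window.
    for item in it:
        window.append(item)
        yield tuple(window)


words = ['all', 'this', 'happened', 'more', 'or', 'less']
bigrams = list(iter_ngrams(words, 2))
print(bigrams[:2])  # [('all', 'this'), ('this', 'happened')]
```

If the input is shorter than n, the generator simply yields nothing, which matches what the zip version returns in that case.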


Assuming the input is a string of space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    # Slide a window of n words across the list
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

grams = {}  # maps each n-gram string to its count
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting that all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        # Join each window of n words into one string and count it
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
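The same counting can probably be written more compactly with collections.Counter from the standard library, which takes care of the setdefault bookkeeping. This is only a sketch: the function name count_ngrams is invented here, and it uses split() with no argument, which, unlike split(' ') above, collapses runs of whitespace. That is an assumption about what you want for messy input.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count n-grams of whitespace-separated words in text."""
    # split() with no argument collapses runs of whitespace,
    # unlike split(' ') in the function above.
    words = text.split()
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


counts = count_ngrams('a a a a', 2)
print(counts)                 # Counter({'a a': 3})
print(counts.most_common(1))  # [('a a', 3)]
```

A Counter is a dict subclass, so counts['a a'] works as before, and most_common() gives the n-grams sorted by frequency.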


Use NLTK (the Natural Language Toolkit): tokenize (split) your text into a list of words with its functions, then find the bigrams and trigrams.

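import nltk

# my_text is the string you want to analyze
words = nltk.word_tokenize(my_text)
# In NLTK 3 these return generators; wrap them in list() if you need lists
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)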