Counting bigrams (pairs of adjacent words) in a file using Python
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+", "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, ('realize', 'his'): 1})
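To make the one-liner a little less magical, here is a minimal sketch of what the pairing does, assuming Python 3 (where the built-in zip is lazy and plays the role of izip). The shortened sample sentence and the variable names are made up for illustration only.

```python
from collections import Counter
from itertools import islice

# Illustrative sample, shorter than the sentence used in the answer.
words = "the quick person and the quick dog".split()

# islice(words, 1, None) yields the same words, starting from the second one.
shifted = islice(words, 1, None)

# zip pairs each word with its successor and stops when the shorter
# (shifted) sequence runs out, so the last word never starts a pair.
pairs = list(zip(words, shifted))
print(pairs)

bigram_counts = Counter(pairs)
print(bigram_counts[("the", "quick")])  # how often "the quick" occurs
```

Each tuple in pairs is one bigram, so Counter only has to tally identical tuples.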
Bonus
Get the frequency of any n-gram:
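from collections import Counter
from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        # a is used to peek at the next n items; b remembers the current position
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            # advance the start of the window by one item
            next(b)
            tlst = b
        else:
            # fewer than n items left: no more n-grams
            break

>>> Counter(ngrams(words, 3))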
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
This works with lazy iterables and generators too. So you can write a generator that reads a file line by line and yields words, then pass it to ngrams to consume it lazily without reading the whole file into memory.
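A rough sketch of that idea, assuming Python 3 and a UTF-8 text file. The names iter_words, sliding_ngrams and count_ngrams_in_file are hypothetical helpers, and sliding_ngrams uses a collections.deque window as a stand-in for the tee-based ngrams above (it should yield the same tuples).

```python
import re
from collections import Counter, deque


def iter_words(path):
    """Yield words from a text file, reading one line at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Same \w+ tokenization as in the first example.
            for word in re.findall(r"\w+", line):
                yield word


def sliding_ngrams(iterable, n):
    """Yield each run of n consecutive items as a tuple."""
    window = deque(maxlen=n)  # keeps only the n most recent items
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)


def count_ngrams_in_file(path, n=2):
    """Count n-grams in a file without holding all of its words in memory."""
    return Counter(sliding_ngrams(iter_words(path), n))
```

Something like count_ngrams_in_file("corpus.txt", 2) would then count bigrams, where corpus.txt is a placeholder file name. Because the words are streamed continuously, n-grams may span line breaks; reset the window per line if that is not wanted.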
You can simply use Counter for any n_gram like so:
from collections import Counter
from nltk.util import ngrams

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1, ('did', 'not'): 1, ('his', 'speed'): 1, ('not', 'realize'): 1, ('person', 'bumped'): 1, ('person', 'did'): 1, ('quick', 'person'): 2, ('realize', 'his'): 1, ('speed', 'and'): 1, ('the', 'quick'): 2})
For 3-grams, just change the n_gram to 3:
n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1, ('did', 'not', 'realize'): 1, ('his', 'speed', 'and'): 1, ('not', 'realize', 'his'): 1, ('person', 'did', 'not'): 1, ('quick', 'person', 'bumped'): 1, ('quick', 'person', 'did'): 1, ('realize', 'his', 'speed'): 1, ('speed', 'and', 'the'): 1, ('the', 'quick', 'person'): 2})
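If NLTK is not available, the same "just change n" idea can be sketched with the standard library alone. The helper ngram_counts below is hypothetical (it is not part of NLTK or collections) and assumes whitespace tokenization via text.split(), as in the snippets above.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams of whitespace-separated tokens (hypothetical helper)."""
    tokens = text.split()
    # zip over n staggered slices: tokens[0:], tokens[1:], ..., tokens[n-1:]
    return Counter(zip(*(tokens[i:] for i in range(n))))


text = "the quick person did not realize his speed and the quick person bumped "

for n in (2, 3):
    counts = ngram_counts(text, n)
    # most_common(1) returns a list with one (n-gram, count) pair
    print(n, counts.most_common(1))
```

Unlike a generator-based approach, this builds n slices of the token list, so it is only a reasonable choice when the text fits comfortably in memory.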