
Counting bigrams (pair of two words) in a file using Python


Some itertools magic:

>>> import re
>>> from collections import Counter
>>> from itertools import islice, izip
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print Counter(izip(words, islice(words, 1, None)))

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1,
         ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1,
         ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1,
         ('realize', 'his'): 1})

Bonus

Get the frequency of any n-gram:

from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        # duplicate the iterator: 'a' reads the next n items, 'b' resumes one item later
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1,
         ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1,
         ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1,
         ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1,
         ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

This works with lazy iterables and generators too. So you can write a generator which reads a file line by line, yields words, and pass it to ngrams to consume lazily without reading the whole file into memory.
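For instance, here is a minimal sketch of that idea; the helper name words_from_file and the file name a.txt are only illustrative, and the ngrams generator is the same one defined above:

import re
from collections import Counter
from itertools import tee, islice

def ngrams(lst, n):
    # same lazy n-gram generator as above
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

def words_from_file(path):
    # yield words one line at a time, so the whole file is never held in memory
    with open(path) as f:
        for line in f:
            for word in re.findall(r"\w+", line):
                yield word

# 'a.txt' is just a placeholder file name
print(Counter(ngrams(words_from_file("a.txt"), 2)))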


How about zip()?

import re
from collections import Counter

words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words, words[1:])))


You can simply use Counter for any n-gram size, like so:

from collections import Counter
from nltk.util import ngrams

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))

>>> Counter({('and', 'the'): 1,
         ('did', 'not'): 1,
         ('his', 'speed'): 1,
         ('not', 'realize'): 1,
         ('person', 'bumped'): 1,
         ('person', 'did'): 1,
         ('quick', 'person'): 2,
         ('realize', 'his'): 1,
         ('speed', 'and'): 1,
         ('the', 'quick'): 2})

For 3-grams, just change n_gram to 3:

n_gram = 3
Counter(ngrams(text.split(), n_gram))

>>> Counter({('and', 'the', 'quick'): 1,
         ('did', 'not', 'realize'): 1,
         ('his', 'speed', 'and'): 1,
         ('not', 'realize', 'his'): 1,
         ('person', 'did', 'not'): 1,
         ('quick', 'person', 'bumped'): 1,
         ('quick', 'person', 'did'): 1,
         ('realize', 'his', 'speed'): 1,
         ('speed', 'and', 'the'): 1,
         ('the', 'quick', 'person'): 2})
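Since the result is an ordinary collections.Counter, you can query it directly once it is built; a small usage sketch (the variable name counts is only illustrative):

from collections import Counter
from nltk.util import ngrams

text = "the quick person did not realize his speed and the quick person bumped"
counts = Counter(ngrams(text.split(), 2))

print(counts[('the', 'quick')])   # count of one specific bigram -> 2
print(counts.most_common(3))      # the three most frequent bigrams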