How to compute skipgrams in python?
From the paper that OP links, the following string:
Insurgents killed in ongoing fighting
Yields:
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
With slight modification to NLTK's ngrams
code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):
from itertools import chain, combinationsimport copyfrom nltk.util import ngramsdef pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None): if pad_left: sequence = chain((pad_symbol,) * (n-1), sequence) if pad_right: sequence = chain(sequence, (pad_symbol,) * (n-1)) return sequencedef skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None): sequence_length = len(sequence) sequence = iter(sequence) sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol) if sequence_length + pad_left + pad_right < k: raise Exception("The length of sentence + padding(s) < skip") if n < k: raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)") history = [] nk = n+k # Return point for recursion. if nk < 1: return # If n+k longer than sequence, reduce k by 1 and recur elif nk > sequence_length: for ng in skipgrams(list(sequence), n, k-1): yield ng while nk > 1: # Collects the first instance of n+k length history history.append(next(sequence)) nk -= 1 # Iterative drop first item in history and picks up the next # while yielding skipgrams for each iteration. for item in sequence: history.append(item) current_token = history.pop(0) # Iterates through the rest of the history and # pick out all combinations the n-1grams for idx in list(combinations(range(len(history)), n-1)): ng = [current_token] for _id in idx: ng.append(history[_id]) yield tuple(ng) # Recursively yield the skigrams for the rest of seqeunce where # len(sequence) < n+k for ng in list(skipgrams(history, n, k-1)): yield ng
Let's do some doctest to match the example in the paper:
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
But do note that if n+k > len(sequence)
, it will yield the same effects as skipgrams(sequence, n, k-1)
(this is not a bug, it's a fail safe feature), e.g.
>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))>>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))>>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))>>>>>> print len(three_skip_trigrams), three_skip_trigrams10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]>>> print len(three_skip_fourgrams), three_skip_fourgrams 5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]>>> print len(four_skip_fourgrams), four_skip_fourgrams 5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]>>> print len(four_skip_fivegrams), four_skip_fivegrams 1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
This allows n == k
but it disallow n > k
as shown in the lines :
if n < k: raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")
For understanding sake, let's try to understand the "mystical" line:
for idx in list(combinations(range(len(history)), n-1)): pass # Do something
Given a list of unique items, combinations produce this:
>>> from itertools import combinations>>> x = [0,1,2,3,4,5]>>> list(combinations(x,2))[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
And since the indices of a list of tokens is always unique, e.g.
>>> sent = ['this', 'is', 'a', 'foo', 'bar']>>> current_token = sent.pop(0) # i.e. 'this'>>> range(len(sent))[0,1,2,3]
It's possible to compute the possible combinations (without replacement) of the range:
>>> n = 3>>> list(combinations(range(len(sent)), n-1))[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
If we map the indices back to the list of tokens:
>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
Then we concatenate with the current_token
, we get the skipgrams for the current token and context+skip window:
>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)][('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
So after that we move on to the next word.
EDITED
The latest NLTK version 3.2.5 has the skipgrams
implemented.
Here's a cleaner implementation from @jnothman from the NLTK repo: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538
def skipgrams(sequence, n, k, **kwargs): """ Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allows tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf :param sequence: the source data to be converted into trigrams :type sequence: sequence or iter :param n: the degree of the ngrams :type n: int :param k: the skip distance :type k: int :rtype: iter(tuple) """ # Pads the sequence as desired by **kwargs. if 'pad_left' in kwargs or 'pad_right' in kwargs: sequence = pad_sequence(sequence, n, **kwargs) # Note when iterating through the ngrams, the pad_right here is not # the **kwargs padding, it's for the algorithm to detect the SENTINEL # object on the right pad to stop inner loop. SENTINEL = object() for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL): head = ngram[:1] tail = ngram[1:] for skip_tail in combinations(tail, n - 1): if skip_tail[-1] is SENTINEL: continue yield head + skip_tail
[out]:
>>> from nltk.util import skipgrams>>> sent = "Insurgents killed in ongoing fighting".split()>>> list(skipgrams(sent, 2, 2))[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]>>> list(skipgrams(sent, 3, 2))[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Although this would part entirely from your code and defer it to an external library; you can use Colibri Core (https://proycon.github.io/colibri-core) for skipgram extraction. It's a library written specifically for efficient n-gram and skipgram extraction from big text corpora. The code base is in C++ (for speed/efficiency), but a Python binding is available.
You rightfully mentioned efficiency, as skipgram extraction quickly shows exponential complexity, which may not be a big issue if you only pass a sentence as you did in your input_list
, but becomes problematic if you release it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand each skip of a skipgram to be fillable by at least x distinct n-grams.
import colibricore#Prepare corpus data (will be encoded for efficiency)corpusfile_plaintext = "somecorpus.txt" #input, one sentence per lineencoder = colibricore.ClassEncoder()encoder.build(corpusfile_plaintext)corpusfile = "somecorpus.colibri.dat" #corpus outputclassfile = "somecorpus.colibri.cls" #class encoding outputencoder.encodefile(corpusfile_plaintext,corpusfile)encoder.save(classfile)#Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength maximum ngram/skipgram length)colibricore.PatternModelOptions(mintokens=2,maxlength=8,doskipgrams=True)#Instantiate an empty pattern model model = colibricore.UnindexedPatternModel()#Train the model on the encoded corpus file (this does the skipgram extraction)model.train(corpusfile, options)#Load a decoder so we can view the outputdecoder = colibricore.ClassDecoder(classfile)#Output all skipgramsfor pattern in model: if pattern.category() == colibricore.Category.SKIPGRAM: print(pattern.tostring(decoder))
There's a more extensive Python tutorial about all this on the website.
Disclaimer: I'm the author of Colibri Core