How to extract numbers (along with comparison adjectives or ranges)


I would probably approach this as a chunking task and use NLTK's part-of-speech tagger combined with its regular expression chunker. This lets you define a regular expression over the parts of speech of the words in your sentences instead of over the words themselves. For a given sentence, you can do the following:

import nltk

# example sentence
sent = 'send me a table with a price greater than $100'

The first thing I would do is modify your sentences slightly so that you don't confuse the part-of-speech tagger too much. Here are some examples of changes you can make with very simple regular expressions; you can experiment to see if others help:

$10 -> 10 dollars
200lbs -> 200 lbs
5-7 -> 5 - 7 OR 5 to 7
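The rewrites listed above can be sketched with the standard `re` module. This is only a minimal illustration: the function name `normalize` and the exact patterns are my own assumptions, and the rules are deliberately naive (for example, they would also split `3rd` into `3 rd` and rewrite the hyphens in a date such as `2020-01-05`), so adjust them to your data.

```python
import re

def normalize(sentence: str) -> str:
    """Rewrite a few numeric forms so the POS tagger sees plain tokens.

    Hypothetical helper; the patterns are illustrative, not exhaustive.
    """
    # $10 -> 10 dollars (also accepts a decimal part such as $9.99)
    sentence = re.sub(r'\$(\d+(?:\.\d+)?)', r'\1 dollars', sentence)
    # 5-7 -> 5 to 7 (a hyphen between two digits is read as a range)
    sentence = re.sub(r'(\d)\s*-\s*(\d)', r'\1 to \2', sentence)
    # 200lbs -> 200 lbs (insert a space between a digit and a letter)
    sentence = re.sub(r'(\d)([A-Za-z])', r'\1 \2', sentence)
    return sentence

print(normalize('send me a table with a price greater than $100'))
print(normalize('weight under 200lbs, length 5-7'))
```

The order of the substitutions matters slightly: the currency rule runs first so that the digit-letter rule never sees the `$` form.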

So we get:

sent = 'send me a table with a price greater than 100 dollars'

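Now you can get the parts of speech from your sentence (the tagger model has to be downloaded once with `nltk.download`; the exact resource name depends on your NLTK version):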

sent_pos = nltk.pos_tag(sent.split())
print(sent_pos)
[('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN'), ('with', 'IN'), ('a', 'DT'), ('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'), ('100', 'CD'), ('dollars', 'NNS')]

We can now create a chunker which will chunk your POS tagged text according to a (relatively) simple regular expression:

grammar = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
parser = nltk.RegexpParser(grammar)

This defines a parser with a grammar that chunks numerical phrases (what we'll call your phrase type). It defines your numerical phrase as: an optional noun, followed by an optional adverb, followed by a comparative adjective, a preposition, a number, and an optional noun. This is just a suggestion for how you may want to define your phrases, but I think that this will be much simpler than using a regular expression on the words themselves.
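To see what the optional elements of this grammar do, you can feed the chunker hand-tagged input, which avoids running the tagger at all. The sketch below is only an illustration: the helper name `numerical_phrases` and the example token lists are made up, and the tags are assigned by hand rather than produced by `nltk.pos_tag`.

```python
import nltk

GRAMMAR = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
parser = nltk.RegexpParser(GRAMMAR)

def numerical_phrases(tagged):
    """Return the words of every NumericalPhrase chunk in a tagged sentence.

    Hypothetical helper; `tagged` is a list of (word, tag) tuples.
    """
    tree = parser.parse(tagged)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees()
        if subtree.label() == 'NumericalPhrase'
    ]

# Hand-tagged examples (tags chosen manually for illustration).
full = [('weight', 'NN'), ('much', 'RB'), ('less', 'JJR'),
        ('than', 'IN'), ('200', 'CD'), ('lbs', 'NNS')]
bare = [('more', 'JJR'), ('than', 'IN'), ('5', 'CD')]
none = [('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN')]

for example in (full, bare, none):
    print(numerical_phrases(example))
```

The first example uses every slot in the pattern, the second only the three mandatory tags, and the third contains no comparative at all, so it should yield no chunks.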

To get your phrases you can do:

print(parser.parse(sent_pos))
(S
  send/VB
  me/PRP
  a/DT
  table/NN
  with/IN
  a/DT
  (NumericalPhrase price/NN greater/JJR than/IN 100/CD dollars/NNS))

Or to get only your phrases you can do:

print([tree.leaves() for tree in parser.parse(sent_pos).subtrees() if tree.label() == 'NumericalPhrase'])
[[('price', 'NN'),
  ('greater', 'JJR'),
  ('than', 'IN'),
  ('100', 'CD'),
  ('dollars', 'NNS')]]
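If you need the comparison in a structured form rather than as a list of tagged words, one possible follow-up step is to map the leaves of each phrase to named fields. This is a rough sketch under my own assumptions: the function name `phrase_to_record` and the field names (`attribute`, `comparator`, `value`, `unit`) are hypothetical, and it assumes a noun before the number names the thing being compared while a noun after it is the unit.

```python
def phrase_to_record(leaves):
    """Map the (word, tag) leaves of one NumericalPhrase to a dict.

    Hypothetical helper; field names are illustrative only.
    """
    record = {'attribute': None, 'comparator': None, 'value': None, 'unit': None}
    for word, tag in leaves:
        if tag == 'CD':
            try:
                record['value'] = float(word)
            except ValueError:
                # Spelled-out numbers such as "five" are kept as text.
                record['value'] = word
        elif tag == 'JJR':
            record['comparator'] = word
        elif tag in ('NN', 'NNS'):
            # A noun seen before the number is the attribute, after it the unit.
            key = 'unit' if record['value'] is not None else 'attribute'
            record[key] = word
    return record

leaves = [('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'),
          ('100', 'CD'), ('dollars', 'NNS')]
print(phrase_to_record(leaves))
```

Adverbs and the preposition are ignored here; if modifiers such as "much" or "slightly" matter for your use case, you would add a field for the `RB` tag.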