How to extract numbers (along with comparison adjectives or ranges)

python regex nlp nltk spacy

I would probably approach this as a chunking task, using NLTK's part-of-speech tagger together with its regular expression chunker. This lets you define a regular expression over the parts of speech of the words in your sentences rather than over the words themselves. For a given sentence, you can do the following:

import nltk
# example sentence
sent = 'send me a table with a price greater than $100'

The first thing I would do is to modify your sentences slightly so that you don't confuse the part of speech tagger too much. Here are some examples of changes that you can make (with very simple regular expressions) but you can experiment and see if there are others:

$10 -> 10 dollars
200lbs -> 200 lbs
5-7 -> 5 - 7 OR 5 to 7
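
As a minimal sketch of those "very simple regular expressions", something like the following could work. The function name `normalize` and the exact patterns are my own assumptions, not part of NLTK, and they are deliberately naive (for example, they would also split "1st" into "1 st" and rewrite dates like 2020-01-01), so adjust them to your data:

```python
import re

def normalize(sentence):
    """Rewrite currency, unit, and range shorthand into separate tokens.

    Hypothetical helper for illustration; the patterns are intentionally simple.
    """
    # $100 -> 100 dollars (optionally with a decimal part)
    sentence = re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", sentence)
    # 200lbs -> 200 lbs (split a digit from letters glued to it)
    sentence = re.sub(r"(\d)([A-Za-z]+)\b", r"\1 \2", sentence)
    # 5-7 -> 5 to 7 (a hyphen between two digits is read as a range)
    sentence = re.sub(r"(\d)\s*-\s*(\d)", r"\1 to \2", sentence)
    return sentence

print(normalize("send me a table with a price greater than $100"))
# send me a table with a price greater than 100 dollars
```

The order matters here: the currency rule runs first so that the later rules only ever see plain numbers.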

So we get:

sent = 'send me a table with a price greater than 100 dollars'

Now you can get the parts of speech from your sentence:

sent_pos = nltk.pos_tag(sent.split())
print(sent_pos)
[('send', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('table', 'NN'), ('with', 'IN'), ('a', 'DT'), ('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'), ('100', 'CD'), ('dollars', 'NNS')]

We can now create a chunker which will chunk your POS tagged text according to a (relatively) simple regular expression:

grammar = 'NumericalPhrase: {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}'
parser = nltk.RegexpParser(grammar)

This defines a parser with a grammar that chunks numerical phrases (what we'll call your phrase type). It defines your numerical phrase as: an optional noun, followed by an optional adverb, followed by a comparative adjective, a preposition, a number, and an optional noun. This is just a suggestion for how you may want to define your phrases, but I think that this will be much simpler than using a regular expression on the words themselves.
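
Since the question also asks about ranges, here is a rough sketch of how the same grammar might be extended with a second rule. The range rule (`<CD><TO><CD>` plus an optional noun) is my own assumption about how a phrase like "5 to 7 lbs" would be tagged, and the sentences below are tagged by hand, so check the tags that `nltk.pos_tag` actually gives you on your own data:

```python
import nltk

# Two rules under one label: the comparison pattern from above, plus a
# hypothetical range pattern (number, "to", number, optional noun).
grammar = r"""
NumericalPhrase:
    {<NN|NNS>?<RB>?<JJR><IN><CD><NN|NNS>?}
    {<CD><TO><CD><NN|NNS>?}
"""
parser = nltk.RegexpParser(grammar)

def numerical_phrases(tagged):
    """Return the leaves of every NumericalPhrase chunk in a tagged sentence."""
    tree = parser.parse(tagged)
    return [t.leaves() for t in tree.subtrees() if t.label() == 'NumericalPhrase']

# Hand-tagged input, so no tagger model is needed to try this out.
comparison = [('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'),
              ('100', 'CD'), ('dollars', 'NNS')]
weight_range = [('weight', 'NN'), ('of', 'IN'), ('5', 'CD'), ('to', 'TO'),
                ('7', 'CD'), ('lbs', 'NNS')]

print(numerical_phrases(comparison))
print(numerical_phrases(weight_range))
```

The rules are tried in order, and a later rule only chunks tokens that an earlier rule has left unchunked, so the two patterns do not interfere with each other.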

To get your phrases you can do:

print(parser.parse(sent_pos))
(S
  send/VB
  me/PRP
  a/DT
  table/NN
  with/IN
  a/DT
  (NumericalPhrase price/NN greater/JJR than/IN 100/CD dollars/NNS))

Or to get only your phrases you can do:

print([tree.leaves() for tree in parser.parse(sent_pos).subtrees() if tree.label() == 'NumericalPhrase'])
[[('price', 'NN'),
  ('greater', 'JJR'),
  ('than', 'IN'),
  ('100', 'CD'),
  ('dollars', 'NNS')]]
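
If you want something more structured than a list of (word, tag) pairs, one possible follow-up step is sketched below. The helper `summarize` and its dictionary keys are hypothetical names of mine, and it assumes the number is written in digits (a `CD` token such as "hundred" would need extra handling):

```python
def summarize(leaves):
    """Turn one chunk's (word, tag) pairs into a small dict.

    Hypothetical helper; assumes CD tokens are digit strings like '100'.
    """
    result = {"subject": None, "comparator": None, "value": None, "unit": None}
    for word, tag in leaves:
        if tag == "JJR":
            result["comparator"] = word
        elif tag == "CD":
            result["value"] = float(word)
        elif tag in ("NN", "NNS"):
            # A noun before the number is the subject; one after it is the unit.
            key = "subject" if result["value"] is None else "unit"
            result[key] = word
    return result

leaves = [('price', 'NN'), ('greater', 'JJR'), ('than', 'IN'),
          ('100', 'CD'), ('dollars', 'NNS')]
print(summarize(leaves))
# {'subject': 'price', 'comparator': 'greater', 'value': 100.0, 'unit': 'dollars'}
```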

CodeHunter
