Writing a tokenizer in Python


As tokenizing is easy in Python, I'm wondering what your module is planned to provide. I mean, when starting a piece of software, a good design comes rather from thinking about the usage scenarios than from considering data structures first.

Your examples of expected output are a bit confusing. I assume you want each tokenizer to return its name on the left side and a list of tokens on the right side. I played a bit to achieve similar results, but using lists for easier handling:

import re

# some tokenizers
def tokzr_WORD(txt): return ('WORD', re.findall(r'(?ms)\W*(\w+)', txt))  # split words
def tokzr_SENT(txt): return ('SENTENCE', re.findall(r'(?ms)\s*(.*?(?:\.|\?|!))', txt))  # split sentences
def tokzr_QA(txt):
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append(m.groupdict()[k])
    return ('QA', l_qa)
def tokzr_QA_non_canonical(txt):  # Note: not supported by tokenize_recursively() as not canonical.
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append((k, m.groupdict()[k]))
    return l_qa

dict_tokzr = {  # control string: tokenizer function
    'WORD'    : tokzr_WORD,
    'SENTENCE': tokzr_SENT,
    'QA'      : tokzr_QA,
}

# the core function
def tokenize_recursively(l_tokzr, work_on, lev=0):
    if isinstance(work_on, basestring):
        ctrl, work_on = dict_tokzr[l_tokzr[0]](work_on)  # tokenize
    else:
        ctrl, work_on = work_on[0], work_on[1:]  # get right part
    ret = [ctrl]
    if len(l_tokzr) == 1:
        ret.append(work_on)  # add right part
    else:
        for wo in work_on:  # dive into tree
            t = tokenize_recursively(l_tokzr[1:], wo, lev + 1)
            ret.append(t)
    return ret

# just for printing
def nestedListLines(aList, ind='    ', d=0):
    """ Returns multi-line string representation of \param aList.  Use \param ind to indent per level. """
    sRet = '\n' + d * ind + '['
    nested = 0
    for i, e in enumerate(aList):
        if i:
            sRet += ', '
        if type(e) == type(aList):
            sRet += nestedListLines(e, ind, d + 1)
            nested = 1
        else:
            sRet += '\n' + (d + 1) * ind + repr(e) if nested else repr(e)
    sRet += '\n' + d * ind + ']' if nested else ']'
    return sRet

# main()
inp1 = """
    * Question: I want try something.  Should I?
    * Answer  : I'd assume so.  Give it a try.
"""
inp2 = inp1 + 'Q: What is a good way to achieve this?  A: I am not so sure. I think I will use Python.'

print repr(tokzr_WORD(inp1))
print repr(tokzr_SENT(inp1))
print repr(tokzr_QA(inp1))
print repr(tokzr_QA_non_canonical(inp1))  # Really this way?
print

for ctrl, inp in [  # example control sequences
    ('SENTENCE-WORD', inp1),
    ('QA-SENTENCE', inp2)
]:
    res = tokenize_recursively(ctrl.split('-'), inp)
    print nestedListLines(res)
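Just to make the canonical (name, token list) shape concrete, a call to the WORD tokenizer above would return something like this (result written out by hand, so treat it as illustrative):

>>> tokzr_WORD("Hello world!")
('WORD', ['Hello', 'world'])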

Btw., Python/Lib/tokenize.py (for Python code itself) might be worth a look to see how such things can be handled.
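For illustration, here is a minimal sketch of how the standard-library tokenize module reports tokens (Python 2 syntax to match the code above; the sample source string is made up):

import tokenize
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO instead

source = "total = price * (1 + tax_rate)  # compute gross amount\n"

# generate_tokens() yields (type, string, start, end, line) tuples
for tok_type, tok_string, start, end, line in tokenize.generate_tokens(StringIO(source).readline):
    print tokenize.tok_name[tok_type], repr(tok_string)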


If I understand the question correctly, then I do think you should reinvent the wheel. I would implement state machines for the different types of tokenization you want and use Python dictionaries for saving the tokens.

http://en.wikipedia.org/wiki/Finite-state_machine

Here is an example state machine that takes a sentence with spaces and prints out the words. Of course, you could do this specific example in easier ways, but with state machines in general you get linear-time performance and can customize them easily.

# assumed initial setup (not shown in the original snippet)
text = "this is a sentence"
state = "start"
word = []
i = 0

while 1:
    if state == "start":
        if i == len(text):
            state = "end"
        elif text[i] == " ":
            state = "new word"
            i = i - 1
        else:
            word.append(text[i])
    elif state == "new word":
        print(''.join(word))
        del word[:]
        state = "start"
    elif state == "end":
        print(''.join(word))
        break
    i = i + 1
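With the sample sentence assumed in the setup above, this prints "this", "is", "a" and "sentence", each on its own line.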

http://docs.python.org/2/library/collections.html#collections.Counter

Then you can, for example, use this Python data structure for saving your tokens. I think it's perfectly suited to your needs!
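As a rough sketch of that idea (the tokenization here is just a plain split(), standing in for the state machine above), counting tokens with Counter could look like this:

from collections import Counter

text = "the cat sat on the mat because the mat was warm"
tokens = text.split()            # stand-in for a real tokenizer
counts = Counter(tokens)         # token -> number of occurrences

print(counts['the'])             # 3
print(counts.most_common(2))     # [('the', 3), ('mat', 2)]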

Hope this was of some help.