
Converting a String to a List of Words?


Try this:

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()
# wordList is ['This', 'is', 'a', 'string', 'with', 'words']

How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.
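
A small sketch of what that description covers: a string as repl, the count limit, a function as repl, and the no-match case. The sample strings and the helper name shout are made up for illustration.

```python
import re

# String replacement: every run of digits becomes '#'.
masked = re.sub(r"\d+", "#", "room 12, floor 3")

# count=1 replaces only the leftmost match.
first_only = re.sub(r"\d+", "#", "room 12, floor 3", count=1)

# repl as a function: it receives the match object and returns the replacement text.
def shout(match):  # hypothetical helper, not part of the re module
    return match.group(0).upper()

loud = re.sub(r"\bwords\b", shout, "a string with words")

# Pattern not found: the string comes back unchanged.
unchanged = re.sub(r"\d+", "#", "no digits here")

print(masked, first_only, loud, unchanged, sep="\n")
```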

So in our case:

The pattern [^\w] matches any single character that is not a word character.

\w means any word character. When the re.ASCII flag is set it is equal to the character set [a-zA-Z0-9_]: a to z, A to Z, 0 to 9 and underscore. For str patterns in Python 3 without that flag, it also matches letters and digits from other scripts.

So we match every non-word character and replace it with a space.

Then we call split(), which splits the string on runs of whitespace and returns a list.

So 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().
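
The two steps can also be run separately to see the intermediate value. This is only a sketch with made-up variable names; the last two lines show how underscores and non-ASCII letters behave, on the assumption that this runs under Python 3 with str patterns.

```python
import re

text = "hello-world"

# Step 1: replace every non-word character with a space.
spaced = re.sub(r"[^\w]", " ", text)

# Step 2: split on whitespace to get the list of words.
words = spaced.split()

print(repr(spaced))  # 'hello world'
print(words)         # ['hello', 'world']

# Underscores count as word characters, so they are not split on.
snake = re.sub(r"[^\w]", " ", "snake_case and kebab-case").split()
print(snake)       # ['snake_case', 'and', 'kebab', 'case']

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters are replaced too.
ascii_only = re.sub(r"[^\w]", " ", "café au lait", flags=re.ASCII).split()
print(ascii_only)  # ['caf', 'au', 'lait']
```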

Let me know if any doubts come up.


I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
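
As the output shows, plain split() leaves punctuation attached ('string,' and 'words!'). If that matters, one possible follow-up, offered as a sketch and not as part of the answer above, is to strip punctuation from each piece with str.strip and string.punctuation:

```python
import string

sentence = 'This is a string, with words!'

# strip() only removes leading/trailing punctuation, so "don't" keeps its apostrophe.
words = [piece.strip(string.punctuation) for piece in sentence.split()]

# Drop pieces that were punctuation only (for example a lone '-').
words = [word for word in words if word]

print(words)  # ['This', 'is', 'a', 'string', 'with', 'words']
```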


Doing this properly is quite complex. If you want to research it, the problem is known as word tokenization. You should look at NLTK to see what others have done, rather than starting from scratch:

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
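
For comparison, a rough standard-library approximation of that output can be sketched with re.findall. This is not what NLTK does internally, and it handles contractions, abbreviations and sentence boundaries far less carefully; the function name rough_tokenize is made up for this example.

```python
import re

def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (i.e. a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize("Hi, this is my first sentence.")
print(tokens)  # ['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
```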