Converting a String to a List of Words?

python string list words text-segmentation

Try this:

    import re

    mystr = 'This is a string, with words!'
    # Raw string so that \w reaches the regex engine unchanged
    wordList = re.sub(r"[^\w]", " ", mystr).split()

How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

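So in our case:

pattern is [^\w], which matches any single character that is not a word character.

\w means any word character. With the re.ASCII flag it is exactly the character set [a-zA-Z0-9_]; in Python 3 without that flag it also matches Unicode letters and digits.

In the ASCII case that is a to z, A to Z, 0 to 9 and the underscore.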
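A quick way to see what \w covers is to run the same pattern with and without re.ASCII. This is only a small check, and the sample text is made up for illustration:

```python
import re

# Hypothetical sample text: "café_1 naïve-42" written with escapes
text = "caf\u00e9_1 na\u00efve-42"

# Default for str patterns in Python 3: \w includes Unicode letters and digits
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

With re.ASCII the accented letters count as non-word characters, so they split the words they appear in.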

So we match every non-word character and replace it with a space.

Then we call split(), which splits the string on runs of whitespace and returns a list.

So 'hello-world' becomes 'hello world' with re.sub, and then ['hello', 'world'] after split().
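The two steps can be wrapped in a small helper. This is just a sketch, and to_words is a name made up here, not something from the re module:

```python
import re

def to_words(text):
    """Replace every non-word character with a space, then split on whitespace."""
    cleaned = re.sub(r"[^\w]", " ", text)  # 'hello-world' -> 'hello world'
    return cleaned.split()                 # 'hello world' -> ['hello', 'world']

print(to_words('hello-world'))
print(to_words('This is a string, with words!'))
```

Note that this also breaks up contractions: "don't" comes out as 'don' and 't', because the apostrophe is not a word character.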

Let me know if you have any questions.

I think this is the simplest way for anyone else stumbling on this post given the late response:

    >>> string = 'This is a string, with words!'
    >>> string.split()
    ['This', 'is', 'a', 'string,', 'with', 'words!']
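As the output shows, plain split() leaves punctuation attached ('string,' and 'words!'). If that matters, one possible follow-up is to strip punctuation from the ends of each piece. This is only a sketch, and split_and_strip is a made-up name:

```python
import string

def split_and_strip(text):
    """Split on whitespace, then trim ASCII punctuation from both ends of each piece."""
    pieces = (piece.strip(string.punctuation) for piece in text.split())
    # Drop pieces that were punctuation only, e.g. a standalone '--'
    return [piece for piece in pieces if piece]

print(split_and_strip('This is a string, with words!'))
```

Because only the ends are trimmed, an apostrophe inside a word such as "don't" is kept.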

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

    >>> import nltk
    >>> paragraph = u"Hi, this is my first sentence. And this is my second."
    >>> sentences = nltk.sent_tokenize(paragraph)
    >>> for sentence in sentences:
    ...     nltk.word_tokenize(sentence)
    [u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
    [u'And', u'this', u'is', u'my', u'second', u'.']
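If installing NLTK is not an option, a very rough stand-in using only the standard library is a regex that returns words and punctuation marks as separate tokens. This is not NLTK's algorithm and will differ from it on contractions, abbreviations and similar cases; rough_tokenize is a made-up name:

```python
import re

def rough_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    # \w+      -> a run of word characters
    # [^\w\s]  -> one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_tokenize("Hi, this is my first sentence."))
```

For this simple sentence the tokens line up with the NLTK output above, but that should not be expected in general.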
