How can I split a text into sentences? (python)

How can I split a text into sentences?


The Natural Language Toolkit (nltk.org) has what you need. According to this group posting, the following does it:

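    import nltk.data

    # Load the pre-trained Punkt sentence tokenizer for English
    # (assumes the Punkt data is already installed, e.g. via nltk.download('punkt')).
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    with open("test.txt") as fp:
        data = fp.read()

    print('\n-----\n'.join(tokenizer.tokenize(data)))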

(I haven't tried it!)


The function below can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

    # -*- coding: utf-8 -*-
    import re

    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov)"

    def split_into_sentences(text):
        text = " " + text + "  "
        text = text.replace("\n", " ")
        # Protect non-terminal periods with <prd>; where an acronym or suffix
        # is followed by a typical sentence starter, mark a boundary with <stop>.
        text = re.sub(prefixes, r"\1<prd>", text)
        text = re.sub(websites, r"<prd>\1", text)
        if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
        text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
        text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
        text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
        text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
        text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
        text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
        # Move sentence-ending punctuation outside closing quotes.
        if "”" in text: text = text.replace(".”", "”.")
        if '"' in text: text = text.replace('."', '".')
        if "!" in text: text = text.replace('!"', '"!')
        if "?" in text: text = text.replace('?"', '"?')
        # Mark every remaining terminator as a boundary, then restore protected periods.
        text = text.replace(".", ".<stop>")
        text = text.replace("?", "?<stop>")
        text = text.replace("!", "!<stop>")
        text = text.replace("<prd>", ".")
        sentences = text.split("<stop>")
        sentences = sentences[:-1]  # drop the trailing piece after the last marker
        sentences = [s.strip() for s in sentences]
        return sentences
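To make the placeholder trick in the function above easier to follow, here is a stripped-down sketch of the same idea next to a naive split. The names `ABBREVIATIONS`, `naive_split` and `masked_split` and the tiny rule set are my own illustration, not part of the answer above, and the sketch covers far fewer cases than the full function.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration only.
ABBREVIATIONS = r"\b(Mr|Mrs|Ms|Dr|Jr|Inc)[.]"

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def masked_split(text):
    """Protect abbreviation periods with a placeholder, then split."""
    # 1. Replace the period of each known abbreviation with <prd>.
    text = re.sub(ABBREVIATIONS, r"\1<prd>", text)
    # 2. Treat every remaining terminator as a sentence boundary.
    text = re.sub(r"([.!?])", r"\1<stop>", text)
    # 3. Restore the protected periods.
    text = text.replace("<prd>", ".")
    # 4. Split on the boundary markers and drop empty pieces.
    return [s.strip() for s in text.split("<stop>") if s.strip()]

sample = "Dr. Adams met Mr. Smith. They talked for an hour!"

print(naive_split(sample))
# ['Dr.', 'Adams met Mr.', 'Smith.', 'They talked for an hour!']

print(masked_split(sample))
# ['Dr. Adams met Mr. Smith.', 'They talked for an hour!']
```

The naive version breaks on every abbreviation, which is exactly the kind of edge case the longer rule list (prefixes, suffixes, acronyms, websites) is there to catch.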


Instead of using a regex to split the text into sentences, you can also use the NLTK library.

    >>> from nltk import tokenize
    >>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
    >>> tokenize.sent_tokenize(p)
    ['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052

