
How to break up document by sentences with Spacy


The up-to-date answer is this:

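from __future__ import unicode_literals, print_function
from spacy.lang.en import English  # updated import path

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
# spaCy 3.x takes the component name; spaCy 2.x used nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe('sentencizer')  # updated
doc = nlp(raw_text)
# sent.text is used here in place of the older sent.string attribute
sentences = [sent.text.strip() for sent in doc.sents]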


From spaCy's GitHub support page (older API, from before spacy.en moved to spacy.lang.en):

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]


Answer

import spacy

nlp = spacy.load('en_core_web_sm')
text = 'My first birthday was great. My 2. was even better.'
sentences = [i for i in nlp(text).sents]

Additional info
This assumes that you have already installed the model "en_core_web_sm" on your system. If not, you can easily install it by running the following command in your terminal:

$ python -m spacy download en_core_web_sm

(See the spaCy models documentation for an overview of all available models.)
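If you are unsure whether the model is already present, a quick check before calling spacy.load can save a confusing error. The sketch below relies on the fact that downloaded spaCy models are installed as regular Python packages; the helper name is_installed is made up for this example and is not part of spaCy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found.

    Hypothetical helper for illustration; it only checks that the package
    is importable, not that it is a valid spaCy model.
    """
    return importlib.util.find_spec(package_name) is not None


if not is_installed("en_core_web_sm"):
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

This only tells you that a package with that name exists in the current environment, so make sure you run it with the same interpreter you use for spaCy.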

Depending on your data, this can give better results than using spacy.lang.en.English with the sentencizer alone. One (very simple) comparison example:

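import spacy
from spacy.lang.en import English

# Rule-based pipeline: blank English tokenizer plus the sentencizer
nlp_simple = English()
# spaCy 3.x form; spaCy 2.x used nlp_simple.add_pipe(nlp_simple.create_pipe('sentencizer'))
nlp_simple.add_pipe('sentencizer')

# Pretrained pipeline
nlp_better = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'

for nlp in [nlp_simple, nlp_better]:
    for i in nlp(text).sents:
        print(i)
    print('-' * 20)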

Outputs:

>>> My first birthday was great.
>>> My 2.
>>> was even better.
>>> --------------------
>>> My first birthday was great.
>>> My 2. was even better.
>>> --------------------
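The first block of output shows the weakness of splitting purely on punctuation: the period after "2" is treated as the end of a sentence. To make that behaviour concrete, here is a rough stand-in written with the standard library only. It is not spaCy's implementation, and naive_split is a name invented for this sketch; it just splits after '.', '!' or '?' when whitespace follows.

```python
import re


def naive_split(text):
    """Split text after '.', '!' or '?' followed by whitespace.

    Illustrative only: a simplified imitation of punctuation-based
    sentence splitting, not spaCy's actual sentencizer.
    """
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


text = 'My first birthday was great. My 2. was even better.'
for sentence in naive_split(text):
    print(sentence)
```

Like the sentencizer output above, this breaks the second sentence at "My 2." because it looks only at the punctuation character. The pretrained pipeline keeps that sentence in one piece in the comparison above.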