3.5 Sentence Segmentation
This post explains sentence segmentation in spaCy.
In spaCy Basics we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules.
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)
print(doc[1])

It's important to note that `doc.sents` is a generator. Individual tokens can be grabbed by index, but a sentence cannot:

print(doc.sents[1])  # raises TypeError: 'generator' object is not subscriptable
However, you can build a sentence collection by running `doc.sents` and saving the result to a list:
doc_sents = [sent for sent in doc.sents]
doc_sents
**NOTE**: `list(doc.sents)` also works. We show a list comprehension as it allows you to pass in conditionals.
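For example, a conditional inside the comprehension lets you keep only certain sentences; here's a minimal sketch (the filter condition is arbitrary):

first_only = [sent for sent in doc.sents if 'first' in sent.text]
print(first_only)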
print(doc_sents[1])

type(doc_sents[1])  # each sentence is a spacy.tokens.span.Span object

print(doc_sents[1].start, doc_sents[1].end)  # token offsets into the Doc
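Since `start` and `end` are token offsets into the Doc, slicing the Doc with them recovers the same sentence (a quick sanity check):

print(doc[doc_sents[1].start:doc_sents[1].end])  # This is another sentence.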
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)
Notice we haven't run `doc2.sents`, and yet `token.is_sent_start` was set to True on two tokens in the Doc.
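One way to see which tokens carry the flag is to filter on it (note that the value reported for the very first token varies across spaCy versions, so the exact output may differ):

sentence_starts = [token.text for token in doc2 if token.is_sent_start]
print(sentence_starts)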
Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment.
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc3.sents:
print(sent)
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe("set_custom_boundaries", before='parser')
nlp.pipe_names
The new rule has to run before the document is parsed. Here we can either pass the argument `before='parser'` or `first=True`.
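For reference, the `first=True` placement would look like this; a minimal sketch using a freshly loaded pipeline (here called `nlp_alt`) so the component isn't added twice:

nlp_alt = spacy.load('en_core_web_sm')
nlp_alt.add_pipe("set_custom_boundaries", first=True)
nlp_alt.pipe_names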
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
print(sent)
And yet the new rule doesn't apply to `doc3`, which was parsed before the component was added:

for sent in doc3.sents:
    print(sent)

doc3[7]  # the token following the semicolon

doc3[7].is_sent_start = True  # raises ValueError

spaCy refuses to change the tag after the document is parsed, to prevent inconsistencies in the data.
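To see the error without halting a script, wrap the assignment in a try/except; the exact message varies by spaCy version:

try:
    doc3[7].is_sent_start = True
except ValueError as e:
    print(e)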
nlp = spacy.load('en_core_web_sm') # reset to the original
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."
# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
In spaCy v2 this kind of change was made with the now-removed `SentenceSegmenter` class, which took a generator function that yielded sentence spans. While the function `split_on_newlines` can be named anything we want, it's important to use the name `sbd` for the SentenceSegmenter:

#from spacy.pipeline import SentenceSegmenter
#
#def split_on_newlines(doc):
#    start = 0
#    seen_newline = False
#    for word in doc:
#        if seen_newline:
#            yield doc[start:word.i]
#            start = word.i
#            seen_newline = False
#        elif word.text.startswith('\n'):  # handles multiple occurrences
#            seen_newline = True
#    yield doc[start:]  # handles the last group of tokens
#
#sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
#nlp.add_pipe(sbd)

In spaCy v3 the same behavior is written as a custom pipeline component:
from spacy.language import Language

@Language.component('split_on_newlines')
def split_on_newlines(doc):
    for tok in doc[1:]:
        # a token starts a sentence exactly when the previous token begins with a newline
        tok.is_sent_start = doc[tok.i - 1].text.startswith('\n')
    return doc
nlp = spacy.load('en_core_web_sm') # reset to the original
nlp.add_pipe('split_on_newlines', before='parser')
#doc = nlp("1\n\n3")
for sent in doc.sents:
print([token.text for token in sent])
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
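To undo the custom segmentation, either reload the model as above or remove the component by name; a minimal sketch:

nlp.remove_pipe('split_on_newlines')
nlp.pipe_names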