In spaCy Basics we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)
This is the first sentence.
This is another sentence.
This is the last sentence.

Doc.sents is a generator

It is important to note that doc.sents is a generator. Sentences are yielded one at a time rather than stored in a list, so while you can print the second Doc token with print(doc[1]), you can't ask for the "second Doc sentence" with print(doc.sents[1]):

print(doc[1])
is
print(doc.sents[1])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\VICKY~1.CRA\AppData\Local\Temp/ipykernel_10256/406465868.py in <module>
----> 1 print(doc.sents[1])

TypeError: 'generator' object is not subscriptable

However, you can build a sentence collection by running doc.sents and saving the result to a list:

doc_sents = [sent for sent in doc.sents]
doc_sents
[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

**NOTE**: `list(doc.sents)` also works. We show a list comprehension as it allows you to pass in conditionals.
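For example (a quick sketch, with a made-up condition), a comprehension lets us keep only the sentences we care about:

last_sents = [sent for sent in doc.sents if 'last' in sent.text]
last_sents
[This is the last sentence.]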

print(doc_sents[1])
This is another sentence.
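If you only need one sentence, you don't have to build the whole list. A sketch using the standard library's itertools.islice consumes the generator lazily instead:

from itertools import islice

print(next(islice(doc.sents, 1, 2)))  # skip to the second sentence
This is another sentence.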

sents are Spans

At first glance it looks like each sent contains text from the original Doc object. In fact, each sent is a Span, a lightweight view of the Doc that stores start and end token indices rather than its own copy of the text.

type(doc_sents[1])
spacy.tokens.span.Span
print(doc_sents[1].start, doc_sents[1].end)
6 11
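Since start and end are token offsets into the original Doc, slicing the Doc with them recovers the same Span (a quick check):

print(doc[doc_sents[1].start : doc_sents[1].end])
This is another sentence.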

Adding Rules

In the default pipeline, segmentation is handled by the dependency parser, which relies on end-of-sentence punctuation to decide where sentences begin. We can add rules of our own, but they have to be in place before a Doc object is created, as sentence start tokens are set during that parsing step:

doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)
True  This
False  is
False  a
False  sentence
False  .
True  This
False  is
False  a
False  sentence
False  .
True  This
False  is
False  a
False  sentence
False  .

Notice we haven't run `doc2.sents`, and yet `token.is_sent_start` was already set to True on the first token of each sentence. Segmentation happens when the Doc is parsed, not when `doc.sents` is iterated.

Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment.

doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc3.sents:
    print(sent)
"Management is doing things right; leadership is doing the right things."
-Peter Drucker

To do this, we define a custom pipeline component that marks the token following each semicolon as a sentence start. The `@Language.component` decorator registers the function under a name we can later pass to `nlp.add_pipe`; note that it requires an import:

from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after each semicolon as the start of a new sentence
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc



nlp.add_pipe("set_custom_boundaries", before='parser')

nlp.pipe_names
['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

The new rule has to run before the document is parsed. Here we can either pass the argument before='parser' or first=True.
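As a quick sketch of the alternative (adding a component that is already in the pipeline raises an error, so we remove it first), first=True places our component at the very head of the pipeline, which still keeps it ahead of the parser:

nlp.remove_pipe("set_custom_boundaries")
nlp.add_pipe("set_custom_boundaries", first=True)

nlp.pipe_names
['set_custom_boundaries',
 'tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']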

doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)
"Management is doing things right;
leadership is doing the right things."
-Peter Drucker
Note that doc3, which was created before we added the custom component, keeps its original segmentation. The new rule only applies to Docs created after the component was added to the pipeline:

for sent in doc3.sents:
    print(sent)
"Management is doing things right; leadership is doing the right things."
-Peter Drucker

Why not change the token directly?

Why not simply set the .is_sent_start value to True on existing tokens?

doc3[7]
leadership
doc3[7].is_sent_start = True
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\Users\VICKY~1.CRA\AppData\Local\Temp/ipykernel_10256/1944157106.py in <module>
      1 # Try to change the .is_sent_start attribute:
----> 2 doc3[7].is_sent_start = True

~\Anaconda3\lib\site-packages\spacy\tokens\token.pyx in spacy.tokens.token.Token.is_sent_start.__set__()

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

spaCy refuses to change `is_sent_start` after the document has been parsed, to prevent inconsistent state in the data.
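If you do need to set boundaries by hand, one workaround (a sketch) is to operate on a Doc that hasn't been parsed yet. nlp.make_doc tokenizes the text without running the rest of the pipeline, so the guard above doesn't apply:

unparsed = nlp.make_doc(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
unparsed[7].is_sent_start = True  # no error: this Doc has not been parsed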

Changing the Rules

In some cases we want to replace spaCy's default segmentation with our own set of rules. In this section we'll see how the default pipeline breaks on periods, and then replace this behavior with a component that breaks on linebreaks instead.

nlp = spacy.load('en_core_web_sm')  # reset to the original

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.']
['\n\n', 'This', 'is', 'a', '\n', 'third', 'sentence', '.']
from spacy.language import Language

@Language.component('split_on_newlines')
def split_on_newlines(doc):
    # A token starts a new sentence exactly when the previous token is a linebreak.
    # Setting is_sent_start explicitly on every token (True or False) means the
    # parser cannot override these decisions later in the pipeline.
    for tok in doc[1:]:
        tok.is_sent_start = doc[tok.i - 1].text.startswith('\n')
    return doc

nlp = spacy.load('en_core_web_sm')  # reset to the original pipeline
nlp.add_pipe('split_on_newlines', before='parser')

The function itself can be named anything we like; what matters is the string passed to `@Language.component`, since that registered name is what we hand to `nlp.add_pipe`.

doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']
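To restore the default behavior, a sketch: remove the custom component (or simply reload the model):

nlp.remove_pipe('split_on_newlines')
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']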