Installation and setup

pip install -U spacy

Downloading the spaCy English model

!python -m spacy download en_core_web_sm
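
If shell commands aren't available in your environment, the model can also be downloaded from within Python. A minimal sketch using spacy.cli.download:

import spacy

# Download the small English pipeline programmatically
spacy.cli.download('en_core_web_sm')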

Loading spaCy

import spacy
nlp = spacy.load('en_core_web_sm')

Creating a Doc object and printing the attributes of each token

# Create a doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# Print each token separately
for token in doc:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
     Tesla       NOUN      nsubj
        is        AUX        aux
   looking       VERB       ROOT
        at        ADP       prep
    buying       VERB      pcomp
      U.S.      PROPN   compound
   startup       NOUN       dobj
       for        ADP       prep
         $        SYM   quantmod
         6        NUM   compound
   million        NUM       pobj

Understanding the spaCy pipeline

When we run nlp, our text enters a processing pipeline that first breaks the text into tokens and then applies a series of components to tag, parse and describe the data. (Pipeline diagram: https://spacy.io/usage/spacy-101#pipelines)

NAME        COMPONENT          CREATES                                             DESCRIPTION
tokenizer   Tokenizer          Doc                                                 Segment text into tokens.
tagger      Tagger             Token.tag                                           Assign part-of-speech tags.
parser      DependencyParser   Token.head, Token.dep, Doc.sents, Doc.noun_chunks   Assign dependency labels.
ner         EntityRecognizer   Doc.ents, Token.ent_iob, Token.ent_type             Detect and label named entities.
lemmatizer  Lemmatizer         Token.lemma                                         Assign base forms.
textcat     TextCategorizer    Doc.cats                                            Assign document labels.
custom      custom components  Doc._.xxx, Token._.xxx, Span._.xxx                  Assign custom attributes, methods or properties.
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x23264bfa220>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x23264bfae80>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x23264a554a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x23264cb5780>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x23264cc8780>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x23264a55350>)]
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
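
Components you don't need can be disabled to speed up processing. A minimal sketch using nlp.select_pipes (the spaCy v3 replacement for disable_pipes):

# Temporarily run the pipeline without the parser and NER
with nlp.select_pipes(disable=['parser', 'ner']):
    print(nlp.pipe_names)  # ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']
    doc_light = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
print(nlp.pipe_names)      # all components restored outside the with-block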

Tokenization

Tokenization breaks raw text into smaller chunks, such as words and sentences, called tokens. These tokens help in understanding the context and in developing NLP models; analyzing the sequence of words lets us interpret the meaning of the text.

Source: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4

doc2 = nlp(u"Apple isn't  looking into buying startups.")
for token in doc2:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
     Apple      PROPN      nsubj
        is        AUX        aux
       n't       PART        neg
                SPACE        dep
   looking       VERB       ROOT
      into        ADP       prep
    buying       VERB      pcomp
  startups       NOUN       dobj
         .      PUNCT      punct

Things to note

  • spaCy recognizes the root verb and the negation, hence it splits isn't into two tokens
  • Spaces and the period are assigned their own tokens (a further illustration of these rules is sketched below)
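
To see these rules on a fresh example, here is a quick sketch (the sentence is the illustration used in spaCy's own docs):

doc_tok = nlp(u"We're moving to L.A.!")
print([token.text for token in doc_tok])
# Expected: ['We', "'re", 'moving', 'to', 'L.A.', '!']
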
type(doc2)
spacy.tokens.doc.Doc

doc2 is a spaCy Doc object and contains information about each token in the text.
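
Because a Doc behaves like a sequence of Token objects, it supports len() and indexing. A quick sketch:

print(len(doc2))   # 9 tokens, including the extra space and the period
print(doc2[0])     # first token: Apple
print(doc2[-1])    # last token: the period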

Part-of-Speech (POS) Tagging

In the example above, the output clearly labels Apple as a proper noun, looking as a verb, etc. These are parts of speech.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging
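
POS tags are often aggregated into frequency counts. A minimal sketch using the standard Doc.count_by method with spacy.attrs.POS:

from spacy import attrs

# Count how many tokens carry each coarse POS tag in the Tesla sentence
pos_counts = doc.count_by(attrs.POS)
for pos_id, count in sorted(pos_counts.items()):
    # The id is a key into the vocab's string store
    print(f'{doc.vocab[pos_id].text:>6}: {count}')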

Dependencies

We also looked at the syntactic dependencies assigned to each token. Tesla is identified as an nsubj or the nominal subject of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
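
Every token is linked to a head token, so the dependency tree can be walked directly. A quick sketch using the standard token.head attribute:

# Show which head each token attaches to, and with which relation
for token in doc:
    print(f'{token.text:>10} --{token.dep_}--> {token.head.text}')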

spacy.explain() gives the full name of a tag used in spaCy

spacy.explain('PROPN')
'proper noun'
spacy.explain('nsubj')
'nominal subject'
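
spacy.explain also understands fine-grained tags such as NNP, so a readable gloss can be printed next to every token. A quick sketch:

for token in doc:
    print(f'{token.text:>10} {token.tag_:>5}  {spacy.explain(token.tag_)}')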

Additional Token Attributes

Attribute   Description                                                                     doc[0] example
.text       The original word text                                                          Tesla
.lemma_     The base form of the word                                                       tesla
.pos_       The simple part-of-speech tag                                                   PROPN / proper noun
.tag_       The detailed part-of-speech tag                                                 NNP / noun, proper singular
.shape_     The word shape – capitalization, punctuation, digits                            Xxxxx
.is_alpha   Is the token an alpha character?                                                True
.is_stop    Is the token part of a stop list, i.e. the most common words of the language?   False
print(doc2[4].text)
print(doc2[4].lemma_)
looking
look
print(doc2[0].text + ': ' + doc2[0].shape_)
print(doc[5].text + ': ' + doc[5].shape_)
Apple: Xxxxx
U.S.: X.X.
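
Tokens expose many more boolean flags beyond the ones in the table; all of the attributes below are standard Token attributes. A quick sketch:

print(doc2[8].text, doc2[8].is_punct)   # '.' -> True
print(doc2[3].text, doc2[3].is_space)   # the stray space -> True
print(doc[9].text, doc[9].like_num)     # '6' -> True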
 

Spans

Large Doc objects can be hard to work with at times. A Span is a slice of a Doc object in the form Doc[start:stop].

doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc3[16:30]
print(life_quote)
"Life is what happens to us while we are making other plans"
type(life_quote)
spacy.tokens.span.Span
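
A Span remembers its token offsets into the parent Doc, and spans can also be built from character offsets with the standard Doc.char_span method. A quick sketch:

print(life_quote.start, life_quote.end)  # token offsets into doc3: 16 30
span = doc3.char_span(0, 8)              # returns None if the offsets don't align with token boundaries
print(span)                              # Although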
 

Sentences

Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents.

doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
This is the first sentence.
This is another sentence.
This is the last sentence.
doc4[6].is_sent_start
True
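
Doc.sents is a generator, so to index individual sentences convert it to a list first. A quick sketch:

sents = list(doc4.sents)
print(len(sents))                    # 3
print(sents[1])                      # This is another sentence.
print(sents[1].start, sents[1].end)  # token offsets of the second sentence: 6 12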