2.1 spaCy Basics
This post explains the basics of the spaCy library used for NLP:
- Installation and setup
- Understanding the spacy pipeline
- Tokenization
- Part-of-Speech (POS) Tagging
- Dependencies
- Additional Token Attributes
!pip install -U spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
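To confirm which model is loaded, you can check its metadata; nlp.meta is a standard spaCy attribute (shown here as a quick sanity check, not part of the original walkthrough):
# The model's metadata dictionary records its name and version
print(nlp.meta['name'], nlp.meta['version'])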
Create a Doc object and print the different attributes of each token:
### Create a doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
### Print each token separately
for token in doc:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
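For instance, the first row of the output shows Tesla tagged as a proper noun (PROPN) acting as the nominal subject (nsubj) of the sentence; exact labels can vary with the model version.
# First row of the formatted output (roughly):
#      Tesla      PROPN      nsubj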
When we run nlp, our text enters a processing pipeline that first breaks the text down and then performs a series of operations to tag, parse and describe the data. Image source: https://spacy.io/usage/spacy-101#pipelines
NAME | COMPONENT | CREATES | DESCRIPTION |
---|---|---|---|
tokenizer | Tokenizer | Doc | Segment text into tokens. |
tagger | Tagger | Token.tag | Assign part-of-speech tags. |
parser | DependencyParser | Token.head, Token.dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
ner | EntityRecognizer | Doc.ents, Token.ent_iob, Token.ent_type | Detect and label named entities. |
lemmatizer | Lemmatizer | Token.lemma | Assign base forms. |
textcat | TextCategorizer | Doc.cats | Assign document labels. |
custom | custom components | Doc._.xxx, Token._.xxx, Span._.xxx | Assign custom attributes, methods or properties. |
The loaded pipeline can be inspected directly:
nlp.pipeline
nlp.pipe_names
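Each entry in nlp.pipeline is a (name, component) tuple. As a quick sketch (the exact components vary across spaCy and model versions):
# Print each pipeline component's registered name and its class
for name, component in nlp.pipeline:
    print(name, type(component).__name__)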
Tokenization breaks raw text into small chunks such as words, punctuation and sentences, called tokens. These tokens help in understanding the context and in building NLP models; analyzing the sequence of words helps interpret the meaning of the text.
Source: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
doc2 = nlp(u"Apple isn't looking into buying startups.")
for token in doc2:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
Things to note:
- spaCy recognizes the root verb and the negation, so it splits isn't into two tokens (see the token list below)
- Punctuation such as the final period is assigned its own token
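For instance, listing the token texts makes the split visible; the commented output is what a small English model typically produces:
# Show how spaCy splits the contraction and the final period
print([token.text for token in doc2])
# ['Apple', 'is', "n't", 'looking', 'into', 'buying', 'startups', '.']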
type(doc2)
doc2 is a spaCy Doc object and contains information about each token in the text.
In the above example we see that the output clearly labels Apple as a proper noun, looking as a verb, and so on. These are parts of speech.
For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging
We also looked at the syntactic dependencies assigned to each token. Tesla is identified as an nsubj, the nominal subject of the sentence.
For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
A good explanation of typed dependencies can be found here
spacy.explain() gives the full name of a tag used in spaCy:
spacy.explain('PROPN')
spacy.explain('nsubj')
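A handy pattern (a small sketch, not from the original post) is to print the explanation next to each token's coarse tag:
# Print each token's POS tag alongside its human-readable name
for token in doc2:
    print(f'{token.text:<10} {token.pos_:<6} {spacy.explain(token.pos_)}')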
Tag | Description | doc[0] |
---|---|---|
.text | The original word text | Tesla |
.lemma_ | The base form of the word | tesla |
.pos_ | The simple part-of-speech tag | PROPN / proper noun |
.tag_ | The detailed part-of-speech tag | NNP / noun, proper singular |
.shape_ | The word shape – capitalization, punctuation, digits | Xxxxx |
.is_alpha | Is the token an alpha character? | True |
.is_stop | Is the token part of a stop list, i.e. the most common words of the language? | False |
print(doc2[4].text)
print(doc2[4].lemma_)
print(doc2[0].text + ': ' + doc2[0].shape_)
print(doc[5].text + ': ' + doc[5].shape_)
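Putting several of these attributes together, a minimal sketch that walks the first doc:
# Inspect a handful of attributes for every token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)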
Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form Doc[start:stop].
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc3[16:30]
print(life_quote)
type(life_quote)
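A Span is a lightweight view into the parent Doc rather than a copy; its start and end attributes (standard Span attributes) record the token offsets it covers:
# The span remembers where it sits inside doc3
print(life_quote.start, life_quote.end)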
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents.
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
doc4[6].is_sent_start
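Here doc4[6] ("This") begins the second sentence, so is_sent_start returns True. To see the flag for every token (a small sketch; non-initial tokens may show False or None depending on the spaCy version):
# Print each token with its sentence-start flag
for token in doc4:
    print(token.text, token.is_sent_start)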