2.1 spaCy Basics
This post explains the basics of the spaCy library used for NLP:
- Installation and setup
- Understanding the spacy pipeline
- Tokenization
- Part-of-Speech (POS) Tagging
- Dependencies
- Additional Token Attributes
!pip install -U spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
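To confirm which model is loaded, you can check its metadata; nlp.meta is a standard spaCy attribute (shown here as a quick sanity check, not part of the original walkthrough):
# The model's metadata dictionary records its name and version
print(nlp.meta['name'], nlp.meta['version'])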
Create a Doc object and print the different attributes of each token:
### Create a doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
### Print each token separately
for token in doc:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
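For instance, the first row of the output shows Tesla tagged as a proper noun (PROPN) acting as the nominal subject (nsubj) of the sentence; exact labels can vary with the model version.
# First row of the formatted output (roughly):
#      Tesla      PROPN      nsubj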
When we run nlp, our text enters a processing pipeline that first breaks the text down and then performs a series of operations to tag, parse and describe the data. Image source: https://spacy.io/usage/spacy-101#pipelines
NAME | COMPONENT | CREATES | DESCRIPTION |
---|---|---|---|
tokenizer | Tokenizer | Doc | Segment text into tokens. |
tagger | Tagger | Token.tag | Assign part-of-speech tags. |
parser | DependencyParser | Token.head, Token.dep, Doc.sents, Doc.noun_chunks | Assign dependency labels. |
ner | EntityRecognizer | Doc.ents, Token.ent_iob, Token.ent_type | Detect and label named entities. |
lemmatizer | Lemmatizer | Token.lemma | Assign base forms. |
textcat | TextCategorizer | Doc.cats | Assign document labels. |
custom | custom components | Doc._.xxx, Token._.xxx, Span._.xxx | Assign custom attributes, methods or properties. |
The loaded pipeline can be inspected directly:
nlp.pipeline
nlp.pipe_names
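Each entry in nlp.pipeline is a (name, component) tuple. As a quick sketch (the exact components vary across spaCy and model versions):
# Print each pipeline component's registered name and its class
for name, component in nlp.pipeline:
    print(name, type(component).__name__)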
Tokenization breaks raw text into small chunks such as words, punctuation and sentences, called tokens. These tokens help in understanding the context and in building NLP models; analyzing the sequence of words helps interpret the meaning of the text.
Source: https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
doc2 = nlp(u"Apple isn't looking into buying startups.")
for token in doc2:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}}')
Things to note:
- spaCy recognizes the root verb and the negation, so it splits isn't into two tokens (see the token list below)
- Punctuation such as the final period is assigned its own token
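For instance, listing the token texts makes the split visible; the commented output is what a small English model typically produces:
# Show how spaCy splits the contraction and the final period
print([token.text for token in doc2])
# ['Apple', 'is', "n't", 'looking', 'into', 'buying', 'startups', '.']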
type(doc2)
doc2 is a spaCy Doc object and contains information about each token in the text.
In the above example we see that the output clearly labels Apple as a proper noun, looking as a verb, and so on. These are parts of speech.
For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging
We also looked at the syntactic dependencies assigned to each token. Tesla is identified as an nsubj, the nominal subject of the sentence.
For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
A good explanation of typed dependencies can be found here
spacy.explain() gives the full name of a tag used in spaCy:
spacy.explain('PROPN')
spacy.explain('nsubj')
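A handy pattern (a small sketch, not from the original post) is to print the explanation next to each token's coarse tag:
# Print each token's POS tag alongside its human-readable name
for token in doc2:
    print(f'{token.text:<10} {token.pos_:<6} {spacy.explain(token.pos_)}')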
Tag | Description | doc[0] |
---|---|---|
.text | The original word text | Tesla |
.lemma_ | The base form of the word | tesla |
.pos_ | The simple part-of-speech tag | PROPN / proper noun |
.tag_ | The detailed part-of-speech tag | NNP / noun, proper singular |
.shape_ | The word shape – capitalization, punctuation, digits | Xxxxx |
.is_alpha | Is the token an alpha character? | True |
.is_stop | Is the token part of a stop list, i.e. the most common words of the language? | False |
print(doc2[4].text)
print(doc2[4].lemma_)
print(doc2[0].text + ': ' + doc2[0].shape_)
print(doc[5].text + ': ' + doc[5].shape_)
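Putting several of these attributes together, a minimal sketch that walks the first doc:
# Inspect a handful of attributes for every token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)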
Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form Doc[start:stop].
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc3[16:30]
print(life_quote)
type(life_quote)
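A Span is a lightweight view into the parent Doc rather than a copy; its start and end attributes (standard Span attributes) record the token offsets it covers:
# The span remembers where it sits inside doc3
print(life_quote.start, life_quote.end)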
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents.
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
doc4[6].is_sent_start
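Here doc4[6] ("This") begins the second sentence, so is_sent_start returns True. To see the flag for every token (a small sketch; non-initial tokens may show False or None depending on the spaCy version):
# Print each token with its sentence-start flag
for token in doc4:
    print(token.text, token.is_sent_start)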