3.1 Part of Speech Basics
This post covers the basics of part-of-speech (POS) tagging with spaCy, a Python library for NLP.
- View token tags
- Coarse-grained Part-of-speech Tags
- Fine-grained Part-of-speech Tags
- Working with POS Tags
- Counting POS Tags
- Fine-grained POS Tag Examples
The challenge of correctly identifying parts of speech is summed up nicely in the spaCy docs:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")
Recall that you can obtain a particular token by its index position.
- To view the coarse POS tag, use `token.pos_`
- To view the fine-grained tag, use `token.tag_`
- To view the description of either type of tag, use `spacy.explain(tag)`
print(doc.text)
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))
We can apply this technique to the entire Doc object:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')
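The nested braces in the f-string above, such as `{token.text:{10}}`, set a minimum field width so the columns line up. A minimal sketch of that padding on a hand-written row (the values are typed in for illustration rather than produced by a model, and `format_row` is a hypothetical helper name):

```python
def format_row(text, pos, tag, description):
    """Pad each field to a fixed minimum width, as in the loop above."""
    # {value:{10}} behaves like {value:10}: strings are left-aligned and padded
    # with spaces up to the given width. Longer values are not truncated.
    return f'{text:{10}} {pos:{8}} {tag:{6}} {description}'

# Hand-written example values, not model output.
row = format_row('fox', 'NOUN', 'NN', 'noun, singular or mass')
print(row)

# Three padded fields plus three separating spaces: 10 + 1 + 8 + 1 + 6 + 1 = 27
print(len(format_row('a', 'b', 'c', '')))
```

Because the width is only a minimum, a long token such as "incomprehensible" pushes the remaining columns to the right instead of being cut off.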
Every token is assigned a POS Tag from the following list:
POS | DESCRIPTION | EXAMPLES |
---|---|---|
ADJ | adjective | *big, old, green, incomprehensible, first* |
ADP | adposition | *in, to, during* |
ADV | adverb | *very, tomorrow, down, where, there* |
AUX | auxiliary | *is, has (done), will (do), should (do)* |
CONJ | conjunction | *and, or, but* |
CCONJ | coordinating conjunction | *and, or, but* |
DET | determiner | *a, an, the* |
INTJ | interjection | *psst, ouch, bravo, hello* |
NOUN | noun | *girl, cat, tree, air, beauty* |
NUM | numeral | *1, 2017, one, seventy-seven, IV, MMXIV* |
PART | particle | *'s, not* |
PRON | pronoun | *I, you, he, she, myself, themselves, somebody* |
PROPN | proper noun | *Mary, John, London, NATO, HBO* |
PUNCT | punctuation | *., (, ), ?* |
SCONJ | subordinating conjunction | *if, while, that* |
SYM | symbol | *$, %, §, ©, +, −, ×, ÷, =, :), 😝* |
VERB | verb | *run, runs, running, eat, ate, eating* |
X | other | *sfpksdpsxmsa* |
SPACE | space | |
POS | POS Description | Fine-grained Tag | Tag Description | Morphology |
---|---|---|---|---|
ADJ | adjective | AFX | affix | Hyph=yes |
ADJ | adjective | JJ | adjective | Degree=pos |
ADJ | adjective | JJR | adjective, comparative | Degree=comp |
ADJ | adjective | JJS | adjective, superlative | Degree=sup |
ADJ | adjective | PDT | predeterminer | AdjType=pdt PronType=prn |
ADJ | adjective | PRP\$ | pronoun, possessive | PronType=prs Poss=yes |
ADJ | adjective | WDT | wh-determiner | PronType=int rel |
ADJ | adjective | WP\$ | wh-pronoun, possessive | Poss=yes PronType=int rel |
ADP | adposition | IN | conjunction, subordinating or preposition | |
ADV | adverb | EX | existential there | AdvType=ex |
ADV | adverb | RB | adverb | Degree=pos |
ADV | adverb | RBR | adverb, comparative | Degree=comp |
ADV | adverb | RBS | adverb, superlative | Degree=sup |
ADV | adverb | WRB | wh-adverb | PronType=int rel |
CONJ | conjunction | CC | conjunction, coordinating | ConjType=coor |
DET | determiner | DT | determiner | |
INTJ | interjection | UH | interjection | |
NOUN | noun | NN | noun, singular or mass | Number=sing |
NOUN | noun | NNS | noun, plural | Number=plur |
NOUN | noun | WP | wh-pronoun, personal | PronType=int rel |
NUM | numeral | CD | cardinal number | NumType=card |
PART | particle | POS | possessive ending | Poss=yes |
PART | particle | RP | adverb, particle | |
PART | particle | TO | infinitival to | PartType=inf VerbForm=inf |
PRON | pronoun | PRP | pronoun, personal | PronType=prs |
PROPN | proper noun | NNP | noun, proper singular | NounType=prop Number=sing |
PROPN | proper noun | NNPS | noun, proper plural | NounType=prop Number=plur |
PUNCT | punctuation | -LRB- | left round bracket | PunctType=brck PunctSide=ini |
PUNCT | punctuation | -RRB- | right round bracket | PunctType=brck PunctSide=fin |
PUNCT | punctuation | , | punctuation mark, comma | PunctType=comm |
PUNCT | punctuation | : | punctuation mark, colon or ellipsis | |
PUNCT | punctuation | . | punctuation mark, sentence closer | PunctType=peri |
PUNCT | punctuation | '' | closing quotation mark | PunctType=quot PunctSide=fin |
PUNCT | punctuation | "" | closing quotation mark | PunctType=quot PunctSide=fin |
PUNCT | punctuation | `` | opening quotation mark | PunctType=quot PunctSide=ini |
PUNCT | punctuation | HYPH | punctuation mark, hyphen | PunctType=dash |
PUNCT | punctuation | LS | list item marker | NumType=ord |
PUNCT | punctuation | NFP | superfluous punctuation | |
SYM | symbol | # | symbol, number sign | SymType=numbersign |
SYM | symbol | \$ | symbol, currency | SymType=currency |
SYM | symbol | SYM | symbol | |
VERB | verb | BES | auxiliary "be" | |
VERB | verb | HVS | forms of "have" | |
VERB | verb | MD | verb, modal auxiliary | VerbType=mod |
VERB | verb | VB | verb, base form | VerbForm=inf |
VERB | verb | VBD | verb, past tense | VerbForm=fin Tense=past |
VERB | verb | VBG | verb, gerund or present participle | VerbForm=part Tense=pres Aspect=prog |
VERB | verb | VBN | verb, past participle | VerbForm=part Tense=past Aspect=perf |
VERB | verb | VBP | verb, non-3rd person singular present | VerbForm=fin Tense=pres |
VERB | verb | VBZ | verb, 3rd person singular present | VerbForm=fin Tense=pres Number=sing Person=3 |
X | other | ADD | email | |
X | other | FW | foreign word | Foreign=yes |
X | other | GW | additional word in multi-word expression | |
X | other | XX | unknown | |
SPACE | space | _SP | space | |
| | | NIL | missing tag | |
For a current list of tags for all languages, visit https://spacy.io/api/annotation#pos-tagging
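The fine-grained table above can be read as a lookup from tag to coarse POS. A minimal sketch with a handful of rows copied from that table; `TAG_TO_POS` and `coarse_pos` are hypothetical names used for illustration and are not part of spaCy:

```python
# A few (fine-grained tag -> coarse POS) pairs copied from the table above.
# This is a deliberately small subset, not the full tag set.
TAG_TO_POS = {
    'JJ': 'ADJ', 'JJR': 'ADJ', 'JJS': 'ADJ',
    'RB': 'ADV', 'RBR': 'ADV', 'RBS': 'ADV',
    'NN': 'NOUN', 'NNS': 'NOUN',
    'NNP': 'PROPN', 'NNPS': 'PROPN',
    'VB': 'VERB', 'VBD': 'VERB', 'VBG': 'VERB',
    'VBN': 'VERB', 'VBP': 'VERB', 'VBZ': 'VERB',
}

def coarse_pos(tag):
    """Return the coarse POS for a fine-grained tag, or None if it is not in the subset."""
    return TAG_TO_POS.get(tag)

for tag in ('JJR', 'VBD', 'NNPS'):
    print(f'{tag:{6}} {coarse_pos(tag)}')
```

In practice the tagger assigns both labels, so `token.pos_` and `token.tag_` can be read directly; a table like this mainly helps when reasoning about how the two tag sets relate.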
In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is "read" in "I read books on NLP" present or past tense? Is "wind" a verb or a noun?
doc = nlp(u'I read books on NLP.')
r = doc[1]
print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
doc = nlp(u'I read a book on NLP.')
r = doc[1]
print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
In the first example, with no other cues to work from, spaCy assumed that "read" was present tense.
In the second example, the present tense form would be "I am reading a book", so spaCy assigned the past tense.
The `Doc.count_by()` method accepts a specific token attribute as its argument and returns a frequency count of that attribute as a dictionary. Keys in the dictionary are the integer IDs of the attribute values, and values are their frequencies. Counts of zero are not included.
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")
# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
This isn't very helpful until you decode the attribute ID:
doc.vocab[83].text
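Decoding one ID at a time gets tedious. As a rough sketch, every key can be translated in one pass with a dictionary comprehension. `decode_counts` is a hypothetical helper, and the IDs and labels in `fake_vocab` are made up so the snippet runs without a model; with a real Doc the lookup would be `lambda k: doc.vocab[k].text`, mirroring the line above.

```python
def decode_counts(counts, lookup):
    """Return a new dict whose keys are lookup(id) instead of integer IDs."""
    return {lookup(attr_id): freq for attr_id, freq in counts.items()}

# Made-up IDs and labels standing in for doc.vocab. Real IDs can differ
# between spaCy versions, so they are better looked up than hard-coded.
fake_vocab = {1: 'ADJ', 2: 'NOUN', 3: 'VERB'}
fake_counts = {1: 3, 2: 3, 3: 1}

readable = decode_counts(fake_counts, lambda attr_id: fake_vocab[attr_id])
print(readable)
```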
Since `POS_counts` is a dictionary, we can obtain its (key, value) pairs with `POS_counts.items()`.
By sorting these pairs we can step through each tag ID and its count in key order.
POS_counts.items()
for k, v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')
# Count the frequencies of different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)
for k, v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')
# Count the frequencies of different dependency labels:
DEP_counts = doc.count_by(spacy.attrs.DEP)
for k, v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')
Here we've shown `spacy.attrs.POS`, `spacy.attrs.TAG` and `spacy.attrs.DEP`.
Refer back to the Vocabulary and Matching lecture from the previous section for a table of Other token attributes.
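If integer keys are not needed, a similar frequency table can be built with Python's `collections.Counter` over the string attributes, which avoids the decoding step. A sketch under assumptions: `Tok` is a hypothetical stand-in for spaCy tokens with hand-written tags (not model output) so the example runs without a model, and `count_attr` is an illustrative helper name.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in exposing the same attribute names as a spaCy token.
Tok = namedtuple('Tok', ['text', 'pos_', 'tag_'])

def count_attr(tokens, attr):
    """Count the values of a string attribute (e.g. 'pos_') across tokens."""
    return Counter(getattr(token, attr) for token in tokens)

# Hand-written tags for illustration only.
sample = [
    Tok('The', 'DET', 'DT'),
    Tok('quick', 'ADJ', 'JJ'),
    Tok('brown', 'ADJ', 'JJ'),
    Tok('fox', 'NOUN', 'NN'),
    Tok('jumped', 'VERB', 'VBD'),
    Tok('.', 'PUNCT', '.'),
]

pos_counts = count_attr(sample, 'pos_')
# most_common() orders the labels from most to least frequent.
for label, freq in pos_counts.most_common():
    print(f'{label:{6}} {freq}')
```

The same function should also work on a real Doc, for example `count_attr(doc, 'pos_')`, since iterating over a Doc yields tokens that expose these attributes.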
These are some grammatical examples (shown in bold) of specific fine-grained tags. We've removed punctuation and rarely used tags:
POS | TAG | DESCRIPTION | EXAMPLE |
---|---|---|---|
ADJ | AFX | affix | The Flintstones were a **pre**-historic family. |
ADJ | JJ | adjective | This is a **good** sentence. |
ADJ | JJR | adjective, comparative | This is a **better** sentence. |
ADJ | JJS | adjective, superlative | This is the **best** sentence. |
ADJ | PDT | predeterminer | Waking up is **half** the battle. |
ADJ | PRP\$ | pronoun, possessive | **His** arm hurts. |
ADJ | WDT | wh-determiner | It's blue, **which** is odd. |
ADJ | WP\$ | wh-pronoun, possessive | We don't know **whose** it is. |
ADP | IN | conjunction, subordinating or preposition | It arrived **in** a box. |
ADV | EX | existential there | **There** is cake. |
ADV | RB | adverb | He ran **quickly**. |
ADV | RBR | adverb, comparative | He ran **quicker**. |
ADV | RBS | adverb, superlative | He ran **fastest**. |
ADV | WRB | wh-adverb | **When** was that? |
CONJ | CC | conjunction, coordinating | The balloon popped **and** everyone jumped. |
DET | DT | determiner | **This** is **a** sentence. |
INTJ | UH | interjection | **Um**, I don't know. |
NOUN | NN | noun, singular or mass | This is a **sentence**. |
NOUN | NNS | noun, plural | These are **words**. |
NOUN | WP | wh-pronoun, personal | **Who** was that? |
NUM | CD | cardinal number | I want **three** things. |
PART | POS | possessive ending | Fred**'s** name is short. |
PART | RP | adverb, particle | Put it **back**! |
PART | TO | infinitival to | I want **to** go. |
PRON | PRP | pronoun, personal | **I** want **you** to go. |
PROPN | NNP | noun, proper singular | **Kilroy** was here. |
PROPN | NNPS | noun, proper plural | The **Flintstones** were a pre-historic family. |
VERB | MD | verb, modal auxiliary | This **could** work. |
VERB | VB | verb, base form | I want to **go**. |
VERB | VBD | verb, past tense | This **was** a sentence. |
VERB | VBG | verb, gerund or present participle | I am **going**. |
VERB | VBN | verb, past participle | The treasure was **lost**. |
VERB | VBP | verb, non-3rd person singular present | I **want** to go. |
VERB | VBZ | verb, 3rd person singular present | He **wants** to go. |