Parts of Speech Assessment

For this assessment we'll be using the short story The Tale of Peter Rabbit by Beatrix Potter (1902).
The story is in the public domain; the text file was obtained from Project Gutenberg.

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

1. Create a Doc object from the file peterrabbit.txt

HINT: Use with open('../TextFiles/peterrabbit.txt') as f:

with open('data_files/peterrabbit.txt') as f:
    doc = nlp(f.read())

2. For every token in the third sentence, print the token text, the coarse-grained POS tag, the fine-grained tag (TAG), and the description of the fine-grained tag.


They         PRON   PRP    pronoun, personal
lived        VERB   VBD    verb, past tense
with         ADP    IN     conjunction, subordinating or preposition
their        ADJ    PRP$   pronoun, possessive
Mother       PROPN  NNP    noun, proper singular
in           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
sand         NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
bank         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
underneath   ADP    IN     conjunction, subordinating or preposition
the          DET    DT     determiner
root         NOUN   NN     noun, singular or mass
of           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner

            SPACE         None
very         ADV    RB     adverb
big          ADJ    JJ     adjective
fir          NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
tree         NOUN   NN     noun, singular or mass
.            PUNCT  .      punctuation mark, sentence closer


           SPACE  _SP    None

Solution

sentences = [sent for sent in doc.sents]

for token in sentences[2]:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_)}')

        SPACE     _SP       whitespace
They      PRON      PRP       pronoun, personal
lived     VERB      VBD       verb, past tense
with      ADP       IN        conjunction, subordinating or preposition
their     PRON      PRP$      pronoun, possessive
Mother    PROPN     NNP       noun, proper singular
in        ADP       IN        conjunction, subordinating or preposition
a         DET       DT        determiner
sand      NOUN      NN        noun, singular or mass
-         PUNCT     HYPH      punctuation mark, hyphen
bank      NOUN      NN        noun, singular or mass
,         PUNCT     ,         punctuation mark, comma
underneathADP       IN        conjunction, subordinating or preposition
the       DET       DT        determiner
root      NOUN      NN        noun, singular or mass
of        ADP       IN        conjunction, subordinating or preposition
a         DET       DT        determiner

         SPACE     _SP       whitespace
very      ADV       RB        adverb
big       ADJ       JJ        adjective (English), other noun-modifier (Chinese)
fir       NOUN      NN        noun, singular or mass
-         PUNCT     HYPH      punctuation mark, hyphen
tree      NOUN      NN        noun, singular or mass
.         PUNCT     .         punctuation mark, sentence closer
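
The fixed width of 10 in the f-string above is too narrow for "underneath", so that row runs into the next column. One possible workaround, sketched below, is to size each column from its longest entry. The `rows` list is a small sample copied by hand from the output above, and `format_rows` is a hypothetical helper, not part of spaCy.

```python
# Sample (text, pos, tag) rows copied by hand from the output above.
# In a live session they could be built with:
#   rows = [(t.text, t.pos_, t.tag_) for t in sentences[2]]
rows = [
    ("They", "PRON", "PRP"),
    ("underneath", "ADP", "IN"),
    ("the", "DET", "DT"),
    ("root", "NOUN", "NN"),
]


def format_rows(rows, gap=2):
    """Left-align every column, padded to its longest entry plus `gap` spaces."""
    n_cols = len(rows[0])
    widths = [max(len(row[i]) for row in rows) + gap for i in range(n_cols)]
    return [
        "".join(f"{cell:{width}}" for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


for line in format_rows(rows):
    print(line)
```

Because the widths are computed from the data, the columns stay separated however long the longest token is.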

3. Provide a frequency list of POS tags from the entire document

 ###  type your code
83. ADJ  : 83
84. ADP  : 127
85. ADV  : 75
88. CCONJ: 61
89. DET  : 90
91. NOUN : 176
92. NUM  : 8
93. PART : 36
94. PRON : 72
95. PROPN: 75
96. PUNCT: 174
99. VERB : 182
102. SPACE: 99

Solution

POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
{90: 91,
 96: 73,
 85: 124,
 97: 171,
 93: 8,
 103: 99,
 86: 63,
 98: 20,
 92: 170,
 95: 108,
 100: 135,
 84: 56,
 89: 61,
 94: 30,
 87: 49}
for key,value in sorted(POS_counts.items()):
    print(f'{key}. {doc.vocab[key].text:{10}}:  {value}')
84. ADJ       :  56
85. ADP       :  124
86. ADV       :  63
87. AUX       :  49
89. CCONJ     :  61
90. DET       :  91
92. NOUN      :  170
93. NUM       :  8
94. PART      :  30
95. PRON      :  108
96. PROPN     :  73
97. PUNCT     :  171
98. SCONJ     :  20
100. VERB      :  135
103. SPACE     :  99
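
The list above is ordered by attribute ID. If a ranking by frequency is more useful, the same counts can be sorted by value instead. The sketch below assumes the counts and names are copied by hand from the output above so that it runs without a loaded model. In a live session `POS_counts` and `doc.vocab[key].text` would supply them.

```python
# ID -> count pairs copied from the POS_counts output above.
pos_counts = {
    90: 91, 96: 73, 85: 124, 97: 171, 93: 8, 103: 99, 86: 63, 98: 20,
    92: 170, 95: 108, 100: 135, 84: 56, 89: 61, 94: 30, 87: 49,
}

# ID -> name pairs copied from the printed list above.
pos_names = {
    84: "ADJ", 85: "ADP", 86: "ADV", 87: "AUX", 89: "CCONJ", 90: "DET",
    92: "NOUN", 93: "NUM", 94: "PART", 95: "PRON", 96: "PROPN",
    97: "PUNCT", 98: "SCONJ", 100: "VERB", 103: "SPACE",
}

# Sort by count, most frequent first.
ranked = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for pos_id, count in ranked:
    print(f"{pos_names[pos_id]:<6}{count:>5}")
```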

4. CHALLENGE: What percentage of tokens are nouns?
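HINT: the attribute ID for 'NOUN' is 91 in the spaCy version the assessment was written for; in the version used here it is 92, as the frequency list above shows.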

POS_counts[92]/sum(POS_counts.values())
0.13513513513513514
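
The result above is a fraction, while the question asks for a percentage. A minimal sketch of the conversion follows, assuming the counts are copied by hand from the POS_counts output above. The ID 92 is hard-coded to match this run. In a live session `spacy.symbols.NOUN` should give the same value without hard-coding it.

```python
# ID -> count pairs copied from the POS_counts output above.
pos_counts = {
    90: 91, 96: 73, 85: 124, 97: 171, 93: 8, 103: 99, 86: 63, 98: 20,
    92: 170, 95: 108, 100: 135, 84: 56, 89: 61, 94: 30, 87: 49,
}

NOUN_ID = 92  # attribute ID shown for NOUN in the frequency list above

total = sum(pos_counts.values())
noun_share = pos_counts[NOUN_ID] / total

# The ".2%" format multiplies by 100 and appends a percent sign.
print(f"{pos_counts[NOUN_ID]} of {total} tokens are nouns ({noun_share:.2%})")
```

Note that the denominator counts every token, including the 171 PUNCT and 99 SPACE tokens, so the share of nouns among words alone would be higher.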

5. Display the Dependency Parse for the third sentence

displacy.render(sentences[2],style = "dep")
[Dependency parse diagram for the third sentence. Arc labels, left to right: dep, nsubj, prep, poss, pobj, prep, det, compound, pobj, prep, det, pobj, prep, det, dep, advmod, amod, compound, pobj]

6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit

 
The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional
def show_ents(doc):
    # Print each entity with its label and the label's description
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print("No entity found")
for ent in doc.ents[:2]:
    print(ent.text + ' - '+ent.label_+' - '+ str(spacy.explain(ent.label_)))
Beatrix Potter - PERSON - People, including fictional
1902 - DATE - Absolute or relative dates or periods

7. How many sentences are contained in The Tale of Peter Rabbit?

len(sentences)
54

8. CHALLENGE: How many sentences contain named entities?

 ###  type your code
49
list_of_ners = [sent for sent in sentences if sent.ents]
len(list_of_ners)
23
counter = 0
for sent in sentences:
    if sent.ents:
        counter += 1

print(counter)
23

9. CHALLENGE: Display the named entity visualization for the first sentence in the list from the previous problem (list_of_ners[0])

displacy.render(list_of_ners[0], style="ent")
The Tale of Peter Rabbit, by [Beatrix Potter | PERSON] ([1902 | DATE]).