Parts of Speech Assessment

For this assessment we'll be using the short story The Tale of Peter Rabbit by Beatrix Potter (1902).
The story is in the public domain; the text file was obtained from Project Gutenberg.

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

1. Create a Doc object from the file peterrabbit.txt

HINT: Use with open('../TextFiles/peterrabbit.txt') as f:

with open('data_files/peterrabbit.txt') as f:
    doc = nlp(f.read())

2. For every token in the third sentence, print the token text, the coarse POS tag, the fine-grained tag (TAG), and the description of the fine-grained tag.

They         PRON   PRP    pronoun, personal
lived        VERB   VBD    verb, past tense
with         ADP    IN     conjunction, subordinating or preposition
their        ADJ    PRP$   pronoun, possessive
Mother       PROPN  NNP    noun, proper singular
in           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
sand         NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
bank         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
underneath   ADP    IN     conjunction, subordinating or preposition
the          DET    DT     determiner
root         NOUN   NN     noun, singular or mass
of           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner

            SPACE         None
very         ADV    RB     adverb
big          ADJ    JJ     adjective
fir          NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
tree         NOUN   NN     noun, singular or mass
.            PUNCT  .      punctuation mark, sentence closer


           SPACE  _SP    None

Solution

# Materialize the sentence generator so sentences can be indexed
sentences = list(doc.sents)

for token in sentences[2]:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_)}')


        SPACE     _SP       whitespace
They      PRON      PRP       pronoun, personal
lived     VERB      VBD       verb, past tense
with      ADP       IN        conjunction, subordinating or preposition
their     PRON      PRP$      pronoun, possessive
Mother    PROPN     NNP       noun, proper singular
in        ADP       IN        conjunction, subordinating or preposition
a         DET       DT        determiner
sand      NOUN      NN        noun, singular or mass
-         PUNCT     HYPH      punctuation mark, hyphen
bank      NOUN      NN        noun, singular or mass
,         PUNCT     ,         punctuation mark, comma
underneathADP       IN        conjunction, subordinating or preposition
the       DET       DT        determiner
root      NOUN      NN        noun, singular or mass
of        ADP       IN        conjunction, subordinating or preposition
a         DET       DT        determiner

         SPACE     _SP       whitespace
very      ADV       RB        adverb
big       ADJ       JJ        adjective (English), other noun-modifier (Chinese)
fir       NOUN      NN        noun, singular or mass
-         PUNCT     HYPH      punctuation mark, hyphen
tree      NOUN      NN        noun, singular or mass
.         PUNCT     .         punctuation mark, sentence closer
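
The fixed width of 10 in the f-string above is exactly the length of "underneath", so that token runs straight into its POS tag. A minimal sketch of one way around this is to size each column from its widest entry. The helper `format_rows` and the sample rows are hypothetical stand-ins, not part of the original notebook:

```python
def format_rows(rows, gap=2):
    """Left-align every column but the last, sized to its widest entry plus a gap."""
    rows = [tuple(str(cell) for cell in row) for row in rows]
    n_padded = len(rows[0]) - 1  # the last column is left unpadded
    widths = [max(len(row[i]) for row in rows) + gap for i in range(n_padded)]
    lines = []
    for row in rows:
        # zip stops after the padded columns; the last cell is appended as-is
        cells = [f"{cell:{width}}" for cell, width in zip(row, widths)]
        lines.append("".join(cells) + row[-1])
    return lines

# Stand-in rows copied from the output above
sample = [
    ("underneath", "ADP", "IN", "conjunction, subordinating or preposition"),
    ("the", "DET", "DT", "determiner"),
]
table = format_rows(sample)
print("\n".join(table))
```

With spaCy loaded, the rows could presumably be built as `[(t.text, t.pos_, t.tag_, spacy.explain(t.tag_)) for t in sentences[2]]`. Whitespace tokens that contain a newline would still break the layout.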

3. Provide a frequency list of POS tags from the entire document

 ###  type your code

83. ADJ  : 83
84. ADP  : 127
85. ADV  : 75
88. CCONJ: 61
89. DET  : 90
91. NOUN : 176
92. NUM  : 8
93. PART : 36
94. PRON : 72
95. PROPN: 75
96. PUNCT: 174
99. VERB : 182
102. SPACE: 99

Solution

POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 91,
 96: 73,
 85: 124,
 97: 171,
 93: 8,
 103: 99,
 86: 63,
 98: 20,
 92: 170,
 95: 108,
 100: 135,
 84: 56,
 89: 61,
 94: 30,
 87: 49}

for key,value in sorted(POS_counts.items()):
    print(f'{key}. {doc.vocab[key].text:{10}}:  {value}')

84. ADJ       :  56
85. ADP       :  124
86. ADV       :  63
87. AUX       :  49
89. CCONJ     :  61
90. DET       :  91
92. NOUN      :  170
93. NUM       :  8
94. PART      :  30
95. PRON      :  108
96. PROPN     :  73
97. PUNCT     :  171
98. SCONJ     :  20
100. VERB      :  135
103. SPACE     :  99
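
As an alternative to `count_by`, the same kind of frequency list can be sketched by counting the string labels directly, which avoids the numeric IDs altogether. This is a sketch under assumptions: `pos_frequencies` is a hypothetical helper, and the `SimpleNamespace` objects stand in for spaCy tokens so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def pos_frequencies(tokens):
    """Count coarse POS labels by name for any iterable of objects with a .pos_ attribute."""
    return Counter(token.pos_ for token in tokens)

# Stand-in tokens (assumption); with spaCy loaded, pass `doc` instead
tokens = [SimpleNamespace(pos_=p) for p in ("PRON", "VERB", "ADP", "PRON", "PROPN", "ADP")]
freqs = pos_frequencies(tokens)

for label, count in sorted(freqs.items()):
    print(f"{label:{10}}: {count}")
```

Because the keys are label strings rather than IDs, the result does not depend on how a given spaCy version numbers its symbols.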

4. CHALLENGE: What percentage of tokens are nouns?
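HINT: the attribute ID for 'NOUN' was 91 in the spaCy version this exercise was written for; in the output above it is 92, so check the ID in your own frequency list.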

POS_counts[92]/sum(POS_counts.values())

0.13513513513513514
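
The value above is a fraction, not a percentage. A minimal sketch of the conversion, using the counts printed earlier, is shown below. `pos_percentage` and `counts_from_listing` are hypothetical names, and the ID 92 is taken from this notebook's output rather than from any fixed spaCy guarantee.

```python
# Counts copied from the POS_counts output above
counts_from_listing = {
    90: 91, 96: 73, 85: 124, 97: 171, 93: 8, 103: 99, 86: 63, 98: 20,
    92: 170, 95: 108, 100: 135, 84: 56, 89: 61, 94: 30, 87: 49,
}
NOUN_ID = 92  # assumption: the ID shown for NOUN in this run; it differs between versions

def pos_percentage(counts, pos_id):
    """Return the share of tokens with the given POS ID, as a percentage."""
    total = sum(counts.values())
    return 100 * counts.get(pos_id, 0) / total

noun_pct = pos_percentage(counts_from_listing, NOUN_ID)
print(f"{noun_pct:.2f}%")  # prints 13.51%
```

To avoid hard-coding the ID, spaCy's `spacy.symbols.NOUN` constant should give the value for the installed version.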

5. Display the Dependency Parse for the third sentence

displacy.render(sentences[2], style="dep")

6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit

The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional

def show_ents(doc):
    # Print each entity with its label and description, or a notice if there are none
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print("No entity found")

for ent in doc.ents[:2]:
    print(ent.text + ' - '+ent.label_+' - '+ str(spacy.explain(ent.label_)))

Beatrix Potter - PERSON - People, including fictional
1902 - DATE - Absolute or relative dates or periods

7. How many sentences are contained in The Tale of Peter Rabbit?

len(sentences)

54

8. CHALLENGE: How many sentences contain named entities?

 ###  type your code

49

# Keep only the sentences that contain at least one named entity
list_of_ners = [sent for sent in sentences if sent.ents]
len(list_of_ners)

23

counter = 0
for sent in sentences:
    if sent.ents:
        counter = counter + 1
    
print(counter)

23
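
The counter loop above can also be written as a single expression. The sketch below assumes only that each sentence exposes an `ents` attribute; `count_with_entities` is a hypothetical helper and the `SimpleNamespace` objects stand in for spaCy spans so the example runs without a model.

```python
from types import SimpleNamespace

def count_with_entities(sents):
    """Count the sentences whose .ents is non-empty."""
    return sum(1 for sent in sents if sent.ents)

# Stand-in sentences (assumption); with spaCy loaded, pass `doc.sents` instead
stand_ins = [
    SimpleNamespace(ents=("Beatrix Potter",)),
    SimpleNamespace(ents=()),
    SimpleNamespace(ents=("1902",)),
]
n_with_ents = count_with_entities(stand_ins)
print(n_with_ents)
```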

9. CHALLENGE: Display the named entity visualization for list_of_ners[0] from the previous problem

displacy.render(list_of_ners[0], style="ent")