3.6 Part of Speech - Practice
This post provides additional practice on the concepts covered in this chapter.
For this assessment we'll be using the short story The Tale of Peter Rabbit by Beatrix Potter (1902).
The story is in the public domain; the text file was obtained from Project Gutenberg.
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy
1. Create a Doc object from the file peterrabbit.txt
HINT: Use with open('../TextFiles/peterrabbit.txt') as f: and adjust the path to wherever your copy of the file is stored.
with open('data_files/peterrabbit.txt') as f:
    doc = nlp(f.read())
2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG, and the description of the fine-grained tag.
Solution
sentences = [sent for sent in doc.sents]
for token in sentences[2]:
    print(f'{token.text:{10}}{token.pos_:{10}}{token.tag_:{10}}{spacy.explain(token.tag_)}')
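The nested braces in a format spec such as `{token.text:{10}}` set a minimum field width of 10 characters, and strings are left-aligned by default. A minimal sketch of that padding in plain Python, using hand-written rows as hypothetical stand-ins for real spaCy tokens (the tags shown are illustrative, not output from the model):

```python
# Hypothetical (text, pos, tag) rows standing in for spaCy tokens.
rows = [
    ("They", "PRON", "PRP"),
    ("lived", "VERB", "VBD"),
    ("with", "ADP", "IN"),
]

def format_row(text, pos, tag, width=10):
    # {value:{width}} pads each string on the right up to `width` characters.
    return f"{text:{width}}{pos:{width}}{tag:{width}}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Values longer than the width are not truncated, so a long token simply pushes the remaining columns to the right.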
3. Provide a frequency list of POS tags from the entire document
### type your code
Solution
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
for key, value in sorted(POS_counts.items()):
    print(f'{key}. {doc.vocab[key].text:{10}}: {value}')
4. CHALLENGE: What percentage of tokens are nouns?
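HINT: the attribute ID for 'NOUN' is 92 in recent spaCy releases; confirm it with spacy.symbols.NOUN, since the hint and the code must use the same ID.
# Multiply by 100 to express the noun share as a percentage
100 * POS_counts[92] / sum(POS_counts.values())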
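Hard-coding the integer ID is fragile, because the value has to be looked up separately and kept in sync with the installed spaCy version. One alternative is to tally the string tags (`token.pos_`) directly. The sketch below shows the arithmetic with `collections.Counter` on a hand-tagged list of hypothetical tokens rather than the real story; with a spaCy Doc the same tally would be `Counter(token.pos_ for token in doc)`.

```python
from collections import Counter

# Hypothetical hand-tagged tokens standing in for (token.text, token.pos_).
tagged = [
    ("Peter", "PROPN"), ("ate", "VERB"), ("some", "DET"),
    ("lettuces", "NOUN"), ("and", "CCONJ"), ("some", "DET"),
    ("beans", "NOUN"), (".", "PUNCT"),
]

# Frequency of each coarse POS tag, keyed by its name instead of its ID.
pos_counts = Counter(pos for _, pos in tagged)

# Share of tokens tagged NOUN, as a percentage of all tokens.
noun_pct = 100 * pos_counts["NOUN"] / sum(pos_counts.values())

for pos, count in sorted(pos_counts.items()):
    print(f"{pos:{10}}: {count}")
print(f"NOUN share: {noun_pct:.1f}%")
```

Keying the counts by name makes the noun lookup self-documenting, at the cost of hashing strings instead of integers.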
5. Display the Dependency Parse for the third sentence
displacy.render(sentences[2],style = "dep")
6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit
def show_ents(doc):
    # Print each entity with its label and the label's description
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print("No entity found")
for ent in doc.ents[:2]:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
7. How many sentences are contained in The Tale of Peter Rabbit?
len(sentences)
8. CHALLENGE: How many sentences contain named entities?
### type your code
list_of_ners = [sent for sent in sentences if sent.ents]
len(list_of_ners)
# Alternative: count with an explicit loop
counter = 0
for sent in sentences:
    if sent.ents:
        counter = counter + 1
print(counter)
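Both approaches above count the sentences whose `ents` is non-empty. A compact variant is `sum()` over a generator expression. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for spaCy sentence spans so that it runs without a model; the entity strings are made up for illustration and are not model output.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for sentence spans; real spans expose .ents
# as a sequence of entity spans.
fake_sentences = [
    SimpleNamespace(text="Once upon a time there were four little Rabbits.", ents=("four",)),
    SimpleNamespace(text="They lived with their Mother in a sand-bank.", ents=()),
    SimpleNamespace(text="Mr. McGregor was on his hands and knees.", ents=("McGregor",)),
]

# List-comprehension count, as in the first solution.
with_ents = [sent for sent in fake_sentences if sent.ents]

# Equivalent count that avoids building the intermediate list.
count = sum(1 for sent in fake_sentences if sent.ents)

print(len(with_ents), count)
```

The list version is still the better choice when the matching sentences themselves are needed afterwards, for example to visualize one of them.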
9. CHALLENGE: Display the named entity visualization for the first sentence, sentences[0], from the previous problem
displacy.render(sentences[0], style = "ent")