2.7 NLP Basics Practice
This post is a practice exercise covering the topics learned in Chapter 2.
For this assessment we'll be using the short story An Occurrence at Owl Creek Bridge by Ambrose Bierce (1890).
The story is in the public domain; the text file was obtained from Project Gutenberg.
import spacy
nlp = spacy.load('en_core_web_sm')
1. Create a Doc object from the file owlcreek.txt
HINT: Use with open('../TextFiles/owlcreek.txt') as f: to read the file, then check the result with doc[:36]
Solution
with open('data_files/owlcreek.txt') as f:
    doc = nlp(f.read())
print(doc[:36])
2. How many tokens are contained in the file?
Solution
len(doc)
3. How many sentences are contained in the file?
HINT: You'll want to build a list first!
Solution
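# Option 1: build the list with an explicit loop
sentences = []
for sent in doc.sents:
    sentences.append(sent)
len(sentences)
# Option 2: the same result with a list comprehension
sents = [sent for sent in doc.sents]
len(sents)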
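The hint asks for a list because doc.sents yields sentence spans one at a time rather than returning an indexable sequence. The sketch below illustrates this on a short made-up text. It assumes spaCy 3.x and uses a blank English pipeline with the rule-based sentencizer in place of en_core_web_sm, so the names nlp_demo, demo_doc and demo_sentences are illustrative only, and the sentence boundaries it finds may differ from those of the statistical model.

```python
import spacy

# Assumption: spaCy 3.x. A blank pipeline plus the rule-based sentencizer
# stands in for en_core_web_sm so that no model download is required.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

# Hypothetical sample text, not taken from owlcreek.txt
demo_doc = nlp_demo("A man stood upon a bridge. He looked down. The water was swift.")

# Materialise the sentence spans so they can be counted and indexed
demo_sentences = list(demo_doc.sents)
sentence_count = len(demo_sentences)

print(sentence_count)
print(demo_sentences[1].text)
```

Once the spans are in a list, the usual zero-based indexing applies, so demo_sentences[1] is the second sentence.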
4. Print the second sentence in the document
HINT: Indexing starts at zero, and the title counts as the first sentence.
Solution
# Index 1 is the second sentence, because the title is sentence 0
print(sentences[1].text)
5. For each token in the sentence above, print its text, POS tag, dep tag and lemma
CHALLENGE: Have values line up in columns in the print output.
Solution
for token in sentences[1]:
    # Right-align each attribute in a 10-character column
    print(f'{token.text:>10}{token.pos_:>10}{token.dep_:>10}{token.lemma_:>10}')
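A fixed column width works until a token is longer than the width. One possible alternative, sketched below in plain Python, computes each column width from the data. The rows list is a hypothetical stand-in for (text, POS, dep, lemma) values read from tokens, and the two-space padding is an arbitrary choice.

```python
# Hypothetical (text, POS, dep, lemma) rows standing in for token attributes
rows = [
    ("A", "DET", "det", "a"),
    ("man", "NOUN", "nsubj", "man"),
    ("stood", "VERB", "ROOT", "stand"),
]

# Each column is as wide as its longest entry, plus two spaces of padding
widths = [max(len(row[i]) for row in rows) + 2 for i in range(4)]

lines = []
for row in rows:
    # '<' left-aligns each value within its computed column width
    lines.append("".join(f"{value:<{width}}" for value, width in zip(row, widths)))

print("\n".join(lines))
```

The nested braces in the format spec let the width come from a variable, which is what makes the per-column calculation possible.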
6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text
HINT: You should include an 'IS_SPACE': True pattern between the two words!
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
Solution
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]
matcher.add('Swimming', None, pattern)  # spaCy 2.x signature
Note: In spaCy 3.x the call above raises a TypeError, because Matcher.add now takes a list of patterns as its second argument. See https://stackoverflow.com/questions/70321680/typeerror-add-takes-exactly-2-positional-arguments-3-given
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]
matcher.add('Swimming', [pattern])  # spaCy 3.x signature
found_matches = matcher(doc)
print(found_matches)
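Each entry the matcher returns is a (match_id, start, end) tuple, where match_id is a hash of the rule name and start and end are token indices. The sketch below shows one way to decode them on a made-up text. It assumes spaCy 3.x, and nlp_demo, demo_matcher and demo_doc are illustrative names. The second occurrence is split by a line break to show why the optional IS_SPACE token is needed.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy 3.x. A blank pipeline provides only the tokenizer,
# which is enough for the LOWER and IS_SPACE token attributes.
nlp_demo = spacy.blank("en")
demo_matcher = Matcher(nlp_demo.vocab)
demo_pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True, "OP": "*"}, {"LOWER": "vigorously"}]
demo_matcher.add("Swimming", [demo_pattern])

# Hypothetical text: the second occurrence is split by a newline,
# which the tokenizer keeps as a separate whitespace token.
demo_doc = nlp_demo("He was swimming vigorously. Later he was swimming\nvigorously again.")

results = []
for match_id, start, end in demo_matcher(demo_doc):
    name = nlp_demo.vocab.strings[match_id]  # look up the rule name from its hash
    results.append((name, start, end, demo_doc[start:end].text))
    print(name, start, end, repr(demo_doc[start:end].text))
```

The start and end values are what the hard-coded slices in the next exercise are based on.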
7. Print the text surrounding each found match
Solution
print(doc[1265:1290])
print(doc[3600:3615])
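The slice boundaries above are typed in by hand from the matcher output, so they stop working if the tokenization changes. A possible alternative is to derive the window from each match, as sketched below. It assumes spaCy 3.x. The sample sentence, the window size and the names nlp_demo, demo_matcher, demo_doc and contexts are all hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy 3.x, tokenizer-only pipeline
nlp_demo = spacy.blank("en")
demo_matcher = Matcher(nlp_demo.vocab)
demo_matcher.add(
    "Swimming",
    [[{"LOWER": "swimming"}, {"IS_SPACE": True, "OP": "*"}, {"LOWER": "vigorously"}]],
)

# Hypothetical sample sentence, not quoted from owlcreek.txt
demo_doc = nlp_demo("He dived under the bridge and, swimming vigorously, reached the bank.")

window = 3  # hypothetical number of context tokens on each side
contexts = []
for _, start, end in demo_matcher(demo_doc):
    # max() keeps the left edge at 0; a negative start would count from the end
    context = demo_doc[max(start - window, 0):end + window]
    contexts.append(context.text)
    print(context.text)
```

An end index past the last token is harmless, since the slice is clipped to the document length.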
EXTRA CREDIT:
Print the sentence that contains each found match
Solution
for _, start, _ in found_matches:
    for sent in sentences:
        # The first sentence that ends after the match start contains the match
        if start < sent.end:
            print(sent)
            break
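As an alternative to scanning the sentence list, a Span exposes a sent property that returns the sentence containing it. The sketch below shows this on a made-up text. It assumes spaCy 3.x and uses the rule-based sentencizer to set sentence boundaries. The names nlp_demo, demo_matcher, demo_doc and matched_sentences are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy 3.x. The sentencizer sets the sentence boundaries
# that Span.sent relies on.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")
demo_matcher = Matcher(nlp_demo.vocab)
demo_matcher.add(
    "Swimming",
    [[{"LOWER": "swimming"}, {"IS_SPACE": True, "OP": "*"}, {"LOWER": "vigorously"}]],
)

# Hypothetical sample text, not taken from owlcreek.txt
demo_doc = nlp_demo("The water was cold. He was swimming vigorously toward the bank. Nobody saw him.")

matched_sentences = []
for _, start, end in demo_matcher(demo_doc):
    # Slice out the matched span, then ask for its enclosing sentence
    matched_sentences.append(demo_doc[start:end].sent.text)

print(matched_sentences)
```

This avoids the inner loop entirely, at the cost of depending on whichever component set the sentence boundaries.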