2.4 Lemmatization
This post explains lemmatization.
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
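To make the contrast with stemming concrete, here is a minimal, purely illustrative sketch (this is *not* how spaCy works internally): a suffix-stripping stemmer has no way to recover 'mouse' from 'mice', but a lemmatizer backed by a vocabulary lookup can map irregular forms to their lemma. The lookup table and function names below are hypothetical.

```python
# Toy lookup table of irregular forms -> lemmas (hypothetical, for illustration).
# A real lemmatizer combines rules, a morphological analysis, and exception
# tables like this one.
LEMMA_LOOKUP = {
    'was': 'be', 'were': 'be', 'is': 'be', 'am': 'be',
    'mice': 'mouse', 'geese': 'goose',
    'ran': 'run', 'running': 'run',
}

def toy_lemmatize(word):
    """Return the lemma for known irregular forms, else the word itself."""
    return LEMMA_LOOKUP.get(word.lower(), word.lower())

print(toy_lemmatize('mice'))  # mouse
print(toy_lemmatize('was'))   # be
```

No amount of suffix stripping turns 'was' into 'be'; that mapping has to come from knowledge of the language, which is exactly what the lookup stands in for here.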
import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(f'{token.text:<10}{token.pos_:<10}{token.lemma:<25}{token.lemma_:<10}')
Here we see that 'running', 'run' and 'ran' all share the same lemma, 'run', which spaCy represents internally by the hash value 12767647472892411841.
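That large integer is the ID spaCy uses in place of the lemma string, so comparing lemmas is a cheap integer comparison. The class below is a hypothetical sketch of the idea only; spaCy's real StringStore derives the ID from a 64-bit hash of the text rather than assigning sequential numbers.

```python
# Hypothetical sketch of string interning: each distinct string gets one
# stable integer ID, so two tokens have the same lemma exactly when their
# lemma IDs are equal.
class ToyStringStore:
    def __init__(self):
        self._ids = {}

    def add(self, s):
        # Assign the next free ID the first time we see a string.
        if s not in self._ids:
            self._ids[s] = len(self._ids)
        return self._ids[s]

store = ToyStringStore()
# 'running', 'run' and 'ran' all lemmatize to 'run', so they share one ID.
ids = [store.add('run') for _ in range(3)]
print(ids)  # [0, 0, 0]
```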
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(doc3)
Here we see how 'meeting' is lemmatized differently depending on its part of speech: as a verb it reduces to 'meet', while as a noun it stays 'meeting'.
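Why does the part of speech matter? Because the same surface form can belong to different paradigms. A minimal sketch of the idea, with a hypothetical lookup keyed on (word, POS) pairs (spaCy's actual lemmatizer is rule-based and more sophisticated):

```python
# Hypothetical (word, POS) -> lemma table, for illustration only.
POS_LEMMAS = {
    ('meeting', 'VERB'): 'meet',
    ('meeting', 'NOUN'): 'meeting',
    ('saw', 'VERB'): 'see',
    ('saw', 'NOUN'): 'saw',
}

def lemmatize_with_pos(word, pos):
    """Look up the (word, POS) pair; fall back to the word itself."""
    return POS_LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize_with_pos('meeting', 'VERB'))  # meet
print(lemmatize_with_pos('meeting', 'NOUN'))  # meeting
```

This is why lemmatizers are typically run after a POS tagger: without the tag, 'meeting' (and 'saw') would be ambiguous.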
doc4 = nlp(u"That's an enormous automobile")
show_lemmas(doc4)
Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.