2.5 Stop Words
This post covers the basics of stop words in spaCy, a library for NLP:
- Print the set of spaCy's default stop words
- Check whether a word is a stop word
- Add a stop word to the default list
- Remove a stop word from the default list
Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered out of the text before it is processed. spaCy holds a built-in list of some 305 English stop words, though the exact count varies between spaCy releases.
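As a minimal sketch of what "filtering" looks like in practice, the helper below keeps only the tokens that are not flagged as stop words. The function name `content_tokens` is made up for illustration; it assumes it is handed a spaCy `Doc` (or any iterable of tokens exposing `text` and `is_stop`) and does not load a model itself.

```python
def content_tokens(doc):
    """Return the text of every token in `doc` that is not a stop word.

    `doc` is assumed to be a spaCy Doc, i.e. an iterable of tokens
    that each expose the `text` and `is_stop` attributes.
    """
    return [token.text for token in doc if not token.is_stop]
```

With a loaded pipeline this would be called as `content_tokens(nlp("The cat sat on the mat"))`. Punctuation is not filtered here; checking `token.is_punct` as well would be one way to drop it.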
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)
nlp.vocab['myself'].is_stop
nlp.vocab['elephant'].is_stop
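Printing the whole set is noisy and unordered, and checking words one at a time gets repetitive. The two small helpers below are a sketch of how the calls above might be wrapped; the names `preview_stop_words` and `stop_word_status` are hypothetical, and both expect the `nlp` object loaded earlier.

```python
def preview_stop_words(nlp, n=10):
    """Return the first `n` default stop words in alphabetical order."""
    return sorted(nlp.Defaults.stop_words)[:n]


def stop_word_status(nlp, words):
    """Map each word to the `is_stop` flag stored on its lexeme."""
    return {word: nlp.vocab[word].is_stop for word in words}
```

For example, `stop_word_status(nlp, ['myself', 'elephant'])` checks both words from the cells above in a single call.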
To add a stop word:
Step 1 - Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')
Step 2 - Set the stop word tag on the lexeme
nlp.vocab['btw'].is_stop = True
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop
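When adding stop words, always use lowercase. Every entry in the default set is lowercase, and recent spaCy versions lowercase a token's text before looking it up in the set, so an entry containing capitals would never match.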
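Since both steps have to be done together, it can be convenient to wrap them in one function. This is only a sketch: `add_stop_words` is a hypothetical helper, not part of spaCy, and it assumes `nlp` is the pipeline loaded earlier.

```python
def add_stop_words(nlp, words):
    """Register each word as a stop word in the default set and in the vocab."""
    for word in words:
        word = word.lower()                 # the default set holds lowercase entries
        nlp.Defaults.stop_words.add(word)   # step 1: add to the set
        nlp.vocab[word].is_stop = True      # step 2: flag the lexeme
```

Calling `add_stop_words(nlp, ['btw'])` performs the same two steps shown above, and lowercasing inside the helper guards against accidentally passing `'BTW'`.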
To remove a stop word:
Step 1 - Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')
Step 2 - Remove the stop word tag from the lexeme
nlp.vocab['beyond'].is_stop = False
len(nlp.Defaults.stop_words)
nlp.vocab['beyond'].is_stop
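The removal steps can be wrapped the same way. The sketch below uses a hypothetical helper named `remove_stop_words` and swaps `set.remove` for `set.discard`, because `remove` raises a `KeyError` when the word is not in the set (for example, if the cell is run twice).

```python
def remove_stop_words(nlp, words):
    """Unregister each word as a stop word in the default set and in the vocab."""
    for word in words:
        word = word.lower()                    # match the lowercase entries in the set
        nlp.Defaults.stop_words.discard(word)  # step 1: no error if the word is absent
        nlp.vocab[word].is_stop = False        # step 2: clear the flag on the lexeme
```

Calling `remove_stop_words(nlp, ['beyond'])` mirrors the two steps above and can safely be repeated.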