The first step in creating a Doc object is to break the raw text down into smaller pieces, or tokens.

import spacy
nlp = spacy.load('en_core_web_sm')
mystring = '"We\'re moving to L.A.!"'
print(mystring)
"We're moving to L.A.!"
doc = nlp(mystring)

for token in doc:
    print(token.text, end = " | ")
" | We | 're | moving | to | L.A. | ! | " | 

spaCy applies the following sequence of rules to break up the text:

  • Prefix: Character(s) at the beginning ▸ $ ( “ ¿
  • Suffix: Character(s) at the end ▸ km ) , . ! ”
  • Infix: Character(s) in between ▸ - -- / ...
  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ St. U.S.
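The splitting sequence above can be sketched in plain Python. This is an illustration only, with a tiny hand-picked rule set; spaCy's real tokenizer compiles prefix/suffix/infix regexes and uses a large exception table, and its exceptions can also *split* tokens (e.g. "We're" → We + 're), which this sketch does not model.

```python
# Much-simplified sketch of the prefix/suffix/exception loop (illustration
# only). Infixes such as the hyphen in "snail-mail" are not handled here.
PREFIXES = ('"', "'", '$', '(')
SUFFIXES = ('"', "'", ')', ',', '!', '?', '.')
EXCEPTIONS = {'St.', 'U.S.', 'L.A.'}   # kept whole despite trailing periods

def simple_tokenize(text):
    tokens = []
    for chunk in text.split():
        trailing = []
        while chunk.startswith(PREFIXES):    # peel prefixes off the front
            tokens.append(chunk[0])
            chunk = chunk[1:]
        while chunk.endswith(SUFFIXES) and chunk not in EXCEPTIONS:
            trailing.insert(0, chunk[-1])    # peel suffixes off the end
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

print(simple_tokenize('"We\'re moving to L.A.!"'))
# ['"', "We're", 'moving', 'to', 'L.A.', '!', '"']
```

Note that "We're" stays whole here, whereas spaCy's exception rules split it into We + 're.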

Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.

Prefixes, Suffixes and Infixes

spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)
We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!

Note - the dash, exclamation points, and commas are assigned separate tokens, but the email address and website are kept intact.
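This exemption can be sketched as a regex gate that runs before any splitting. Illustration only: spaCy's actual URL and email matching patterns are far more thorough than the toy regex below.

```python
import re

# Toy stand-in for spaCy's URL/email match check (illustration only).
# Chunks that match are exempt from further prefix/suffix splitting.
URL_OR_EMAIL = re.compile(r'^(https?://\S+|[^@\s]+@[^@\s]+\.[^@\s]+)$')

for chunk in ['support@oursite.com', 'http://www.oursite.com', 'help!']:
    kept = bool(URL_OR_EMAIL.match(chunk))
    print(chunk, '->', 'kept whole' if kept else 'eligible for splitting')
```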

doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)
A
5
km
NYC
cab
ride
costs
$
10.30

Here the distance unit and the dollar sign are assigned their own tokens: '5km' splits into '5' and 'km', and '$10.30' into '$' and '10.30'.
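This happens because spaCy's suffix rules cover unit abbreviations like km, while $ is handled as a prefix. A toy version of the number-plus-unit split (the unit list below is a tiny hand-picked subset, not spaCy's actual rules):

```python
import re

# Toy number+unit suffix split (illustration only).
def split_unit(chunk):
    m = re.match(r'^(\d+(?:\.\d+)?)(km|kg|mi|lb)$', chunk)
    return list(m.groups()) if m else [chunk]

print(split_unit('5km'))     # ['5', 'km']
print(split_unit('10.30'))   # ['10.30'] -- no unit, left intact
```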

Exceptions

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)
Let
's
visit
St.
Louis
in
the
U.S.
next
year
.

In this case we see that abbreviations such as St. and U.S. are preserved as single tokens, periods included.

Counting tokens

len(doc)
8

Counting Vocab Entries

len(doc.vocab)
802

Note that the Vocab grows lazily as new texts are processed, so this number varies with the model and session.

Retrieving tokens by index position and slicing

doc5 = nlp(u'It is better to give than to receive.')

### Retrieve the third token:
doc5[2]
better
doc5[2:5]
better to give
 
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')
doc6[3] = doc7[3]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\VICKY~1.CRA\AppData\Local\Temp/ipykernel_11844/3591796456.py in <module>
----> 1 doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

In this case we are trying to replace 'horrible' in doc6 with 'delicious' from doc7, but Doc objects are immutable and do not support item assignment.
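Since a Doc is immutable, the usual workaround is to build a new string and run it through nlp again. The string manipulation is plain Python; the final nlp call is the same as in the examples above.

```python
# Build a corrected string, then reprocess it to get a new Doc.
original = 'My dinner was horrible.'
corrected = original.replace('horrible', 'delicious')
print(corrected)  # My dinner was delicious.
# doc6 = nlp(corrected)   # re-run the pipeline on the new text
```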

Named Entities

The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
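Each entity is a Span carrying character offsets (ent.start_char, ent.end_char) back into the original text. A plain-Python illustration of that slicing, with offsets computed here via str.index rather than taken from spaCy:

```python
text = 'Apple to build a Hong Kong factory for $6 million'

# (start_char, end_char, label) tuples mimicking ent.start_char/end_char.
entities = []
for phrase, label in [('Hong Kong', 'GPE'), ('$6 million', 'MONEY')]:
    start = text.index(phrase)
    entities.append((start, start + len(phrase), label))

for start, end, label in entities:
    print(text[start:end], '-', label)
```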
 

Doc.noun_chunks is another Doc property. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in Sheb Wooley's 1958 song, a "one-eyed, one-horned, flying, purple people-eater" would be one long noun chunk.

doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)
Autonomous cars
insurance liability
manufacturers
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)
Red cars
higher insurance rates
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)
He
purple people-eater
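spaCy derives noun chunks from the dependency parse, but the intuition can be sketched with a rough heuristic: collect maximal runs of determiners, adjectives, and nouns that contain a noun. Illustration only: the tagged pairs below are written out by hand, and this heuristic misses pronoun chunks like "He".

```python
# Rough heuristic: a noun chunk ~ a maximal DET/ADJ/NOUN run containing
# a noun (illustration only; spaCy uses the dependency parse instead).
tagged = [('Red', 'ADJ'), ('cars', 'NOUN'), ('do', 'AUX'), ('not', 'PART'),
          ('carry', 'VERB'), ('higher', 'ADJ'), ('insurance', 'NOUN'),
          ('rates', 'NOUN'), ('.', 'PUNCT')]

chunks, run = [], []
for word, pos in tagged + [('', 'END')]:      # sentinel flushes the last run
    if pos in ('DET', 'ADJ', 'NOUN'):
        run.append((word, pos))
    else:
        if any(p == 'NOUN' for _, p in run):
            chunks.append(' '.join(w for w, _ in run))
        run = []

print(chunks)  # ['Red cars', 'higher insurance rates']
```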

Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy.

from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})
[displaCy dependency-parse diagram: Apple/PROPN (nsubj), is/AUX (aux), going/VERB, to/PART (aux), build/VERB (xcomp), a/DET (det), U.K./PROPN (compound), factory/NOUN (dobj), for/ADP (prep), $/SYM (quantmod), 6/NUM (compound), million/NUM (pobj)]

The optional 'distance' argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

Visualizing the entity recognizer

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)
[displaCy entity highlights: "Over the last quarter" DATE, "Apple" ORG, "nearly 20 thousand" CARDINAL, "iPods" PRODUCT, "$6 million" MONEY]