World's Best AI Learning Platform with profoundly Demanding Certification Programs
Designed by IITians, only for AI Learners.
Internship Partner

In Association with
In collaboration with



Shashank Shanu
2 years ago
pip install spacy
python -m spacy download en_core_web_sm
import spacy
nlp=spacy.load("en_core_web_sm")
nlp
<spacy.lang.en.English at 0x7fa543f7d850>
# Create document
my_text = "India (Hindi: Bhārat), officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia. It is the second-most populous country, the seventh-largest country by land area, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia."
my_doc = nlp(my_text)
type(my_doc)
print()
print(my_doc)
India (Hindi: Bhārat), officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia. It is the second-most populous country, the seventh-largest country by land area, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.
# Printing the tokens of my_doc
for token in my_doc:
    print(token.text)
India
(
Hindi
:
Bhārat
)
,
officially
the
Republic
of
India
(
Hindi
:
Bhārat
Gaṇarājya),[23
]
is
a
country
in
South
Asia
.
It
is
the
second
-
most
populous
country
,
the
seventh
-
largest
country
by
land
area
,
and
the
most
populous
democracy
in
the
world
.
Bounded
by
the
Indian
Ocean
on
the
south
,
the
Arabian
Sea
on
the
southwest
,
and
the
Bay
of
Bengal
on
the
southeast
,
it
shares
land
borders
with
Pakistan
to
the
west;[f
]
China
,
Nepal
,
and
Bhutan
to
the
north
;
and
Bangladesh
and
Myanmar
to
the
east
.
In
the
Indian
Ocean
,
India
is
in
the
vicinity
of
Sri
Lanka
and
the
Maldives
;
its
Andaman
and
Nicobar
Islands
share
a
maritime
border
with
Thailand
and
Indonesia
.
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
    print(token.text,'--',token.is_stop,'---',token.is_punct)
India -- False --- False
( -- False --- True
Hindi -- False --- False
: -- False --- True
Bhārat -- False --- False
) -- False --- True
, -- False --- True
officially -- False --- False
the -- True --- False
Republic -- False --- False
of -- True --- False
India -- False --- False
( -- False --- True
Hindi -- False --- False
: -- False --- True
Bhārat -- False --- False
Gaṇarājya),[23 -- False --- False
] -- False --- True
is -- True --- False
a -- True --- False
country -- False --- False
in -- True --- False
South -- False --- False
Asia -- False --- False
. -- False --- True
It -- True --- False
is -- True --- False
the -- True --- False
second -- False --- False
- -- False --- True
most -- True --- False
populous -- False --- False
country -- False --- False
, -- False --- True
the -- True --- False
seventh -- False --- False
- -- False --- True
largest -- False --- False
country -- False --- False
by -- True --- False
land -- False --- False
area -- False --- False
, -- False --- True
and -- True --- False
the -- True --- False
most -- True --- False
populous -- False --- False
democracy -- False --- False
in -- True --- False
the -- True --- False
world -- False --- False
. -- False --- True
Bounded -- False --- False
by -- True --- False
the -- True --- False
Indian -- False --- False
Ocean -- False --- False
on -- True --- False
the -- True --- False
south -- False --- False
, -- False --- True
the -- True --- False
Arabian -- False --- False
Sea -- False --- False
on -- True --- False
the -- True --- False
southwest -- False --- False
, -- False --- True
and -- True --- False
the -- True --- False
Bay -- False --- False
of -- True --- False
Bengal -- False --- False
on -- True --- False
the -- True --- False
southeast -- False --- False
, -- False --- True
it -- True --- False
shares -- False --- False
land -- False --- False
borders -- False --- False
with -- True --- False
Pakistan -- False --- False
to -- True --- False
the -- True --- False
west;[f -- False --- False
] -- False --- True
China -- False --- False
, -- False --- True
Nepal -- False --- False
, -- False --- True
and -- True --- False
Bhutan -- False --- False
to -- True --- False
the -- True --- False
north -- False --- False
; -- False --- True
and -- True --- False
Bangladesh -- False --- False
and -- True --- False
Myanmar -- False --- False
to -- True --- False
the -- True --- False
east -- False --- False
. -- False --- True
In -- True --- False
the -- True --- False
Indian -- False --- False
Ocean -- False --- False
, -- False --- True
India -- False --- False
is -- True --- False
in -- True --- False
the -- True --- False
vicinity -- False --- False
of -- True --- False
Sri -- False --- False
Lanka -- False --- False
and -- True --- False
the -- True --- False
Maldives -- False --- False
; -- False --- True
its -- True --- False
Andaman -- False --- False
and -- True --- False
Nicobar -- False --- False
Islands -- False --- False
share -- False --- False
a -- True --- False
maritime -- False --- False
border -- False --- False
with -- True --- False
Thailand -- False --- False
and -- True --- False
Indonesia -- False --- False
. -- False --- True
# Removing stop words and punctuation
cleaned_my_doc = [token for token in my_doc if not token.is_stop and not token.is_punct]
for token in cleaned_my_doc:
    print(token.text)
India
Hindi
Bhārat
officially
Republic
India
Hindi
Bhārat
Gaṇarājya),[23
country
South
Asia
second
populous
country
seventh
largest
country
land
area
populous
democracy
world
Bounded
Indian
Ocean
south
Arabian
Sea
southwest
Bay
Bengal
southeast
shares
land
borders
Pakistan
west;[f
China
Nepal
Bhutan
north
Bangladesh
Myanmar
east
Indian
Ocean
India
vicinity
Sri
Lanka
Maldives
Andaman
Nicobar
Islands
share
maritime
border
Thailand
Indonesia
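Tokens such as `Gaṇarājya),[23` and `west;[f` survive the filter above because they mix letters with footnote residue, so `is_punct` is False for them. Below is a minimal sketch of a stricter filter, assuming you also want to keep only purely alphabetic tokens through spaCy's `token.is_alpha` attribute. The helper name `clean_tokens` and the stand-in class `FakeToken` are hypothetical: the stand-ins only mimic the three attributes used here so the snippet runs without a model, and their `is_alpha` is approximated with `str.isalpha()`.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the Token attributes used below."""
    text: str
    is_stop: bool = False
    is_punct: bool = False

    @property
    def is_alpha(self) -> bool:
        # Approximation of spaCy's is_alpha: True only if every character is a letter.
        return self.text.isalpha()


def clean_tokens(tokens, alpha_only=False):
    """Return the texts of tokens that are neither stop words nor punctuation.

    With alpha_only=True, tokens containing non-letter characters are dropped too.
    """
    kept = []
    for token in tokens:
        if token.is_stop or token.is_punct:
            continue
        if alpha_only and not token.is_alpha:
            continue
        kept.append(token.text)
    return kept


# A few tokens copied from the output above, flagged the way spaCy flagged them.
sample = [
    FakeToken("the", is_stop=True),
    FakeToken("Republic"),
    FakeToken("of", is_stop=True),
    FakeToken("India"),
    FakeToken("(", is_punct=True),
    FakeToken("Bhārat"),
    FakeToken("Gaṇarājya),[23"),
    FakeToken("]", is_punct=True),
]

print(clean_tokens(sample))
print(clean_tokens(sample, alpha_only=True))
```

With a real pipeline the same helper should accept `my_doc` directly, since it only reads attributes that spaCy tokens provide.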
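# Lemmatizing the tokens of new_doc
# Note: the output below is from spaCy v2; from v3 onward the lemmatizer
# returns the pronoun itself rather than -PRON-
my_text='she played chess against rita she likes playing chess.'
new_doc=nlp(my_text)
for token in new_doc:
    print(token.lemma_)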
-PRON-
play
chess
against
rita
-PRON-
like
play
chess
.
# Strings to Hashes and Back
doc2 = nlp("I love playing")
# Look up the hash for the word "playing"
word_hash = nlp.vocab.strings["playing"]
print(word_hash)
# Look up the word_hash to get the string
word_string = nlp.vocab.strings[word_hash]
print(word_string)
13803694918078379268
playing
# Creating two different documents with a common word
doc_1 = nlp('Adidas shoes are famous')
doc_2 = nlp('I washed my shoes ')
# Printing the hash value for each token in the doc
print('**********DOC 1**********')
for token in doc_1:
    hash_value=nlp.vocab.strings[token.text]
    print(token.text ,' ',hash_value)
print()
print('**********DOC 2**********')
for token in doc_2:
    hash_value=nlp.vocab.strings[token.text]
    print(token.text ,' ',hash_value)
**********DOC 1**********
Adidas 2449447890689859070
shoes 2716266617130919512
are 5012629990875267006
famous 17809293829314912000
**********DOC 2**********
I 4690420944186131903
washed 5520327350569975027
my 227504873216781231
shoes 2716266617130919512
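`shoes` gets the same value in both documents because the hash depends only on the string, not on the document it appears in. The toy class below sketches that idea with a two-way lookup. It is not spaCy's implementation: `ToyStringStore` and `hash_string` are hypothetical names, and the sketch uses `hashlib.blake2b` truncated to 64 bits in place of spaCy's own hash function, so the numbers will differ from those printed above.

```python
import hashlib


def hash_string(text: str) -> int:
    """Map a string to a deterministic 64-bit integer (stand-in for spaCy's hash)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Toy two-way lookup: string -> hash is computed, hash -> string is remembered."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # Hashing needs no stored state, so any string can be looked up.
            return hash_string(item)
        # The reverse direction only works for strings added earlier.
        return self._by_hash[item]


store = ToyStringStore()
doc_1_hashes = [store.add(word) for word in "Adidas shoes are famous".split()]
doc_2_hashes = [store.add(word) for word in "I washed my shoes".split()]

# "shoes" is the second word of doc 1 and the fourth word of doc 2.
print(doc_1_hashes[1] == doc_2_hashes[3])
print(store[store["shoes"]])
```

The sketch also suggests why the reverse lookup needs the string to have been seen first: a hash alone carries no way to recover the text.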
# Printing the tokens which are like numbers
text=' Year 2020 is far worse than 2009'
doc=nlp(text)
for token in doc:
    if token.like_num:
        print(token)
2020
2009
# POS tagging using spaCy
my_text='John plays basketball,if time permits. He played in high school too.'
my_doc=nlp(my_text)
for token in my_doc:
    print(token.text,'---- ',token.pos_)
John ---- PROPN
plays ---- VERB
basketball ---- NOUN
, ---- PUNCT
if ---- SCONJ
time ---- NOUN
permits ---- VERB
. ---- PUNCT
He ---- PRON
played ---- VERB
in ---- ADP
high ---- ADJ
school ---- NOUN
too ---- ADV
. ---- PUNCT
spacy.explain('SCONJ')
'subordinating conjunction'
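Once every token carries a tag, a natural follow-up is to count how often each tag occurs or to pull out the words with one particular tag. The sketch below does both with the (text, tag) pairs printed above, copied in by hand so that it runs without a model. With spaCy the list could be built as `[(token.text, token.pos_) for token in my_doc]`. The variable names are illustrative only.

```python
from collections import Counter

# (text, coarse POS tag) pairs copied from the output above.
tagged = [
    ("John", "PROPN"), ("plays", "VERB"), ("basketball", "NOUN"),
    (",", "PUNCT"), ("if", "SCONJ"), ("time", "NOUN"),
    ("permits", "VERB"), (".", "PUNCT"), ("He", "PRON"),
    ("played", "VERB"), ("in", "ADP"), ("high", "ADJ"),
    ("school", "NOUN"), ("too", "ADV"), (".", "PUNCT"),
]

# Frequency of each tag across the two sentences.
pos_counts = Counter(tag for _, tag in tagged)

# All words tagged as verbs, in document order.
verbs = [text for text, tag in tagged if tag == "VERB"]

for tag, count in pos_counts.most_common():
    print(tag, count)
print(verbs)
```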
# Preparing the spaCy document
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)
# Printing the named entities
print(doc.ents)
(Tony Stark, StarkEnterprises, Emily Clark, Microsoft, Manchester, Bible, French)
# Printing labels of entities.
for entity in doc.ents:
    print(entity.text,'--- ',entity.label_)
Tony Stark --- PERSON
StarkEnterprises --- ORG
Emily Clark --- PERSON
Microsoft --- ORG
Manchester --- GPE
Bible --- WORK_OF_ART
French --- NORP
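Entity labels become easier to work with when the entities are grouped by label. Here is a small sketch using the (text, label) pairs from the output above, again copied in by hand so that it runs without a model. With spaCy the pairs could come from `[(ent.text, ent.label_) for ent in doc.ents]`. The name `by_label` is an illustrative choice.

```python
from collections import defaultdict

# (entity text, entity label) pairs copied from the output above.
entities = [
    ("Tony Stark", "PERSON"),
    ("StarkEnterprises", "ORG"),
    ("Emily Clark", "PERSON"),
    ("Microsoft", "ORG"),
    ("Manchester", "GPE"),
    ("Bible", "WORK_OF_ART"),
    ("French", "NORP"),
]

# Collect entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```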
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)