Project 1: Most Common NLP tasks using SpaCy

Shashank Shanu

8 months ago

NLP Tasks Using SpaCy

Problem Statement: Perform the most common NLP tasks using spaCy and show the results with explanations.

What is SpaCy?

SpaCy is an advanced modern library for Natural Language Processing developed by Matthew Honnibal and Ines Montani. This notebook is a complete guide to learn how to use spaCy for various NLP tasks.

What are the NLP topics covered by this notebook?

This notebook covers the NLP tasks that come up most often in end-to-end NLP projects. The topics covered are:
  • How to install SpaCy?
  • Creating some documents to perform NLP tasks
  • How to perform Tokenization with spaCy?
  • How to do Text-Preprocessing with spaCy?
  • How to perform Lemmatization using spacy?
  • How to print hash values of strings?
  • How to find Lexical attributes of spaCy?
  • Part of Speech analysis with spaCy
  • Named Entity Recognition using spacy
  • Conclusion

1. How to install SpaCy?

  • To install spaCy, run the command below after removing the # symbol.
#pip install spacy
  • If the package installs properly, pip reports a success message.

Import spacy into jupyter notebook

import spacy
The spaCy package comes with different pre-trained NLP models that can be used to perform most common NLP tasks, such as tokenization, parts of speech (POS) tagging, named entity recognition (NER), lemmatization, transforming to word vectors etc.
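If you are working with a particular language, you can load the model for that language with the spacy.load() function.

Loading the English model (the available models are listed at https://spacy.io/models). The model has to be downloaded once before it can be loaded, for example with: python -m spacy download en_core_web_sm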

nlp=spacy.load("en_core_web_sm")
nlp
Output:
  • This returns a Language object that comes ready with multiple built-in capabilities.

2. Creating some documents to perform NLP tasks

# Create a document
my_text = "India (Hindi: Bhārat), officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia. It is the second-most populous country, the seventh-largest country by land area, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia."

my_doc = nlp(my_text)
type(my_doc)
print()
print(my_doc)
Output:
India (Hindi: Bhārat), officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia. It is the second-most populous country, the seventh-largest country by land area, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.
  • Note: I simply took this paragraph from Wikipedia

3. How to perform Tokenization with spaCy?

What is tokenization?

Tokenization is the process of splitting a text into smaller units based on predefined rules. For example, paragraphs are split into sentences, and sentences into words (and, optionally, punctuation).
This is typically the first step for NLP tasks like text classification, sentiment analysis, etc.
Each token in spaCy has attributes that tell us a great deal about it, such as whether the token is punctuation, what its part of speech (POS) is, and what the lemma of the word is. The sections below cover these attributes one by one.
# Printing the tokens of my_doc
for token in my_doc:
  print(token.text)
Output:
India
(
Hindi
:
Bhārat
)
,
officially
the
Republic
of
India
(
Hindi
:
Bhārat
Gaṇarājya),[23
]
is
a
country
in
South
Asia
.
It
is
the
second
-
most
populous
country
,
the
seventh
-
largest
country
by
land
area
,
and
the
most
populous
democracy
in
the
world
.
Bounded
by
the
Indian
Ocean
on
the
south
,
the
Arabian
Sea
on
the
southwest
,
and
the
Bay
of
Bengal
on
the
southeast
,
it
shares
land
borders
with
Pakistan
to
the
west;[f
]
China
,
Nepal
,
and
Bhutan
to
the
north
;
and
Bangladesh
and
Myanmar
to
the
east
.
In
the
Indian
Ocean
,
India
is
in
the
vicinity
of
Sri
Lanka
and
the
Maldives
;
its
Andaman
and
Nicobar
Islands
share
a
maritime
border
with
Thailand
and
Indonesia
.
As we can see, the tokens above also include punctuation and common words like “a”, “the”, “is”, etc. Common words like these usually add little to the meaning of the text. They are called stop words.
So for many tasks we want to remove them.
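To make the idea of rule-based splitting concrete, here is a minimal sketch of a tokenizer built on a single regular expression. It is a toy for illustration only and not spaCy's tokenizer, which applies language-specific prefix, suffix, infix and exception rules; the name simple_tokenize is hypothetical.

```python
import re

# Toy rule: a token is either a run of word characters or a single
# character that is neither a word character nor whitespace
# (in practice, one punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("It is the second-most populous country.")
print(tokens)
```

Even this single rule splits "second-most" into three tokens, as in the spaCy output above, but it has no notion of abbreviations, URLs or contractions, which is where a real tokenizer earns its keep.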

4. How to do Text-Preprocessing with spaCy?

What is the need for Text Preprocessing?
Whatever NLP task you perform, be it classification, sentiment analysis, topic modelling, etc., the quality of the output depends heavily on the quality of the input text.
Stop words and punctuation usually (not always) don’t add value to the meaning of the text and can distort the outcome. It often makes sense to remove them; cleaning the text of unwanted characters also reduces the size of the corpus.
How to identify and remove the stopwords and punctuation?
Tokens in spaCy have attributes that tell you whether a token is a stop word.
The token.is_stop attribute tells you that. Likewise, token.is_punct and token.is_space tell you whether a token is punctuation or white space, respectively.
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
  print(token.text,'--',token.is_stop,'---',token.is_punct)
Output:
India -- False --- False
( -- False --- True
Hindi -- False --- False
: -- False --- True
Bhārat -- False --- False
) -- False --- True
, -- False --- True
officially -- False --- False
the -- True --- False
Republic -- False --- False
of -- True --- False
India -- False --- False
( -- False --- True
Hindi -- False --- False
: -- False --- True
Bhārat -- False --- False
Gaṇarājya),[23 -- False --- False
] -- False --- True
is -- True --- False
a -- True --- False
country -- False --- False
in -- True --- False
South -- False --- False
Asia -- False --- False
. -- False --- True
It -- True --- False
is -- True --- False
the -- True --- False
second -- False --- False
- -- False --- True
most -- True --- False
populous -- False --- False
country -- False --- False
, -- False --- True
the -- True --- False
seventh -- False --- False
- -- False --- True
largest -- False --- False
country -- False --- False
by -- True --- False
land -- False --- False
area -- False --- False
, -- False --- True
and -- True --- False
the -- True --- False
most -- True --- False
populous -- False --- False
democracy -- False --- False
in -- True --- False
the -- True --- False
world -- False --- False
. -- False --- True
Bounded -- False --- False
by -- True --- False
the -- True --- False
Indian -- False --- False
Ocean -- False --- False
on -- True --- False
the -- True --- False
south -- False --- False
, -- False --- True
the -- True --- False
Arabian -- False --- False
Sea -- False --- False
on -- True --- False
the -- True --- False
southwest -- False --- False
, -- False --- True
and -- True --- False
the -- True --- False
Bay -- False --- False
of -- True --- False
Bengal -- False --- False
on -- True --- False
the -- True --- False
southeast -- False --- False
, -- False --- True
it -- True --- False
shares -- False --- False
land -- False --- False
borders -- False --- False
with -- True --- False
Pakistan -- False --- False
to -- True --- False
the -- True --- False
west;[f -- False --- False
] -- False --- True
China -- False --- False
, -- False --- True
Nepal -- False --- False
, -- False --- True
and -- True --- False
Bhutan -- False --- False
to -- True --- False
the -- True --- False
north -- False --- False
; -- False --- True
and -- True --- False
Bangladesh -- False --- False
and -- True --- False
Myanmar -- False --- False
to -- True --- False
the -- True --- False
east -- False --- False
. -- False --- True
In -- True --- False
the -- True --- False
Indian -- False --- False
Ocean -- False --- False
, -- False --- True
India -- False --- False
is -- True --- False
in -- True --- False
the -- True --- False
vicinity -- False --- False
of -- True --- False
Sri -- False --- False
Lanka -- False --- False
and -- True --- False
the -- True --- False
Maldives -- False --- False
; -- False --- True
its -- True --- False
Andaman -- False --- False
and -- True --- False
Nicobar -- False --- False
Islands -- False --- False
share -- False --- False
a -- True --- False
maritime -- False --- False
border -- False --- False
with -- True --- False
Thailand -- False --- False
and -- True --- False
Indonesia -- False --- False
. -- False --- True
Now that we know which tokens are stop words or punctuation, let’s use this information to remove them.
# Removing StopWords and punctuations
cleaned_my_doc = [token for token in my_doc if not token.is_stop and not token.is_punct]

for token in cleaned_my_doc:
  print(token.text)
Output:
India
Hindi
Bhārat
officially
Republic
India
Hindi
Bhārat
Gaṇarājya),[23
country
South
Asia
second
populous
country
seventh
largest
country
land
area
populous
democracy
world
Bounded
Indian
Ocean
south
Arabian
Sea
southwest
Bay
Bengal
southeast
shares
land
borders
Pakistan
west;[f
China
Nepal
Bhutan
north
Bangladesh
Myanmar
east
Indian
Ocean
India
vicinity
Sri
Lanka
Maldives
Andaman
Nicobar
Islands
share
maritime
border
Thailand
Indonesia
You can now see that the cleaned doc contains only tokens that contribute to the meaning in some way.
The computational cost also drops, because there are fewer tokens to process. To gauge the effect of preprocessing, compare the number of tokens before and after cleaning: here the original document has 136 tokens and the cleaned one has 60.
  • Text preprocessing removes more than half of the tokens here, which makes later processing faster and more focused.
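One way to measure the effect of preprocessing is to count tokens before and after filtering. The sketch below is an assumption about how such a check could look, not the notebook's original code. The helper removal_ratio is hypothetical and works on any objects that expose is_stop and is_punct, so it should accept a spaCy Doc such as my_doc; the demo uses simple stand-in objects so that it runs without a model.

```python
from types import SimpleNamespace


def removal_ratio(tokens):
    """Return (total, kept, fraction_removed) for token-like objects."""
    tokens = list(tokens)
    kept = [t for t in tokens if not t.is_stop and not t.is_punct]
    total = len(tokens)
    removed = (total - len(kept)) / total if total else 0.0
    return total, len(kept), removed


def tok(text, is_stop=False, is_punct=False):
    """Stand-in for a spaCy token, carrying only the attributes used here."""
    return SimpleNamespace(text=text, is_stop=is_stop, is_punct=is_punct)


# "It is a country in South Asia." with the flags shown in the output above
sample = [
    tok("It", is_stop=True), tok("is", is_stop=True), tok("a", is_stop=True),
    tok("country"), tok("in", is_stop=True), tok("South"), tok("Asia"),
    tok(".", is_punct=True),
]
total, kept, removed = removal_ratio(sample)
print(f"{total} tokens, {kept} kept, {removed:.1%} removed")
```

With a real document, calling removal_ratio(my_doc) would report the same three numbers for the full paragraph.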

5. How to perform Lemmatization using spacy?

Lemmatization is the process of converting a token to its root/base form.
spaCy provides an easy, built-in way to do this.
After you’ve created the Doc object (by calling nlp()), you can access the root form of every token through the Token.lemma_ attribute.
# Lemmatizing the tokens of new_doc
my_text='she played chess against rita she likes playing chess.'
new_doc=nlp(my_text)
for token in new_doc:
  print(token.lemma_)
Output:
-PRON-
play
chess
against
rita
-PRON-
like
play
chess
.
Note: In the output above, lemma_ returns the placeholder ‘-PRON-’ for pronouns. This is the behaviour of spaCy 2.x (spaCy 3 no longer uses this placeholder), so you might have to handle pronouns explicitly.
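One possible way to handle the pronoun placeholder is to fall back to the lowercased token text whenever the lemma equals -PRON-. This is a sketch of that idea rather than a recommended standard; the helper safe_lemmas is hypothetical and only relies on the text and lemma_ attributes, demonstrated here with stand-in objects instead of a real Doc.

```python
from types import SimpleNamespace


def safe_lemmas(tokens, placeholder="-PRON-"):
    """Return lemmas, replacing the pronoun placeholder with the lowercased text."""
    return [
        t.text.lower() if t.lemma_ == placeholder else t.lemma_
        for t in tokens
    ]


# Stand-ins mirroring the first three tokens and lemmas from the output above
sample = [
    SimpleNamespace(text="she", lemma_="-PRON-"),
    SimpleNamespace(text="played", lemma_="play"),
    SimpleNamespace(text="chess", lemma_="chess"),
]
print(safe_lemmas(sample))
```

With the document from this section, safe_lemmas(new_doc) would apply the same rule to every token.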

6. How to print hash values of strings?

SpaCy hashes or converts each string to a unique ID that is stored in the StringStore.
But, what is StringStore?
It’s a dictionary mapping of hash values to strings, for example 10543432924755684266 –> box
You can print the hash value if you know the string and vice-versa. This is contained in nlp.vocab.strings as shown below.
# Strings to Hashes and Back
doc2 = nlp("I love playing")

# Look up the hash for the word "playing"
word_hash = nlp.vocab.strings["playing"]
print(word_hash)

# Look up the word_hash to get the string
word_string = nlp.vocab.strings[word_hash]
print(word_string)
Output:
13803694918078379268
playing
Interestingly, a word has the same hash value irrespective of which document it occurs in or which spaCy model is being used.
So your results are reproducible even if you run your code on someone else’s machine.
# Creating two different documents with a common word
doc_1 = nlp('Adidas shoes are famous')
doc_2 = nlp('I washed my shoes ')

# Printing the hash value for each token in the doc

print('**********DOC 1**********')
for token in doc_1:
  hash_value=nlp.vocab.strings[token.text]
  print(token.text ,' ',hash_value)

print()
print('**********DOC 2**********')
for token in doc_2:
  hash_value=nlp.vocab.strings[token.text]
  print(token.text ,' ',hash_value)
Output:
**********DOC 1**********
Adidas   2449447890689859070
shoes   2716266617130919512
are   5012629990875267006
famous   17809293829314912000

**********DOC 2**********
I   4690420944186131903
washed   5520327350569975027
my   227504873216781231
shoes   2716266617130919512
Notice that ‘shoes’ has the same hash value irrespective of which document it occurs in. Storing each string once and referring to it by its hash saves memory.
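The two-way lookup can be pictured with a small toy class. This is only an illustration of the idea, not spaCy's StringStore: spaCy uses its own hash function, whereas the sketch below swaps in a 64-bit BLAKE2b digest from the standard library, and the class name ToyStringStore is hypothetical. It shows why string to hash always works (the hash is computed from the string), while hash to string only works for strings the store has already seen.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit hashes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Deterministic 64-bit hash: same string, same value, on any machine
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.hash_of(item)   # string -> hash, always possible
        return self._by_hash[item]      # hash -> string, KeyError if unseen


store = ToyStringStore()
shoes_hash = store.add("shoes")
print(shoes_hash, store[shoes_hash])
```

A second, empty ToyStringStore would compute the same hash for "shoes" but could not map that hash back to the word, which mirrors why the spaCy example above processes a text containing "playing" before looking the hash up.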

7. How to find Lexical attributes of spaCy?

In this section, you will learn about a few more significant lexical attributes.
spaCy provides many useful lexical attributes. These are attributes of the Token object that give you information about the type of token.
For example, you can use the like_num attribute of a token to check whether it looks like a number. Let’s print all the numbers in a text.
# Printing the tokens which are like numbers
text=' Year 2020 is far worse than 2009'
doc=nlp(text)
for token in doc:
  if token.like_num:
    print(token)
Output:
2020
2009
  • That's how lexical attributes work. Other examples include is_alpha, is_digit, like_url and like_email, all accessed as attributes of a token in the same way.
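To give a feel for what a check like like_num might involve, here is a rough standalone approximation. It is not spaCy's implementation; the function looks_like_number and the small NUMBER_WORDS set are hypothetical and cover only digits, simple decimals, simple fractions and a handful of number words.

```python
# Small, illustrative subset of number words (an assumption, not spaCy's list)
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "hundred", "thousand", "million",
}


def looks_like_number(text):
    """Return True if text resembles a number (rough approximation)."""
    # Ignore thousands separators and at most one decimal point
    cleaned = text.replace(",", "").replace(".", "", 1)
    if cleaned.isdigit():
        return True
    # Simple fractions such as 1/2
    if text.count("/") == 1:
        numerator, denominator = text.split("/")
        if numerator.isdigit() and denominator.isdigit():
            return True
    return text.lower() in NUMBER_WORDS


words = "Year 2020 is far worse than 2009".split()
print([w for w in words if looks_like_number(w)])
```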

8. Part of Speech analysis with spaCy

Let us consider a sentence, “Shashank likes playing football”.
Here, Shashank is a noun, and likes and playing are verbs. Likewise, each word of a text is a noun, pronoun, verb, conjunction, etc. These labels are called part-of-speech (POS) tags.
How to identify the part of speech of the words in a text document?
It is stored in the pos_ attribute of each token.
# POS tagging using spaCy
my_text='John plays basketball,if time permits. He played in high school too.'
my_doc=nlp(my_text)
for token in my_doc:
  print(token.text,'---- ',token.pos_)
Output:
John ----  PROPN
plays ----  VERB
basketball ----  NOUN
, ----  PUNCT
if ----  SCONJ
time ----  NOUN
permits ----  VERB
. ----  PUNCT
He ----  PRON
played ----  VERB
in ----  ADP
high ----  ADJ
school ----  NOUN
too ----  ADV
. ----  PUNCT
From the above output, you can see the POS tag against each word, like VERB, ADJ, etc.
What if you don’t know what the tag SCONJ means?
The spacy.explain() function gives you the description or full form of a tag.
spacy.explain('SCONJ')
Output:
'subordinating conjunction'
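Once every token has a tag, a common next step is to summarize how often each tag occurs. The sketch below does this with collections.Counter on the (text, tag) pairs copied from the output above, so it runs without a model; with a real document, the pairs could be built as [(token.text, token.pos_) for token in my_doc]. The helper name pos_counts is hypothetical.

```python
from collections import Counter


def pos_counts(tagged_tokens):
    """Count how often each POS tag occurs in (text, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


# (text, pos_) pairs copied from the POS tagging output above
tagged = [
    ("John", "PROPN"), ("plays", "VERB"), ("basketball", "NOUN"),
    (",", "PUNCT"), ("if", "SCONJ"), ("time", "NOUN"), ("permits", "VERB"),
    (".", "PUNCT"), ("He", "PRON"), ("played", "VERB"), ("in", "ADP"),
    ("high", "ADJ"), ("school", "NOUN"), ("too", "ADV"), (".", "PUNCT"),
]

counts = pos_counts(tagged)
for tag, count in counts.most_common():
    print(tag, count)
```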

9. Named Entity Recognition using spacy

Have a look at this text: “Tom works at Apple”. Here, “Tom” and “Apple” are the names of a person and a company. Such words are referred to as named entities. They are real-world objects like the name of a company, a place, etc.
How can you find all the named entities in a text?
Using the ents attribute of a spaCy document, you can access all the named entities present in the text.
# Preparing the spaCy document
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

# Printing the named entities
print(doc.ents)
Output:
(Tony Stark, StarkEnterprises, Emily Clark, Microsoft, Manchester, Bible, French)
You can see all the named entities printed.
But is this the complete information? No.
Each named entity belongs to a category, like the name of a person, an organization, or a city. The common named entity categories supported by spaCy are:
  • PERSON : Denotes names of people
  • GPE : Denotes places like countries, cities, states.
  • ORG : Denotes organizations or companies
  • WORK_OF_ART : Denotes titles of books, films, songs and other works of art
  • PRODUCT : Denotes products such as vehicles, food items, furniture and so on.
  • EVENT : Denotes historical events like wars, disasters, etc.
  • LANGUAGE : Denotes any named language.
  • NORP : Denotes nationalities or religious or political groups.
How can you find out which named entity category a given text belongs to?
You can access it through the .label_ attribute of each entity. The code below prints the label of each named entity.
# Printing labels of entities.
for entity in doc.ents:
  print(entity.text,'--- ',entity.label_)
Output:
Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  NORP
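A flat list of entities is often easier to use once it is grouped by label. The sketch below groups the (text, label) pairs copied from the output above, so it runs without a model; with a real document, the pairs could be built as [(ent.text, ent.label_) for ent in doc.ents]. The helper name group_by_label is hypothetical.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# (text, label_) pairs copied from the NER output above
entities = [
    ("Tony Stark", "PERSON"), ("StarkEnterprises", "ORG"),
    ("Emily Clark", "PERSON"), ("Microsoft", "ORG"),
    ("Manchester", "GPE"), ("Bible", "WORK_OF_ART"), ("French", "NORP"),
]

grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```

Grouping also makes surprises easy to spot: here “French” lands under NORP rather than LANGUAGE, a reminder that the labels are the model's predictions and are worth checking.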
spaCy also provides a built-in visualizer for NER called displacy. Pass style='ent' to the displacy.render() function to visualize the entities.
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)
Output:
[Figure: the text rendered with each named entity highlighted and tagged with its label]

10. Conclusion

These are some of the most common NLP tasks required when working on any large real-world NLP project.
I hope you enjoyed this project and now know the most common NLP tasks, how they work, and how you can implement them in Python.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.
Thanks for reading…
Happy Learning…
