NLP - Word Tokenization with Python

Shashank Shanu

9 months ago

Word Tokenization with Python
Word Tokenization with Python

What is NLP?

Natural Language Processing, or NLP for short, is a process of converting human languages into something which a machine can understand so that we can process it and get some useful outputs.
Here I will try to take you through one of the processes used by the NLTK package to perform NLP.

Now, what is NLTK?

Natural Language The toolkit is a framework used for natural language processing works.
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries.
Here I will take you through the word tokenization technique.
Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis like classifying and counting them for a particular sentiment etc. The Natural Language Tool kit (NLTK) is a library used to achieve this. Install NLTK before proceeding with the python program for word tokenization.
Python program for word tokenization
Python program for word tokenization

Lets we how we can do it in python

We can install nltk as
conda install -c anaconda nltk
Next, we use the word tokenization method to split the paragraph into individual words.
Let’s see it with an example
import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)
When we execute the above code, it produces the following result.
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']

Tokenizing Sentences

We can also tokenize the sentences in a paragraph like we tokenized the words. We use the method sent tokenize to achieve this. Below is an example.
import nltk
sentence_data = "Sun rises in the east. Sun sets in the west."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)
When we execute the above code, it produces the following result.
['Sun rises in the east.', 'Sun sets in the west.']

Why Tokenization is so important?

Tokenization is the first step to proceed with NLP. And, it is important because by this, words are identified, demarcated and classified.
This not only permits to further steps in processing but this allows the computer to deal with words in a much manageable form (internally each word is represented by code and, remember, the computer is a number oriented device).
I hope you enjoyed reading this article and finally, you came to know about NLP - Word Tokenization with Python.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.
Thanks for reading…
Happy Learning…

Submit Review