Counting Tokens in a Paragraph

Nithya Rekha

a year ago

First, what is meant by a token?
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence.
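To make the difference between tokens and types concrete, here is a minimal sketch. The sample sentence is hypothetical and is not taken from the file used later in this article; whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical sample sentence, used only to illustrate tokens vs. types.
text = "the game starts and the game ends"

tokens = text.split()   # every occurrence counts as one token
types = set(tokens)     # each distinct character sequence counts once

print("Tokens:", len(tokens))  # 7
print("Types:", len(types))    # 5
```

Here "the" and "game" each occur twice, so they contribute four tokens but only two types.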
Now let us see how to find the tokens in a paragraph or a document.
When reading text from a source, we sometimes need statistics about the words used in it, which makes it necessary to count the number of words or the number of lines the text contains. In the example below we will see how to count the tokens present in a paragraph. For this purpose, we will read the text from a file.
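Before working with a file, the idea of counting lines and words can be sketched on a small in-memory string. The `sample` text below is made up for illustration, and "word" here simply means a whitespace-separated chunk.

```python
# Made-up sample text with one blank line, for illustration only.
sample = (
    "First line of text.\n"
    "Second line, a bit longer.\n"
    "\n"
    "Fourth line after a blank one."
)

lines = sample.splitlines()                           # split on line breaks
non_empty = [line for line in lines if line.strip()]  # drop blank lines
words = sample.split()                                # split on any whitespace

print("Lines:", len(lines))                # 4
print("Non-empty lines:", len(non_empty))  # 3
print("Words:", len(words))                # 15
```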
Reading a File
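# Path to the text file we want to analyse
FileName = "F:\\INSIDEAIML\\moa_summary.txt"

# Open the file in read mode and load its whole content into one string
with open(FileName, 'r') as file:
    lines_in_file = file.read()
    print(lines_in_file)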
When we run the code above, we get the following output:
Memories of the Alhambra deals with augmented reality games that alter what a person experiences around them.  A CEO (Hyun Bin) travels to Spain seeking to develop an AR game and stays at the hostel of a woman (Park Shin Hye). But things take a surprising turn as unexpected things begin to happen because of the game.

It’s time to head to Spain for my next drama which was the very anticipated Memories of the Alhambra. I like sci-fi type stories, so the AR concept was right up my alley. I’ve previously enjoyed shows that deal with virtual reality like the anime Sword Are Online and the Chinese drama Love O20. This drama is in the same vein, mind you with some differences because of the AR focus.

We have Hyun Bin as Yoo Jin Woo. He’s the CEO hoping to get the new AR game developed, and he is willing to do whatever it takes make it happen. Initially, he’s thrilled with the immersive experience of the game, but soon, things get a little too real as the game starts to take over his life.

Park Shin Hye plays Jung Hee Joo. She runs a hostel in Spain. Her brother is the inventor of the AR game, but she has no idea about it. When Jin Woo stays at her hostel, she finds herself becoming more and more involved with him as strange things begin to happen.

There was a lot of hype leading up to Memories of the Alhambra. So did it live up? For me, it did. But there are definitely some things to consider before diving in. First, the pluses. It definitely has some unique qualities that set it apart from other dramas. A good chunk of the show is set on location in Spain. I loved this! It gives us tons of gorgeous scenic shots, and it really just has a unique atmosphere that I thoroughly enjoyed.

The concept of Augmented Reality games is also a very different topic. The drama uses a lot of special effects to really give us a window into the experience of users in the game. It makes use of swords, guns, and a variety of in-game characters. There are some pretty awesome sword battles as well as plenty of other action too.
Using the NLTK library to count the words
Here we use the NLTK module to count the words present in the text. NLTK treats punctuation marks such as ',' and '.' as separate tokens, so we get a larger number of tokens.
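import nltk

# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing,
# run nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').

FileName = "F:\\INSIDEAIML\\moa_summary.txt"

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    # Split the text into word and punctuation tokens
    nltk_tokens = nltk.word_tokenize(lines_in_file)
    print(nltk_tokens)
    print("Number of Words: " , len(nltk_tokens))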

If we execute the code above, we get the following output:
['Memories', 'of', 'the', 'Alhambra', 'deals', 'with', 'augmented', 'reality', 'games', 'that', 'alter', 'what', 'a', 'person', 'experiences', 'around', 'them', '.', 'A', 'CEO', '(', 'Hyun', 'Bin', ')', 'travels', 'to', 'Spain', 'seeking', 'to', 'develop', 'an', 'AR', 'game', 'and', 'stays', 'at', 'the', 'hostel', 'of', 'a', 'woman', '(', 'Park', 'Shin', 'Hye', ')', '.', 'But', 'things', 'take', 'a', 'surprising', 'turn', 'as', 'unexpected', 'things', 'begin', 'to', 'happen', 'because', 'of', 'the', 'game', '.', 'It’s', 'time', 'to', 'head', 'to', 'Spain', 'for', 'my', 'next', 'drama', 'which', 'was', 'the', 'very', 'anticipated', 'Memories', 'of', 'the', 'Alhambra', '.', 'I', 'like', 'sci-fi', 'type', 'stories', ',', 'so', 'the', 'AR', 'concept', 'was', 'right', 'up', 'my', 'alley', '.', 'I’ve', 'previously', 'enjoyed', 'shows', 'that', 'deal', 'with', 'virtual', 'reality', 'like', 'the', 'anime', 'Sword', 'Are', 'Online', 'and', 'the', 'Chinese', 'drama', 'Love', 'O20', '.', 'This', 'drama', 'is', 'in', 'the', 'same', 'vein', ',', 'mind', 'you', 'with', 'some', 'differences', 'because', 'of', 'the', 'AR', 'focus', '.', 'We', 'have', 'Hyun', 'Bin', 'as', 'Yoo', 'Jin', 'Woo', '.', 'He’s', 'the', 'CEO', 'hoping', 'to', 'get', 'the', 'new', 'AR', 'game', 'developed', ',', 'and', 'he', 'is', 'willing', 'to', 'do', 'whatever', 'it', 'takes', 'make', 'it', 'happen', '.', 'Initially', ',', 'he’s', 'thrilled', 'with', 'the', 'immersive', 'experience', 'of', 'the', 'game', ',', 'but', 'soon', ',', 'things', 'get', 'a', 'little', 'too', 'real', 'as', 'the', 'game', 'starts', 'to', 'take', 'over', 'his', 'life', '.', 'Park', 'Shin', 'Hye', 'plays', 'Jung', 'Hee', 'Joo', '.', 'She', 'runs', 'a', 'hostel', 'in', 'Spain', '.', 'Her', 'brother', 'is', 'the', 'inventor', 'of', 'the', 'AR', 'game', ',', 'but', 'she', 'has', 'no', 'idea', 'about', 'it', '.', 'When', 'Jin', 'Woo', 'stays', 'at', 'her', 'hostel', ',', 'she', 'finds', 'herself', 'becoming', 'more', 'and', 'more', 
'involved', 'with', 'him', 'as', 'strange', 'things', 'begin', 'to', 'happen', '.', 'There', 'was', 'a', 'lot', 'of', 'hype', 'leading', 'up', 'to', 'Memories', 'of', 'the', 'Alhambra', '.', 'So', 'did', 'it', 'live', 'up', '?', 'For', 'me', ',', 'it', 'did', '.', 'But', 'there', 'are', 'definitely', 'some', 'things', 'to', 'consider', 'before', 'diving', 'in', '.', 'First', ',', 'the', 'pluses', '.', 'It', 'definitely', 'has', 'some', 'unique', 'qualities', 'that', 'set', 'it', 'apart', 'from', 'other', 'dramas', '.', 'A', 'good', 'chunk', 'of', 'the', 'show', 'is', 'set', 'on', 'location', 'in', 'Spain', '.', 'I', 'loved', 'this', '!', 'It', 'gives', 'us', 'tons', 'of', 'gorgeous', 'scenic', 'shots', ',', 'and', 'it', 'really', 'just', 'has', 'a', 'unique', 'atmosphere', 'that', 'I', 'thoroughly', 'enjoyed', '.', 'The', 'concept', 'of', 'Augmented', 'Reality', 'games', 'is', 'also', 'a', 'very', 'different', 'topic', '.', 'The', 'drama', 'uses', 'a', 'lot', 'of', 'special', 'effects', 'to', 'really', 'give', 'us', 'a', 'window', 'into', 'the', 'experience', 'of', 'users', 'in', 'the', 'game', '.', 'It', 'makes', 'use', 'of', 'swords', ',', 'guns', ',', 'and', 'a', 'variety', 'of', 'in-game', 'characters', '.', 'There', 'are', 'some', 'pretty', 'awesome', 'sword', 'battles', 'as', 'well', 'as', 'plenty', 'of', 'other', 'action', 'too', '.']
Number of Words:  427
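Because the NLTK output above includes punctuation marks as tokens, the count is higher than the number of actual words. One possible way to count only the words is to drop the tokens that consist entirely of punctuation. The sketch below uses a short hard-coded token list in the same shape as the NLTK output, and `is_punctuation` is a helper introduced here for illustration, not part of NLTK.

```python
import string

# A short token list in the same shape as the NLTK output above.
tokens = ['A', 'CEO', '(', 'Hyun', 'Bin', ')', 'travels', 'to', 'Spain', '.']

def is_punctuation(token):
    """Return True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Keep only the tokens that contain at least one non-punctuation character.
words_only = [t for t in tokens if not is_punctuation(t)]

print(words_only)
print("Tokens:", len(tokens))      # 10
print("Words:", len(words_only))   # 7
```

Tokens such as 'sci-fi' are kept, because only tokens made up entirely of punctuation are removed.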
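We can also solve this task with Python's built-in split() method. It splits the text only on whitespace, so punctuation stays attached to the neighbouring word: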
FileName = "F:\\INSIDEAIML\\moa_summary.txt"

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    # Split on whitespace once and reuse the result
    words = lines_in_file.split()
    print(words)
    print("\n")
    print("Number of Words: ", len(words))
If we run the program above, we get the following output:
['Memories', 'of', 'the', 'Alhambra', 'deals', 'with', 'augmented', 'reality', 'games', 'that', 'alter', 'what', 'a', 'person', 'experiences', 'around', 'them.', 'A', 'CEO', '(Hyun', 'Bin)', 'travels', 'to', 'Spain', 'seeking', 'to', 'develop', 'an', 'AR', 'game', 'and', 'stays', 'at', 'the', 'hostel', 'of', 'a', 'woman', '(Park', 'Shin', 'Hye).', 'But', 'things', 'take', 'a', 'surprising', 'turn', 'as', 'unexpected', 'things', 'begin', 'to', 'happen', 'because', 'of', 'the', 'game.', 'It’s', 'time', 'to', 'head', 'to', 'Spain', 'for', 'my', 'next', 'drama', 'which', 'was', 'the', 'very', 'anticipated', 'Memories', 'of', 'the', 'Alhambra.', 'I', 'like', 'sci-fi', 'type', 'stories,', 'so', 'the', 'AR', 'concept', 'was', 'right', 'up', 'my', 'alley.', 'I’ve', 'previously', 'enjoyed', 'shows', 'that', 'deal', 'with', 'virtual', 'reality', 'like', 'the', 'anime', 'Sword', 'Are', 'Online', 'and', 'the', 'Chinese', 'drama', 'Love', 'O20.', 'This', 'drama', 'is', 'in', 'the', 'same', 'vein,', 'mind', 'you', 'with', 'some', 'differences', 'because', 'of', 'the', 'AR', 'focus.', 'We', 'have', 'Hyun', 'Bin', 'as', 'Yoo', 'Jin', 'Woo.', 'He’s', 'the', 'CEO', 'hoping', 'to', 'get', 'the', 'new', 'AR', 'game', 'developed,', 'and', 'he', 'is', 'willing', 'to', 'do', 'whatever', 'it', 'takes', 'make', 'it', 'happen.', 'Initially,', 'he’s', 'thrilled', 'with', 'the', 'immersive', 'experience', 'of', 'the', 'game,', 'but', 'soon,', 'things', 'get', 'a', 'little', 'too', 'real', 'as', 'the', 'game', 'starts', 'to', 'take', 'over', 'his', 'life.', 'Park', 'Shin', 'Hye', 'plays', 'Jung', 'Hee', 'Joo.', 'She', 'runs', 'a', 'hostel', 'in', 'Spain.', 'Her', 'brother', 'is', 'the', 'inventor', 'of', 'the', 'AR', 'game,', 'but', 'she', 'has', 'no', 'idea', 'about', 'it.', 'When', 'Jin', 'Woo', 'stays', 'at', 'her', 'hostel,', 'she', 'finds', 'herself', 'becoming', 'more', 'and', 'more', 'involved', 'with', 'him', 'as', 'strange', 'things', 'begin', 'to', 'happen.', 'There', 'was', 'a', 
'lot', 'of', 'hype', 'leading', 'up', 'to', 'Memories', 'of', 'the', 'Alhambra.', 'So', 'did', 'it', 'live', 'up?', 'For', 'me,', 'it', 'did.', 'But', 'there', 'are', 'definitely', 'some', 'things', 'to', 'consider', 'before', 'diving', 'in.', 'First,', 'the', 'pluses.', 'It', 'definitely', 'has', 'some', 'unique', 'qualities', 'that', 'set', 'it', 'apart', 'from', 'other', 'dramas.', 'A', 'good', 'chunk', 'of', 'the', 'show', 'is', 'set', 'on', 'location', 'in', 'Spain.', 'I', 'loved', 'this!', 'It', 'gives', 'us', 'tons', 'of', 'gorgeous', 'scenic', 'shots,', 'and', 'it', 'really', 'just', 'has', 'a', 'unique', 'atmosphere', 'that', 'I', 'thoroughly', 'enjoyed.', 'The', 'concept', 'of', 'Augmented', 'Reality', 'games', 'is', 'also', 'a', 'very', 'different', 'topic.', 'The', 'drama', 'uses', 'a', 'lot', 'of', 'special', 'effects', 'to', 'really', 'give', 'us', 'a', 'window', 'into', 'the', 'experience', 'of', 'users', 'in', 'the', 'game.', 'It', 'makes', 'use', 'of', 'swords,', 'guns,', 'and', 'a', 'variety', 'of', 'in-game', 'characters.', 'There', 'are', 'some', 'pretty', 'awesome', 'sword', 'battles', 'as', 'well', 'as', 'plenty', 'of', 'other', 'action', 'too.']


Number of Words:  383
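The two counts differ (427 with NLTK, 383 with split()) because split() leaves punctuation attached to words, while NLTK emits each punctuation mark as its own token. The sketch below reproduces that effect on a small hypothetical sentence. The regular expression is only a rough stand-in for a punctuation-aware tokenizer, not NLTK's actual algorithm.

```python
import re

# Hypothetical sample sentence, not the file used in this article.
sample = "A CEO (Hyun Bin) travels to Spain. I like sci-fi stories, so it fit!"

# Whitespace splitting: punctuation stays glued to the words.
whitespace_tokens = sample.split()

# Rough punctuation-aware tokenizer (an assumption, not NLTK's rules):
# a word, optionally with hyphenated parts, or a single punctuation mark.
pattern = r"\w+(?:-\w+)*|[^\w\s]"
regex_tokens = re.findall(pattern, sample)

print("split():", len(whitespace_tokens))  # 14
print("regex:  ", len(regex_tokens))       # 19
print("gap:    ", len(regex_tokens) - len(whitespace_tokens))  # 5
```

The gap of 5 equals the number of punctuation marks in the sample: '(', ')', '.', ',' and '!'.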
