# Tokenizing using NLTK
import nltk
# nltk.download('punkt')  # uncomment if the Punkt tokenizer models are not installed

data = """Data science is an inter-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from many structural
and unstructured data. Data science is related to data mining, machine learning
and big data. Data science is a concept to unify statistics, data analysis,
machine learning, domain knowledge and their related methods in order to
understand and analyze actual phenomena with data. It uses techniques and
theories drawn from many fields within the context of mathematics, statistics,
computer science, domain knowledge and information science. Turing award winner
Jim Gray imagined data science as a fourth paradigm of science (empirical,
theoretical, computational and now data-driven) and asserted that everything
about science is changing because of the impact of information technology and
the data deluge"""

# Sentence tokenization
nltk.sent_tokenize(data)
The result is shown below:
['Data science is an inter-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights from many
structural and unstructured data.',
'Data science is related to data mining, machine learning and big data.',
'Data science is a concept to unify statistics, data analysis, machine learning,
domain knowledge and their related methods in order to understand and analyze actual
phenomena with data.',
'It uses techniques and theories drawn from many fields within the context of mathematics,
statistics, computer science, domain knowledge and information science.',
'Turing award winner Jim Gray imagined data science as a fourth paradigm of
science (empirical, theoretical, computational and now data-driven) and asserted
that everything about science is changing because of the impact of information
technology and the data deluge']
# Word tokenization
nltk.word_tokenize(data)
The result is shown below:
['Data', 'science', 'is', 'an', 'inter-disciplinary', 'field', 'that', 'uses',
'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems',
'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'many', 'structural',
'and', 'unstructured', 'data', '.', 'Data', 'science', 'is', 'related', 'to', 'data',
'mining', ',', 'machine', 'learning', 'and', 'big', 'data', '.', 'Data', 'science',
'is', 'a', 'concept', 'to', 'unify', 'statistics', ',', 'data', 'analysis', ',',
'machine', 'learning', ',', 'domain', 'knowledge', 'and', 'their', 'related',
'methods', 'in', 'order', 'to', 'understand', 'and', 'analyze', 'actual', 'phenomena', 'with', 'data',
'.', 'It', 'uses', 'techniques', 'and', 'theories', 'drawn', 'from', 'many',
'fields', 'within', 'the', 'context', 'of', 'mathematics', ',', 'statistics', ',',
'computer', 'science', ',', 'domain', 'knowledge', 'and', 'information', 'science', '.',
'Turing', 'award', 'winner', 'Jim', 'Gray', 'imagined', 'data', 'science', 'as', 'a',
'fourth', 'paradigm', 'of', 'science',
'(', 'empirical', ',', 'theoretical', ',', 'computational', 'and', 'now', 'data-driven', ')',
'and', 'asserted', 'that', 'everything', 'about', 'science', 'is', 'changing',
'because', 'of', 'the', 'impact', 'of', 'information', 'technology', 'and', 'the', 'data', 'deluge']
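Next, part-of-speech (POS) tagging assigns a grammatical label to every token. The tag pairs shown below were presumably produced by a call along these lines (a minimal sketch; the example sentence is taken directly from the output):
import nltk
# nltk.download('averaged_perceptron_tagger')  # needed once for nltk.pos_tag
# POS-tag a short example sentence
nltk.pos_tag(nltk.word_tokenize("He wants to play football"))
The result is shown below: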
[('He', 'PRP'),
('wants', 'VBZ'),
('to', 'TO'),
('play', 'VB'),
('football', 'NN')]
Chunking groups the tagged tokens into phrases using a regular-expression grammar over the POS tags. Here a chunk labelled MN is defined as zero or more proper nouns followed by a noun:
pos = nltk.pos_tag(nltk.word_tokenize("We will see an example of POS tagging."))
# Chunk grammar: zero or more NNP tags followed by an NN tag
my_node = "MN: {<NNP>*<NN>}"
chunk = nltk.RegexpParser(my_node)
result = chunk.parse(pos)
print(result)
(S
We/PRP
will/MD
see/VB
an/DT
(MN example/NN)
of/IN
(MN POS/NNP tagging/NN)
./.)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# print first 10 stop words
stop_words[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
import string
punct = string.punctuation
punct
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
import nltk
import string
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
punct = string.punctuation
data1 = data
clean_data = []
for word in nltk.word_tokenize(data1):
    if word not in punct:
        if word not in stop_words:
            clean_data.append(word)
clean_data
['Data', 'science', 'inter-disciplinary', 'field', 'uses', 'scientific',
'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge',
'insights', 'many', 'structural', 'unstructured', 'data', 'Data', 'science',
'related', 'data', 'mining', 'machine', 'learning', 'big', 'data', 'Data',
'science', 'concept', 'unify', 'statistics', 'data', 'analysis', 'machine',
'learning', 'domain', 'knowledge', 'related', 'methods', 'order', 'understand',
'analyze', 'actual', 'phenomena', 'data', 'It', 'uses', 'techniques', 'theories',
'drawn', 'many', 'fields', 'within', 'context', 'mathematics', 'statistics', 'computer',
'science', 'domain', 'knowledge', 'information', 'science', 'Turing', 'award', 'winner',
'Jim', 'Gray', 'imagined', 'data', 'science', 'fourth', 'paradigm', 'science', 'empirical',
'theoretical', 'computational', 'data-driven', 'asserted', 'everything', 'science',
'changing', 'impact', 'information', 'technology', 'data', 'deluge']
nltk.pos_tag(clean_data)
[('Data', 'NNP'),
('science', 'NN'),
('inter-disciplinary', 'JJ'),
('field', 'NN'),
('uses', 'VBZ'),
('scientific', 'JJ'),
('methods', 'NNS'),
('processes', 'VBZ'),
('algorithms', 'JJ'),
('systems', 'NNS'),
('extract', 'JJ'),
('knowledge', 'NNP'),
('insights', 'NNS'),
('many', 'JJ'),
('structural', 'JJ'),
('unstructured', 'JJ'),
('data', 'NNS'),
('Data', 'NNS'),
('science', 'NN'),
('related', 'VBN'),
('data', 'NNS'),
('mining', 'NN'),
('machine', 'NN'),
('learning', 'VBG'),
('big', 'JJ'),
('data', 'NNS'),
('Data', 'NNP'),
('science', 'NN'),
('concept', 'NN'),
('unify', 'JJ'),
('statistics', 'NNS'),
('data', 'NNS'),
('analysis', 'NN'),
('machine', 'NN'),
('learning', 'VBG'),
('domain', 'NN'),
('knowledge', 'NN'),
('related', 'VBN'),
('methods', 'NNS'),
('order', 'NN'),
('understand', 'VBP'),
('analyze', 'NN'),
('actual', 'JJ'),
('phenomena', 'NN'),
('data', 'NNS'),
('It', 'PRP'),
('uses', 'VBZ'),
('techniques', 'NNS'),
('theories', 'NNS'),
('drawn', 'VBP'),
('many', 'JJ'),
('fields', 'NNS'),
('within', 'IN'),
('context', 'NN'),
('mathematics', 'NNS'),
('statistics', 'NNS'),
('computer', 'NN'),
('science', 'NN'),
('domain', 'NN'),
('knowledge', 'NN'),
('information', 'NN'),
('science', 'NN'),
('Turing', 'NNP'),
('award', 'NN'),
('winner', 'NN'),
('Jim', 'NNP'),
('Gray', 'NNP'),
('imagined', 'VBD'),
('data', 'NNS'),
('science', 'NN'),
('fourth', 'JJ'),
('paradigm', 'NN'),
('science', 'NN'),
('empirical', 'JJ'),
('theoretical', 'JJ'),
('computational', 'JJ'),
('data-driven', 'JJ'),
('asserted', 'VBD'),
('everything', 'NN'),
('science', 'NN'),
('changing', 'VBG'),
('impact', 'JJ'),
('information', 'NN'),
('technology', 'NN'),
('data', 'NNS'),
('deluge', 'NN')]
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer, SnowballStemmer
lancaster = LancasterStemmer()
porter = PorterStemmer()
Snowball = SnowballStemmer("english")
print('Porter stemmer')
print(porter.stem("hobby"))
print(porter.stem("hobbies"))
print(porter.stem("computer"))
print(porter.stem("computation"))
print("------------------------------------")
print('lancaster stemmer')
print(lancaster.stem("hobby"))
print(lancaster.stem("hobbies"))
print(lancaster.stem("computer"))
print(lancaster.stem("computation"))
print("------------------------------------")
print('Snowball stemmer')
print(Snowball.stem("hobby"))
print(Snowball.stem("hobbies"))
print(Snowball.stem("computer"))
print(Snowball.stem("computation"))
Porter stemmer
hobbi
hobbi
comput
comput
------------------------------------
lancaster stemmer
hobby
hobby
comput
comput
------------------------------------
Snowball stemmer
hobbi
hobbi
comput
comput
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("play"))
play
play
play
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
print(lemma.lemmatize('playing'))
print(lemma.lemmatize('plays'))
print(lemma.lemmatize('play'))
playing
play
play
print(lemma.lemmatize('playing',pos='v'))
print(lemma.lemmatize('plays',pos='v'))
print(lemma.lemmatize('play',pos='v'))
play
play
play
text_data = ("India, officially the Republic of India, "
             "is a country in South Asia.")
# nltk.download('maxent_ne_chunker'); nltk.download('words')  # needed once for ne_chunk
words = nltk.word_tokenize(text_data)
pos_tag = nltk.pos_tag(words)
namedEntity = nltk.ne_chunk(pos_tag)
print(namedEntity)
namedEntity.draw()  # opens the tree in a separate GUI window
(S
(GPE India/NNP)
,/,
officially/RB
the/DT
(ORGANIZATION Republic/NNP)
of/IN
(GPE India/NNP)
,/,
is/VBZ
a/DT
country/NN
in/IN
(GPE South/NNP Asia/NNP)
./.)
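NLTK also ships a large family of parsers in its nltk.parse package. The listing below appears to come from inspecting that package with Python's built-in dir(); a minimal sketch of the call (assuming only that nltk is installed) would be:
import nltk.parse
# List the parser classes and submodules exposed by nltk.parse
dir(nltk.parse)
The result is shown below: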
['BllipParser',
'BottomUpChartParser',
'BottomUpLeftCornerChartParser',
'BottomUpProbabilisticChartParser',
'ChartParser',
'CoreNLPDependencyParser',
'CoreNLPParser',
'DependencyEvaluator',
'DependencyGraph',
'EarleyChartParser',
'FeatureBottomUpChartParser',
'FeatureBottomUpLeftCornerChartParser',
'FeatureChartParser',
'FeatureEarleyChartParser',
'FeatureIncrementalBottomUpChartParser',
'FeatureIncrementalBottomUpLeftCornerChartParser',
'FeatureIncrementalChartParser',
'FeatureIncrementalTopDownChartParser',
'FeatureTopDownChartParser',
'IncrementalBottomUpChartParser',
'IncrementalBottomUpLeftCornerChartParser',
'IncrementalChartParser',
'IncrementalLeftCornerChartParser',
'IncrementalTopDownChartParser',
'InsideChartParser',
'LeftCornerChartParser',
'LongestChartParser',
'MaltParser',
'NaiveBayesDependencyScorer',
'NonprojectiveDependencyParser',
'ParserI',
'ProbabilisticNonprojectiveParser',
'ProbabilisticProjectiveDependencyParser',
'ProjectiveDependencyParser',
'RandomChartParser',
'RecursiveDescentParser',
'ShiftReduceParser',
'SteppingChartParser',
'SteppingRecursiveDescentParser',
'SteppingShiftReduceParser',
'TestGrammar',
'TopDownChartParser',
'TransitionParser',
'UnsortedChartParser',
'ViterbiParser',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'api',
'bllip',
'chart',
'corenlp',
'dependencygraph',
'earleychart',
'evaluate',
'extract_test_sentences',
'featurechart',
'load_parser',
'malt',
'nonprojectivedependencyparser',
'pchart',
'projectivedependencyparser',
'recursivedescent',
'shiftreduce',
'transitionparser',
'util',
'viterbi']
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "slept" | "walked"
NP -> "Rahul" | "Anjali" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
text = "Rahul saw Anjali with a dog".split()
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse(text):
    print(tree)
    tree.draw()
(S
(NP Rahul)
(VP (V saw) (NP Anjali) (PP (P with) (NP (Det a) (N dog)))))