How do you extract features from text data using the bag-of-words model?
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. It represents text as numerical feature vectors.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those known words.
It is called a bag of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.
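To make the point about word order concrete, here is a minimal pure-Python sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration and are not part of any library. It builds a vocabulary from two sentences that use the same words in a different order and shows that both map to the same count vector.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence (hypothetical helper)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the dog chased the man", "the man chased the dog"]

# Vocabulary: the sorted set of all known words across the documents.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors[0])
print(vectors[1])
```

The two sentences mean different things, yet their vectors are identical, which is exactly the information a bag-of-words representation gives up.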
The idea behind the bag-of-words model is quite simple, as the Keras example that follows shows.
Implement Bag of Words using Python Keras
from keras.preprocessing.text import Tokenizer

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Fit a Tokenizer on the text
model = Tokenizer()
model.fit_on_texts(text)

# Show the learned vocabulary, most frequent word first
print(f'Key : {list(model.word_index.keys())}')

# Get Bag of Words representation
rep = model.texts_to_matrix(text, mode='count')
print(rep)
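Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]

Each row corresponds to one sentence. Columns 1 to 9 count the words of the key in order; column 0 is always zero because the Tokenizer reserves index 0 and never assigns it to a word.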
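For readers who want to see what the count matrix contains without installing Keras, the following is a dependency-free sketch that reproduces the same key and matrix for this small corpus. It is not how the Tokenizer is implemented: it assumes plain lowercasing and whitespace splitting (the Tokenizer also filters punctuation by default), and the names `tokenized`, `totals`, and `matrix` are chosen here for illustration.

```python
from collections import Counter

texts = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Lowercase and split on whitespace (a simplification of the Tokenizer's filtering).
tokenized = [t.lower().split() for t in texts]

# Count words across the corpus; ties keep first-seen order.
totals = Counter(word for doc in tokenized for word in doc)

# The most frequent word gets index 1; index 0 is left unused, as in the output above.
word_index = {word: i for i, (word, _) in enumerate(totals.most_common(), start=1)}

# Build one row of counts per sentence.
matrix = []
for doc in tokenized:
    row = [0] * (len(word_index) + 1)
    for word in doc:
        row[word_index[word]] += 1
    matrix.append(row)

print(list(word_index))
for row in matrix:
    print(row)
```

The third row has a 2 in the column for 'the' because that word appears twice in the last sentence; with `mode='binary'` in Keras the same cell would be 1.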