A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, for example with machine learning algorithms. It represents each piece of text as a numerical feature vector.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of those known words.
It is called a bag of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.
The idea behind the bag-of-words model is quite simple: build a vocabulary from all the unique words in the corpus, then score each document by how often each vocabulary word appears in it.
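Before turning to Keras, the two steps above can be sketched in plain Python with no libraries at all. This is a minimal illustration, not a production implementation; the function names build_vocabulary and to_count_vector are just illustrative.

```python
# A minimal bag-of-words sketch in plain Python, illustrating the two
# steps: build a vocabulary, then count word occurrences per document.

def build_vocabulary(docs):
    """Collect the unique (lowercased) words across all documents,
    in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def to_count_vector(doc, vocab):
    """Score a document as the count of each vocabulary word."""
    words = doc.lower().split()
    return [words.count(term) for term in vocab]

docs = ['There was a man', 'The man had a dog', 'The dog and the man walked']
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

print(vocab)    # vocabulary in order of first appearance
print(vectors)  # one count vector per document
```

Note that real tokenizers also strip punctuation and handle other normalization; simple whitespace splitting is enough for this toy corpus.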
Implementing Bag of Words in Python with Keras
from keras.preprocessing.text import Tokenizer

text = ['There was a man', 'The man had a dog', 'The dog and the man walked']

Fit a Tokenizer on the text:
model = Tokenizer()
model.fit_on_texts(text)

Get the bag-of-words representation:
rep = model.texts_to_matrix(text, mode='count')
print(rep)
Output:

Vocabulary order (from model.word_index): ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']

[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]

Note that the Tokenizer reserves index 0, so the first column of the matrix is always zero and the remaining columns follow the vocabulary order above.
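To see where that column ordering comes from, the Tokenizer indexes words by descending corpus frequency, with ties kept in order of first appearance. Here is a sanity check using only the standard library (a sketch assuming Python 3.7+ dict ordering, which both Counter and sorted's stability rely on):

```python
# Reconstruct the Keras Tokenizer's column ordering with the stdlib:
# words are ranked by descending frequency; sorted() is stable, so
# ties keep their first-seen order, matching word_index assignment.
from collections import Counter

text = ['There was a man', 'The man had a dog', 'The dog and the man walked']

counts = Counter()
for doc in text:
    counts.update(doc.lower().split())

key = [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
print(key)  # ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']

# Rebuild the count matrix: a reserved 0th column, then one count
# per vocabulary word, reproducing texts_to_matrix(mode='count').
matrix = [[0] + [doc.lower().split().count(w) for w in key] for doc in text]
print(matrix)
```

Matching the Keras output this way makes the representation easy to port to other tools; only the float formatting differs.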