What is a Bag-of-Words Model?

By Albert, 2 months ago

How do you extract features from text data using a bag-of-words model?

1 Answer

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, for example with machine learning algorithms. It represents text as numerical feature vectors.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur.
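
To make the point about discarded word order concrete, here is a minimal sketch in plain Python. The helper name bag_of_words and the two sample sentences are made up for illustration, and tokenization is assumed to be simple lowercasing plus splitting on whitespace.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens; word positions are not kept."""
    return Counter(sentence.lower().split())

# Hypothetical sample sentences: same words, different order.
a = bag_of_words("the dog chased the man")
b = bag_of_words("the man chased the dog")

print(a == b)    # the two bags compare equal even though the sentences differ
print(a["the"])  # how many times 'the' occurs in the first sentence
```

The two sentences mean different things, yet they produce the same bag, which is exactly the information this representation gives up.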

The idea behind the bag-of-words model is quite simple and can be summarized as follows:

  1. We create a vocabulary of unique tokens, for example, the words from the entire set of documents.
  2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.
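
The two steps above can be sketched in plain Python without any library. This is only an illustration under simple assumptions: the function names build_vocabulary and vectorize are invented here, tokens are lowercase and whitespace-separated, and vocabulary indices follow first-seen order.

```python
from collections import Counter

def build_vocabulary(documents):
    """Step 1: collect unique tokens across all documents, in first-seen order."""
    vocabulary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def vectorize(document, vocabulary):
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector

docs = ["There was a man", "The man had a dog", "The dog and the man walked"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(list(vocab))
for v in vectors:
    print(v)
```

Every document ends up as a vector of the same length (one slot per vocabulary word), which is what lets the vectors be fed to a machine learning algorithm.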

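Implementing Bag of Words in Python with Keras

# Legacy Keras preprocessing API; newer Keras releases may not include this module
from keras.preprocessing.text import Tokenizer

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

Fit a Tokenizer on the text

model = Tokenizer()
model.fit_on_texts(text)  # builds the word index; must run before texts_to_matrix

Get Bag of Words representation

rep = model.texts_to_matrix(text, mode='count')

print(f'Key : {list(model.word_index.keys())}')
print(rep)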

Output:

Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]
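
The matrix has ten columns for nine words, and the key is ordered by frequency rather than by first appearance. The plain-Python sketch below reproduces the same numbers under two assumptions about how the tokenizer behaves: words are ranked by total count across all texts (ties keep first-seen order), and indices start at 1 so that column 0 stays unused. It imitates that behavior for illustration and is not the Keras implementation.

```python
from collections import Counter

docs = ["There was a man", "The man had a dog", "The dog and the man walked"]
tokenized = [d.lower().split() for d in docs]

# Count every token over the whole corpus; Counter keeps first-seen order.
totals = Counter(token for tokens in tokenized for token in tokens)

# Rank by total frequency (the stable sort keeps first-seen order for ties)
# and number the words from 1, leaving index 0 unused.
ranked = sorted(totals, key=lambda word: -totals[word])
word_index = {word: i for i, word in enumerate(ranked, start=1)}

matrix = []
for tokens in tokenized:
    row = [0.0] * (len(word_index) + 1)  # +1 for the unused column 0
    for token in tokens:
        row[word_index[token]] += 1.0
    matrix.append(row)

print(list(word_index))
for row in matrix:
    print(row)
```

Read this way, column 1 is 'man', column 2 is 'the', and so on through the key, while the first column is always zero.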
