What is a Bag-of-Words Model?

By Albert, 2 years ago

How do I extract features from text data using a Bag-of-Words model?

Bag-of-words
1 Answer

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. It represents each piece of text as a numerical feature vector.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.
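As a rough illustration of the second ingredient, the "measure of presence" could be a 0/1 flag, a raw count, or a relative frequency. The sketch below assumes a small hand-picked vocabulary and a hypothetical helper named `score_document`; neither comes from any library, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import Counter

# Hypothetical fixed vocabulary, chosen only for illustration.
vocabulary = ["the", "dog", "man", "cat"]

def score_document(document, vocabulary):
    """Return binary, count, and relative-frequency vectors for one document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    count_vec = [counts[word] for word in vocabulary]
    binary_vec = [1 if c > 0 else 0 for c in count_vec]
    # Guard against an empty document to avoid dividing by zero.
    total = len(tokens) or 1
    freq_vec = [c / total for c in count_vec]
    return binary_vec, count_vec, freq_vec

binary_vec, count_vec, freq_vec = score_document("The dog and the man walked", vocabulary)
print(binary_vec)  # [1, 1, 1, 0]
print(count_vec)   # [2, 1, 1, 0]
print(freq_vec)
```

Words outside the vocabulary ("and", "walked") are simply ignored, which is what "known words" implies.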

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.
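A small sketch of what discarding order means in practice: two sentences with different meanings can end up with exactly the same bag. The sentences and the helper `bag` are made up for this illustration.

```python
from collections import Counter

def bag(document):
    """Map a document to its word counts, ignoring word order."""
    return Counter(document.lower().split())

first = bag("the dog chased the man")
second = bag("the man chased the dog")

# Different sentences, identical bags: the ordering information is gone.
print(first == second)  # True
```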


The idea behind the bag-of-words model is quite simple and can be summarized as follows:

  1. We create a vocabulary of unique tokens, for example the words from the entire set of documents.
  2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.
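The two steps above can be sketched in plain Python. This is a minimal version that assumes whitespace tokenization and lowercasing; the two example documents are invented, and sorting the vocabulary is just one way to get a stable column order.

```python
from collections import Counter

# Made-up documents for illustration.
documents = [
    "the cat sat",
    "the cat saw the dog",
]

# Step 1: vocabulary of unique tokens across all documents.
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 2: one count vector per document, one entry per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Every vector has the same length as the vocabulary, so documents of different lengths become comparable fixed-size inputs.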


Implement Bag of Words using Python Keras

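# Tokenizer belongs to Keras's legacy text-preprocessing utilities
# (marked deprecated in recent TensorFlow releases).
from keras.preprocessing.text import Tokenizer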


text = [
     'There was a man',
     'The man had a dog',
     'The dog and the man walked',
]


Fit a Tokenizer on the text

model = Tokenizer()
model.fit_on_texts(text)


Get Bag of Words representation

rep = model.texts_to_matrix(text, mode='count')
print(rep)

Output:

Key (columns 1 to 9; column 0 is reserved by the Tokenizer and is always 0): ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]
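
To see where the column layout comes from, the matrix can be rebuilt without Keras. This is a sketch under stated assumptions, not the Tokenizer's actual code: it assumes words are indexed from 1 in order of descending corpus frequency, with ties kept in first-seen order, and that column 0 is left unused. It also assumes plain whitespace splitting, which is enough here because the sentences contain no punctuation.

```python
from collections import Counter

text = [
    "There was a man",
    "The man had a dog",
    "The dog and the man walked",
]

tokenized = [doc.lower().split() for doc in text]

# Count words over the whole corpus; Counter keeps first-seen order.
corpus_counts = Counter(token for tokens in tokenized for token in tokens)

# Rank by descending frequency; sorted() is stable, so ties keep first-seen order.
ranked = sorted(corpus_counts, key=lambda word: -corpus_counts[word])

# Indices start at 1 because column 0 is assumed to be reserved.
word_index = {word: i for i, word in enumerate(ranked, start=1)}

matrix = []
for tokens in tokenized:
    row = [0] * (len(word_index) + 1)
    for token in tokens:
        row[word_index[token]] += 1
    matrix.append(row)

print(ranked)
for row in matrix:
    print(row)
```

Under these assumptions the printed rows hold the same counts as the matrix above (as integers rather than floats), and `ranked` matches the key.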
