202511191300 Status: idea Tags: Datascience, NLP, Text representation Technique
Bag of Words
The Bag of Words (BoW) model is a simple and widely used text representation technique. It converts text into numerical form by counting how frequently each word appears in a document. It ignores grammar, word order, and context, focusing only on word occurrence.
What BoW Does
- Builds a vocabulary of all unique words in a dataset.
- Represents each document as a vector of word counts.
- Allows text to be used in machine learning tasks such as classification, clustering, and sentiment analysis.
Key Concepts
Vocabulary
A set of all unique words collected from the dataset.
Document Vector
A numerical vector where each position corresponds to a word in the vocabulary.
The value is the frequency of that word in the document.
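A minimal sketch of the two concepts above, assuming the vocabulary is already fixed and the text is already tokenized. The function name `document_vector` and the sample vocabulary are hypothetical, chosen only for illustration.

```python
from collections import Counter

def document_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["i", "love", "nlp", "learning"]
tokens = "i love nlp and i love learning".split()

vector = document_vector(tokens, vocabulary)
print(vector)  # [2, 2, 1, 1]
```

Note that "and" is not in the vocabulary, so it does not appear in the vector at all. Words outside the vocabulary are silently dropped.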
Properties
- Simple and interpretable
- High dimensional
- Ignores word order
- Sensitive to noise (casing, punctuation, typos) unless the text is normalized first
Example
Sample Sentences
- I love NLP
- Deep learning is fun
- NLP and deep learning are popular
- I enjoy learning NLP
Bag of Words Matrix (Word Counts)
| Sentence | I | love | enjoy | NLP | deep | learning | is | fun | and | are | popular |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I love NLP | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Deep learning is fun | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| NLP and deep learning are popular | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| I enjoy learning NLP | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
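The matrix above can be reproduced with a few lines of plain Python. This is a sketch that assumes lowercasing and whitespace splitting are enough for these four sentences. The column order is copied from the table header, and the helper name `count_row` is hypothetical.

```python
from collections import Counter

sentences = [
    "I love NLP",
    "Deep learning is fun",
    "NLP and deep learning are popular",
    "I enjoy learning NLP",
]

# Column order copied from the table header, lowercased for matching.
columns = ["i", "love", "enjoy", "nlp", "deep", "learning",
           "is", "fun", "and", "are", "popular"]

def count_row(sentence, columns):
    """Count how often each column word occurs in one sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in columns]

matrix = [count_row(s, columns) for s in sentences]

for sentence, row in zip(sentences, matrix):
    print(f"{sentence:35} {row}")
```

Lowercasing is what makes "Deep" in the second sentence and "deep" in the third land in the same column.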
Steps to Build a Bag of Words Model
| Step | Description |
|---|---|
| Preprocessing | Convert to lowercase, remove punctuation and numbers, remove extra spaces |
| Tokenization | Split text into individual words |
| Frequency Counting | Count occurrences of each word across the corpus |
| Vocabulary Construction | Collect unique words across documents |
| Vector Construction | Create a matrix of word counts for each document |
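The five steps in the table can be sketched end to end with the standard library. This is one possible implementation, not the method from the slides. The function names, the sample documents, the ASCII-only regular expression and the alphabetical column order are all assumptions made for the example.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase, drop punctuation and digits, collapse extra whitespace."""
    text = text.lower()
    # Assumption: keep only ASCII letters; accented letters would be dropped too.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split preprocessed text into words on whitespace."""
    return text.split()

def build_bow(documents):
    """Return (vocabulary, matrix) for a list of raw documents."""
    tokenized = [tokenize(preprocess(doc)) for doc in documents]
    # Frequency counting over the whole corpus.
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)
    # Vocabulary construction: unique words, sorted for a stable column order.
    vocabulary = sorted(corpus_counts)
    # Vector construction: one row of counts per document.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical documents with punctuation, a digit and a double space.
docs = ["I love NLP!", "NLP is fun,  and I love 2 things: NLP and tea."]
vocabulary, matrix = build_bow(docs)

print(vocabulary)  # ['and', 'fun', 'i', 'is', 'love', 'nlp', 'tea', 'things']
for row in matrix:
    print(row)
```

The corpus-level counts are only used here to collect the vocabulary. They could also be used to keep just the most frequent words when the vocabulary grows too large.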
When BoW Works Well
- Text classification
- Topic modeling
- Simple clustering tasks
- Systems where interpretability is important
Limitations
- Ignores context and meaning
- Vocabulary can become large
- Does not handle synonyms
- Word order is lost
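Two of these limitations are easy to see in a small sketch. The helper `bow` and the example sentences are hypothetical, and tokenization is simplified to lowercasing and whitespace splitting.

```python
from collections import Counter

def bow(text):
    """Bag of words for one sentence: word -> count."""
    return Counter(text.lower().split())

# Word order is lost: opposite meanings, identical representation.
a = bow("the dog bites the man")
b = bow("the man bites the dog")
print(a == b)  # True

# Synonyms are not handled: "film" and "movie" are unrelated dimensions.
c = bow("the film was great")
d = bow("the movie was great")
shared = set(c) & set(d)
print(sorted(shared))  # ['great', 'the', 'was']
```

The two sentences about the dog and the man get exactly the same counts, while the film and movie sentences overlap only on the words that happen to be spelled the same.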
References
- This is something we are learning for Data Science. The information comes from Avans 2-2 Data Science, 2025-11-12, and these slides belong with it.
- I was writing a note about NLP which mentions this.
- GeeksforGeeks: https://www.geeksforgeeks.org/nlp/bag-of-words-bow-model-in-nlp/