202511191259 Status: idea Tags: Datascience, NLP, Text representation Technique
One-Hot Encoding
One-hot encoding is a method that converts categorical values into numerical vectors so that machine learning models can process them. Each category becomes a binary vector in which exactly one position is 1 and all others are 0. Because every category vector is equally distant from the others, no unintended ordering is implied between categories.
Why it is used
- Most models cannot process raw text categories directly.
- Prevents the model from assuming numeric relationships between categories.
- Simple and effective for small to medium category sets.
How it works
Example 1: Simple categories
Original values:
| Value |
|---|
| Red |
| Blue |
| Green |
One-hot encoded:
| Value | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
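The table above can be reproduced with a few lines of plain Python; this is a minimal sketch in which the `one_hot` helper and the fixed category order are illustrative assumptions, not part of any library.

```python
# Minimal one-hot encoding of the three colors by hand.
categories = ["Red", "Blue", "Green"]

def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Red", categories))    # [1, 0, 0]
print(one_hot("Blue", categories))   # [0, 1, 0]
print(one_hot("Green", categories))  # [0, 0, 1]
```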
Example 2: Repeated categories
Original data:
| Item | Color |
|---|---|
| A | Red |
| B | Green |
| C | Red |
| D | Blue |
One-hot encoded:
| Item | Red | Blue | Green |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 0 | 1 |
| C | 1 | 0 | 0 |
| D | 0 | 1 | 0 |
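In practice this encoding is rarely done by hand; a sketch with `pandas.get_dummies` (assuming pandas is installed) produces the same table, with the note that pandas orders the dummy columns alphabetically rather than in first-seen order.

```python
import pandas as pd

# The original Item/Color table from the example above.
df = pd.DataFrame({"Item": ["A", "B", "C", "D"],
                   "Color": ["Red", "Green", "Red", "Blue"]})

# get_dummies creates one binary column per category (alphabetical order).
encoded = pd.get_dummies(df["Color"]).astype(int)
result = pd.concat([df["Item"], encoded], axis=1)
print(result)
```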
Advantages
- Removes false numeric relationships.
- Easy to interpret.
- Works well for many ML algorithms.
Limitations
- Produces wide vectors when categories are numerous.
- High memory usage for large vocabularies.
- Sparse representation may slow down some models.
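The width and sparsity limitations can be made concrete with a small sketch; the 10,000-word vocabulary below is a hypothetical example, not from the note.

```python
# Hypothetical large vocabulary: each one-hot vector is as wide as the
# vocabulary, and all but one of its positions are zero.
vocabulary = [f"word_{i}" for i in range(10_000)]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

vec = one_hot("word_42")
print(len(vec), sum(vec))  # 10000 positions, only one of them nonzero
```

This is why libraries such as scikit-learn default to a sparse matrix representation for one-hot output, storing only the nonzero entries.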
When to use it
- When the number of categories is small enough that the resulting vectors stay manageable.
- When preserving non-ordinal relationships is important.
- When working with simple classical ML models.
References
- This is material we cover for Data Science; the information comes from the Avans 2-2 Data Science lecture of 2025-11-12 and its accompanying slides.
- Mentioned while writing a note about NLP.
- GeeksforGeeks: https://www.geeksforgeeks.org/machine-learning/ml-one-hot-encoding/