202511191259 Status: idea Tags: Datascience, NLP, Text representation Technique

One-Hot Encoding

One-hot encoding is a method that converts categorical values into numerical vectors so that machine learning models can process them. Each category is represented by a binary vector in which exactly one position holds a 1 and all others are 0. Because no category is mapped to a larger or smaller number than another, no unintended ordering is introduced between categories.
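A minimal sketch in plain Python (the function name and the alphabetical category ordering below are illustrative choices, not part of any library):

```python
# Minimal sketch of one-hot encoding, with no external libraries.
def one_hot_encode(values):
    # Fix a category order so every vector has the same layout.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1   # exactly one position is set to 1
        vectors.append(vector)
    return categories, vectors

categories, vectors = one_hot_encode(["Red", "Blue", "Green", "Red"])
print(categories)  # ['Blue', 'Green', 'Red']
print(vectors)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```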

Why it is used

  • Most models cannot process raw text categories directly.
  • Prevents the model from assuming numeric relationships between categories.
  • Simple and effective for small to medium category sets.

How it works

Example 1: Simple categories

Original values:

Value
Red
Blue
Green

One-hot encoded:

Value   Red   Blue   Green
Red     1     0      0
Blue    0     1      0
Green   0     0      1
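
Assuming pandas is available, `pd.get_dummies` produces the same table in one call (note that it orders the columns alphabetically, so Blue comes before Red):

```python
import pandas as pd

# One binary column per colour; dtype=int gives 0/1 instead of True/False.
df = pd.DataFrame({"Value": ["Red", "Blue", "Green"]})
encoded = pd.get_dummies(df["Value"], dtype=int)
print(encoded)
#    Blue  Green  Red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0
```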

Example 2: Repeated categories

Original data:

Item   Color
A      Red
B      Green
C      Red
D      Blue

One-hot encoded:

Item   Red   Blue   Green
A      1     0      0
B      0     0      1
C      1     0      0
D      0     1      0
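
The same encoding can be produced with scikit-learn, which is the more common choice in a modelling pipeline because the fitted encoder remembers the column layout and can apply it to new data later. A sketch, assuming scikit-learn >= 1.2 (where the dense/sparse switch is named sparse_output):

```python
from sklearn.preprocessing import OneHotEncoder

# One row per item A-D, with only the Color column to encode.
colors = [["Red"], ["Green"], ["Red"], ["Blue"]]

encoder = OneHotEncoder(sparse_output=False)  # dense array for readability
encoded = encoder.fit_transform(colors)

print(encoder.categories_)  # [array(['Blue', 'Green', 'Red'], dtype=object)]
print(encoded)
# [[0. 0. 1.]   A: Red
#  [0. 1. 0.]   B: Green
#  [0. 0. 1.]   C: Red
#  [1. 0. 0.]]  D: Blue
```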

Advantages

  • Removes false numeric relationships.
  • Easy to interpret.
  • Works well for many ML algorithms.

Limitations

  • Produces wide vectors when categories are numerous.
  • High memory usage for large vocabularies.
  • The representation is sparse (mostly zeros), which may slow down some models; see the sketch after this list.
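
To illustrate the width and sparsity point, a small sketch (the row and category counts below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Simulate a high-cardinality column: 50,000 rows drawn from 1,000 categories.
rng = np.random.default_rng(0)
values = rng.integers(0, 1_000, size=50_000).astype(str).reshape(-1, 1)

# The default output is a scipy sparse matrix that stores only non-zero entries.
encoded = OneHotEncoder().fit_transform(values)
print(encoded.shape)  # (50000, 1000): one column per observed category
print(encoded.nnz)    # 50000: exactly one non-zero value per row

# The equivalent dense float array would hold 50 million entries (~400 MB),
# almost all of them zeros.
```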

When to use it

  • When the number of categories is small to moderate.
  • When it is important not to impose an artificial order on the categories.
  • When working with simple classical ML models.

References