202511191232 Status: idea Tags: Datascience, NLP
NLP tokenization
Tokenization is the process of breaking text into smaller units called tokens, such as words, characters, or phrases.
Types
- Word Tokenization: Splits text into individual words. Works well for languages with clear word boundaries.
- Character Tokenization: Splits text into individual characters. Useful for languages without clear word boundaries and for spelling tasks.
- Sub-word Tokenization: Breaks text into units between characters and words (e.g. BPE or WordPiece), so rare words can be built from known pieces; see the second sketch after this list.
- Sentence Tokenization: Divides paragraphs into separate sentences.
- N-gram Tokenization: Slides a window over a token sequence to produce overlapping chunks of n consecutive tokens, as in the first sketch below. Example (bigram, n=2): "I love NLP" → [("I", "love"), ("love", "NLP")]
References
- This is something we learn for Data Science. This information is from Avans 2-2 Data Science, 2025-11-12, together with the accompanying slides.
- I was writing a note about NLP that mentions this.