202511191232 Status: idea Tags: Datascience, NLP

NLP tokenization

Tokenization is the process of breaking text into smaller units called tokens, such as words, characters, or phrases.
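
As a quick illustration, a minimal sketch in plain Python (assuming only the standard library; whitespace splitting is the crudest word-level tokenizer, and real tokenizers also handle punctuation and edge cases):

```python
# Simplest possible tokenizer: split on whitespace,
# turning a sentence into a list of word tokens.
text = "Tokenization breaks text into smaller units"
tokens = text.split()
print(tokens)
# ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units']
```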

Types

  • Word Tokenization: Splits text into individual words. Works well for languages with clear word boundaries.
  • Character Tokenization: Splits text into individual characters. Useful for languages without clear word boundaries or for spelling tasks.
  • Sub-word Tokenization: Breaks text into units between characters and words (e.g., BPE or WordPiece), common in modern language models.
  • Sentence Tokenization: Divides paragraphs into separate sentences.
  • N-gram Tokenization: Slides a window of n consecutive tokens over the text. Example (bigram, n=2): "I love NLP" → [("I", "love"), ("love", "NLP")]. See the sketch after this list.
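
A minimal Python sketch of the types above, using only the standard library; the regexes are illustrative assumptions, and a production pipeline would use a library such as NLTK, spaCy, or Hugging Face tokenizers instead. Sub-word tokenization needs a vocabulary learned from a corpus, so it is only noted in a comment:

```python
import re

text = "I love NLP. Tokenizers make it work."

# Word tokenization: grab runs of word characters, dropping punctuation.
words = re.findall(r"\w+", text)
# ['I', 'love', 'NLP', 'Tokenizers', 'make', 'it', 'work']

# Character tokenization: every character becomes a token.
chars = list("NLP")  # ['N', 'L', 'P']

# Sentence tokenization: naive split after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)
# ['I love NLP.', 'Tokenizers make it work.']

# N-gram tokenization (bigram, n=2): a sliding window over the words.
n = 2
ngrams = list(zip(*(words[i:] for i in range(n))))
# [('I', 'love'), ('love', 'NLP'), ('NLP', 'Tokenizers'), ...]

# Sub-word tokenization (e.g., BPE/WordPiece) requires a vocabulary
# learned from a corpus, so it is not reproduced in a few lines here.
```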
