202511191232 Status: idea Tags: Datascience, NLP

NLP tokenization

Tokenization is the process of breaking text into smaller units called tokens, such as words, characters, or phrases.
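
As a quick illustration, a minimal sketch in plain Python (assuming only the standard library; whitespace splitting is the crudest word-level tokenizer, and real tokenizers also handle punctuation and edge cases):

```python
# Simplest possible tokenizer: split on whitespace,
# turning a sentence into a list of word tokens.
text = "Tokenization breaks text into smaller units"
tokens = text.split()
print(tokens)
# ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units']
```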

Types

  • Word Tokenization: Splits text into individual words. Works well for languages with clear word boundaries.
  • Character Tokenization: Splits text into individual characters. Useful for languages without clear word boundaries or for spelling tasks.
  • Sub-word Tokenization: Breaks text into units between characters and words (e.g., BPE or WordPiece), common in modern language models.
  • Sentence Tokenization: Divides paragraphs into separate sentences.
  • N-gram Tokenization: Slides a window of n consecutive tokens over the text. Example (bigram, n=2): "I love NLP" → [("I", "love"), ("love", "NLP")]. See the sketch after this list.
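
A minimal Python sketch of the types above, using only the standard library; the regexes are illustrative assumptions, and a production pipeline would use a library such as NLTK, spaCy, or Hugging Face tokenizers instead. Sub-word tokenization needs a vocabulary learned from a corpus, so it is only noted in a comment:

```python
import re

text = "I love NLP. Tokenizers make it work."

# Word tokenization: grab runs of word characters, dropping punctuation.
words = re.findall(r"\w+", text)
# ['I', 'love', 'NLP', 'Tokenizers', 'make', 'it', 'work']

# Character tokenization: every character becomes a token.
chars = list("NLP")  # ['N', 'L', 'P']

# Sentence tokenization: naive split after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)
# ['I love NLP.', 'Tokenizers make it work.']

# N-gram tokenization (bigram, n=2): a sliding window over the words.
n = 2
ngrams = list(zip(*(words[i:] for i in range(n))))
# [('I', 'love'), ('love', 'NLP'), ('NLP', 'Tokenizers'), ...]

# Sub-word tokenization (e.g., BPE/WordPiece) requires a vocabulary
# learned from a corpus, so it is not reproduced in a few lines here.
```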
