202511191222 Status: idea Tags: Datascience, NLP

NLP Text Normalization

Text normalization transforms raw text step by step:

  1. Convert text to lowercase
  2. Remove irrelevant numbers
  3. Remove punctuation
  4. Strip leading/trailing whitespace
  5. Remove stop words

Outcome: cleaned and consistent text, ready for NLP tasks.

You can use simple tools for this:

  • regex for extracting patterns such as email addresses
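
As a rough sketch of the regex idea, the snippet below pulls email-like strings out of a text with Python's `re` module. The pattern, the name `extract_emails` and the sample sentence are assumptions made for illustration; the pattern is deliberately simple and will not match every valid address.

```python
import re

# Deliberately simple pattern (an assumption for illustration):
# local part, "@", domain, and a top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list:
    """Return all substrings of `text` that look like email addresses."""
    return EMAIL_PATTERN.findall(text)

sample = "Contact anna@example.com or support@example.org for help."
print(extract_emails(sample))  # ['anna@example.com', 'support@example.org']
```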

In Python it is pretty easy to do most of these:

1. Convert text to lowercase

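# Example input (made up for illustration); replace with your own text
string = "  Hello World! I have 2 cats and 3 dogs.  "
lower_string = string.lower()
print(lower_string)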

2. Remove irrelevant numbers

import re
no_number_string = re.sub(r'\d+', '', lower_string)  # delete every run of digits
print(no_number_string)
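
The pattern above removes every digit, including digits inside words. If "irrelevant" is read as "numbers that stand alone as tokens", a variant with word boundaries might look like the sketch below. The function name `remove_standalone_numbers` is hypothetical, and whether this matches your notion of irrelevant depends on the task.

```python
import re

def remove_standalone_numbers(text: str) -> str:
    """Delete digit runs that form a whole token; digits inside words stay."""
    # \b only matches between a word character and a non-word character,
    # so the "19" in "covid19" is left alone.
    return re.sub(r"\b\d+\b", "", text)

print(remove_standalone_numbers("covid19 hit in 2020"))
```

The removed numbers leave their surrounding spaces behind, so a later whitespace step is still needed.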

3. Remove punctuation

# \w matches letters, digits and underscores, so underscores are kept
no_punc_string = re.sub(r'[^\w\s]', '', no_number_string)
print(no_punc_string)

4. Strip leading/trailing whitespace

no_wspace_string = no_punc_string.strip()
print(no_wspace_string)
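
`strip()` only touches the two ends of the string. Removing numbers and punctuation can leave double spaces in the middle, so it may be worth collapsing those as well. A minimal sketch, with `collapse_whitespace` as a hypothetical helper name:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space, then trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("  i have  cats and   dogs \n"))  # i have cats and dogs
```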

5. Remove stop words

import nltk
nltk.download('stopwords')  # one-time download of the stop word corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
lst_string = no_wspace_string.split()
no_stpwords_string = " ".join([w for w in lst_string if w not in stop_words])
print(no_stpwords_string)

Though I must say, this last step depends heavily on the language of your text, since every language needs its own stop word list.
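
To keep the language-dependent part swappable, the five steps could be wrapped in one function that takes the stop word list as a parameter. This is only a sketch: `normalize` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny built-in list is an assumption for illustration, not NLTK's full list.

```python
import re

# Tiny illustrative stop word list (an assumption, not NLTK's full list).
DEMO_STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "in", "of", "to"})

def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Apply the five normalization steps in order and return the cleaned text."""
    text = text.lower()                   # 1. lowercase
    text = re.sub(r"\d+", "", text)       # 2. remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # 3. remove punctuation
    text = text.strip()                   # 4. strip leading/trailing whitespace
    # 5. drop stop words; split() also swallows leftover double spaces
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(tokens)

print(normalize("  The 2 Cats and the Dog, in a Garden!  "))  # cats dog garden
```

For real use, pass a proper list for your language instead of the demo set, for example `set(stopwords.words('german'))` from NLTK.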


References