202511191222
Status: idea
Tags: Datascience, NLP

NLP Text Normalization

Text normalization transforms text step by step:

Convert text to lowercase
Remove irrelevant numbers
Remove punctuation
Strip leading/trailing whitespace
Remove stop words

Outcome: Cleaned and consistent text ready for NLP tasks
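The steps above can be sketched as one small function. This is only a minimal sketch: the name `normalize` and the tiny `SAMPLE_STOP_WORDS` set are my own assumptions for illustration, and a real pipeline would load a complete stop word list for the language of the text.

```python
import re

# Illustrative stop word list (an assumption for this sketch); a real
# pipeline would load a complete list for the language at hand.
SAMPLE_STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are"}


def normalize(text, stop_words=SAMPLE_STOP_WORDS):
    """Apply the five normalization steps in order."""
    text = text.lower()                  # 1. lowercase
    text = re.sub(r"\d+", "", text)      # 2. remove numbers
    text = re.sub(r"[^\w\s]", "", text)  # 3. remove punctuation
    text = text.strip()                  # 4. strip leading/trailing whitespace
    # 5. remove stop words; split() also collapses repeated spaces
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)


print(normalize("  The 3 cats are sleeping in the garden!  "))
# cats sleeping garden
```

The order matters a little here: lowercasing first means the stop word check only has to deal with lowercase words.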

You can use simple tools for this:

Regular expressions (regex), for extracting patterns such as email addresses.
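As a rough illustration of pattern extraction, here is a sketch that pulls email addresses out of a sentence. The pattern is deliberately simple and is my own assumption: it covers common addresses, not every form the email standards allow.

```python
import re

# Deliberately simple pattern (an assumption of this sketch): it covers
# common addresses but not every form the email standards allow.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = EMAIL_PATTERN.findall(sample)
print(emails)
# ['alice@example.com', 'bob.smith@mail.example.org']
```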

In Python it is pretty easy to do most of these:

1. Convert text to lowercase

# string holds the raw input text
lower_string = string.lower()
print(lower_string)

2. Remove irrelevant numbers

import re
no_number_string = re.sub(r'\d+', '', lower_string)
print(no_number_string)
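Which numbers count as "irrelevant" depends on the task. The pattern above removes every run of digits, including digits inside words. A possible variant, shown here as a sketch with a made-up sample sentence, uses word boundaries so that only standalone numbers are removed.

```python
import re

sample = "covid19 cases rose by 250 in 3 weeks"

# Pattern from the step above: removes every run of digits,
# including the "19" inside "covid19".
all_digits_removed = re.sub(r"\d+", "", sample)
print(repr(all_digits_removed))
# 'covid cases rose by  in  weeks'

# Word-boundary variant: removes only standalone numbers.
standalone_removed = re.sub(r"\b\d+\b", "", sample)
print(repr(standalone_removed))
# 'covid19 cases rose by  in  weeks'
```

Both versions leave double spaces where a number used to be, so the whitespace still needs cleaning up afterwards.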

3. Remove punctuation

no_punc_string = re.sub(r'[^\w\s]', '', no_number_string)
print(no_punc_string)
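One detail worth knowing: `\w` also matches the underscore, so the pattern above leaves underscores in place. The sketch below shows this on a made-up sample string, plus a variant that removes underscores too, if that is what you want.

```python
import re

sample = "snake_case, it's done!"

# Pattern from the step above: underscores survive because \w matches "_".
keeps_underscore = re.sub(r"[^\w\s]", "", sample)
print(keeps_underscore)
# snake_case its done

# Variant that also removes underscores.
drops_underscore = re.sub(r"[^\w\s]|_", "", sample)
print(drops_underscore)
# snakecase its done
```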

4. Strip leading/trailing whitespace

no_wspace_string = no_punc_string.strip()
print(no_wspace_string)
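Note that `strip()` only touches the start and end of the string. Removing numbers or punctuation earlier can leave repeated spaces in the middle. A small sketch (the sample text is made up) of collapsing those as well:

```python
text = "  i have  apples and   pears  "

# strip() removes only leading and trailing whitespace.
stripped = text.strip()
print(repr(stripped))
# 'i have  apples and   pears'

# split() without arguments splits on any run of whitespace,
# so joining the parts with one space collapses the inner gaps too.
collapsed = " ".join(text.split())
print(repr(collapsed))
# 'i have apples and pears'
```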

5. Remove stop words

import nltk
nltk.download('stopwords')  # one-time download of the stop word lists
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
lst_string = no_wspace_string.split()
no_stpwords_string = " ".join([w for w in lst_string if w not in stop_words])
print(no_stpwords_string)

Though I must say, this last step really depends on the language of your text: each language needs its own stop word list.
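To make the language point concrete, here is a sketch with two tiny hand-written stop word lists. The lists and the helper `remove_stop_words` are assumptions for illustration and far from complete. With NLTK you would instead pass the language name to `stopwords.words`, for example `stopwords.words('dutch')`.

```python
# Tiny illustrative lists (assumptions for this sketch, far from complete).
SAMPLE_STOP_WORDS = {
    "english": {"the", "is", "on", "a"},
    "dutch": {"de", "het", "een", "is", "op"},
}


def remove_stop_words(text, language):
    """Drop the words found in the sample stop word list for `language`."""
    stop_words = SAMPLE_STOP_WORDS[language]
    return " ".join(w for w in text.split() if w not in stop_words)


print(remove_stop_words("the cat is on the table", "english"))
# cat table
print(remove_stop_words("de kat is op de tafel", "dutch"))
# kat tafel

# Using the wrong list leaves most stop words in place:
print(remove_stop_words("de kat is op de tafel", "english"))
# de kat op de tafel
```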

References

This is something we are learning for Data Science. The information comes from Avans 2-1 Data Science, 2025-11-12, and these slides go with it.
I was writing a note about NLP which mentions this.

🌵OldMartijntje
