202511191222
Status: idea
Tags: Datascience, NLP
NLP Text Normalization
Text normalization transforms text step by step:
- Convert text to lowercase
- Remove irrelevant numbers
- Remove punctuation
- Strip leading/trailing whitespace
- Remove stop words
Outcome: cleaned and consistent text, ready for NLP tasks.
You can use simple tools for this:
- Regular expressions (regex) for extracting patterns such as email addresses
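As a minimal sketch of the regex idea, the snippet below pulls email-like strings out of a text. The pattern, the function name `extract_emails`, and the sample sentence are assumptions made for illustration; the pattern is deliberately simple and does not cover every valid address.

```python
import re

# Deliberately simple pattern for illustration (assumption, not a full
# email validator): local part, "@", domain, dot, top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return all substrings of `text` that look like email addresses."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text
sample = "Contact anna@example.com or support@example.org for details."
print(extract_emails(sample))
```

The same approach works for other patterns (dates, URLs, phone numbers) by swapping in a different expression.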
In Python it is pretty easy to do most of these:
1. Convert text to lowercase
string = "  Hello World! It is 2025, and NLP is fun.  "  # example input (assumed)
lower_string = string.lower()
print(lower_string)
2. Remove irrelevant numbers
import re
no_number_string = re.sub(r'\d+', '', lower_string)  # drop every run of digits
print(no_number_string)
3. Remove punctuation
no_punc_string = re.sub(r'[^\w\s]', '', no_number_string)  # keep only word characters and whitespace
print(no_punc_string)
4. Strip leading/trailing whitespace
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)
5. Remove stop words
import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
lst_string = no_wspace_string.split()
no_stpwords_string = " ".join([w for w in lst_string if w not in stop_words])
print(no_stpwords_string)
Though I must say, this last step really depends on your language: each language needs its own stop-word list.
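To see the five steps working together, and why the stop-word step is language-dependent, here is a rough sketch that chains them in one function. The function name `normalize`, the `STOP_WORDS` dictionary, and its tiny word sets are assumptions made for illustration, not complete lists; with NLTK you would likely swap in `stopwords.words('english')` or `stopwords.words('dutch')` instead.

```python
import re

# Tiny illustrative stop-word sets (assumptions for this sketch,
# far from complete; a real project would use a proper list).
STOP_WORDS = {
    "english": {"the", "is", "a", "and", "in", "of"},
    "dutch": {"de", "het", "een", "en", "is", "in", "van"},
}


def normalize(text, language="english"):
    """Apply the five normalization steps in order and return the result."""
    text = text.lower()                  # 1. convert to lowercase
    text = re.sub(r"\d+", "", text)      # 2. remove numbers
    text = re.sub(r"[^\w\s]", "", text)  # 3. remove punctuation
    text = text.strip()                  # 4. strip leading/trailing whitespace
    words = text.split()                 # split() also collapses repeated spaces
    stop_words = STOP_WORDS[language]
    # 5. remove stop words for the chosen language
    return " ".join(w for w in words if w not in stop_words)


print(normalize("  The cat is in the garden, and 2 dogs bark!  "))
print(normalize("De kat zit in het huis van de buren.", language="dutch"))
```

Running the Dutch sentence with the English set leaves words like "de" and "het" in place, which is exactly the language dependence mentioned above.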
References
- This is something we are learning for Datascience. The information comes from Avans 2-2 Datascience, 2025-11-12, together with the accompanying slides.
- I was writing a note about NLP which mentions this.