Before an AI model can generate human-like responses, summarize documents, or translate languages, it faces a surprisingly difficult task: understanding raw text. And not in the way humans do. For machines, language is not inherently meaningful; it is just a sequence of symbols, patterns, and noise.
This is where the real story of Natural Language Processing (NLP) begins, not with intelligence, but with transformation.
This blog dives deep into that transformation: how raw text is broken down, cleaned, structured, and eventually turned into mathematical representations that machines can learn from. Along the way, we’ll explore how techniques evolved from rigid rule-based systems to flexible neural architectures, and why representation, not just modelling, is the true backbone of modern AI.
If you’re looking to start your journey in Generative AI and NLP, understanding this foundation is essential.
Imagine feeding the following sentence to a machine:
“OMG!!! This movie is soooo goooood 🔥 #MustWatch”
To a human, this clearly conveys excitement and positive sentiment.
To a machine, however, it looks like a chaotic mix of:
- elongated words (“soooo”, “goooood”)
- stacked punctuation (“!!!”)
- an emoji (🔥) and a hashtag (#MustWatch)
If we don’t process this text properly, even the most advanced model will struggle.
Tokenization is the process of splitting text into smaller units called tokens. These tokens form the building blocks for all NLP tasks.
But here’s where it gets interesting: there are multiple ways to tokenize text, and each comes with trade-offs.
Fig 1.1 Tokenization in NLP
At first glance, splitting text into words seems natural:
“I love AI” → [“I”, “love”, “AI”]
This approach works well for simple tasks and aligns with human understanding. However, it quickly runs into limitations:
- vocabularies balloon into hundreds of thousands of entries
- any word not seen during training becomes an unknown (OOV) token
- misspellings and morphological variants (“run”, “running”, “runs”) are treated as unrelated
This leads to inefficiencies and poor generalization.
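A naive word-level tokenizer is easy to sketch with a regular expression, and just as easy to break with informal text like the tweet above (a minimal illustration, not a production tokenizer):

```python
import re

def word_tokenize(text):
    # Naive word-level tokenization: grab runs of word characters,
    # and keep each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("I love AI"))
# Informal text produces tokens a fixed vocabulary has likely never seen:
print(word_tokenize("This movie is soooo goooood"))
```

Every elongated spelling like “soooo” becomes its own vocabulary entry, which is exactly the generalization problem described above.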
Character-level tokenization takes the opposite extreme and splits text into individual characters:
“I love AI” → [‘I’, ‘ ’, ‘l’, ‘o’, ‘v’, ‘e’, …]
This approach eliminates the problem of unknown words entirely. Every possible input can be represented.
However, it introduces new challenges:
- sequences become much longer, increasing compute cost
- individual characters carry almost no meaning, so the model must learn word structure from scratch
While powerful, character-level models often require more data and deeper architectures.
Modern NLP systems rely heavily on subword tokenization, which strikes a balance between word-level and character-level approaches.
Rare words are split into meaningful, reusable pieces:
“unbelievable” → [“un”, “believ”, “able”]
This is why most transformer-based models use subword tokenization: it offers both flexibility and efficiency.
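A minimal sketch of how subword segmentation can work, using greedy longest-match over a hand-picked subword vocabulary (real tokenizers such as BPE or WordPiece learn this vocabulary from data; the vocabulary below is an illustrative assumption):

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match segmentation (WordPiece-style sketch):
    # repeatedly take the longest vocabulary entry that prefixes the rest.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

vocab = {"un", "believ", "able", "expect", "ed"}
print(subword_tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
print(subword_tokenize("unexpected", vocab))    # ['un', 'expect', 'ed']
```

The character fallback is what guarantees that no input is ever “unknown”: any word outside the vocabulary still decomposes into representable pieces.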
Once tokenized, text still contains inconsistencies that can confuse models. Text normalization aims to standardize input.
Lowercasing reduces vocabulary size:
“AI” and “ai” → “ai”
Punctuation removal strips noisy symbols:
“Hello!!!” → “Hello”
Stopword removal drops very common words:
“The cat is on the mat” → “cat mat”
Stemming reduces words to root forms:
“running”, “runs” → “run”
Lemmatization is more linguistically accurate, mapping words to dictionary forms:
“better” → “good”
While normalization improves consistency, it can also remove important context. Stopword removal, for example, can strip negation:
“I am not happy” → “happy”
The meaning is completely reversed.
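A minimal sketch of such a normalization pipeline makes the danger concrete. The stopword list here is a tiny hand-picked assumption (real lists, such as NLTK's, are much longer — and many of them do include “not”):

```python
import re
import string

STOPWORDS = {"i", "am", "the", "is", "on", "not"}  # toy list for illustration

def normalize(text, remove_stopwords=True):
    # Lowercase, strip punctuation, then optionally drop stopwords.
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(normalize("The cat is on the mat"))  # ['cat', 'mat']
print(normalize("I am not happy"))         # ['happy'] — the negation is gone
```

Each step shrinks the vocabulary, but the last example shows why aggressive preprocessing can silently flip a sentence's meaning.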
This is why modern neural approaches often minimize aggressive preprocessing: models can learn patterns directly from raw data.
Language diversity introduces another layer of complexity.
Each language has its own script, structure, and encoding challenges.
Unicode provides a consistent way to represent characters across languages.
‘é’ → U+00E9
Without Unicode, global NLP systems would be fragmented and inconsistent.
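Python's standard unicodedata module can show why Unicode normalization matters: the same visible character can be stored as one code point or as two, and naive string comparison treats them as different words:

```python
import unicodedata

# The same visible 'é' can have two different encodings:
composed = "\u00e9"        # single code point U+00E9
decomposed = "e\u0301"     # 'e' followed by a combining acute accent

print(composed == decomposed)  # False: different code-point sequences
# NFC normalization folds both into the composed form:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Pipelines that skip this step can end up with duplicate vocabulary entries for what users perceive as the same word.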
Consider code-mixed text, which blends languages within a single sentence:
“Kal meeting hai bro” (Hindi-English for “The meeting is tomorrow, bro”)
Shared subword vocabularies and multilingual training allow a single model to understand multiple languages, including such mixed input, effectively.
Once text is cleaned and tokenized, it must be converted into a numerical format.
Machines don’t understand words; they understand vectors.
Each word is represented as a vector with a single “1”:
“cat” → [0, 0, 1, 0, 0]
While simple, this approach has major drawbacks:
- vectors are as long as the vocabulary and almost entirely zeros
- every pair of distinct words is equally distant, so “cat” is no closer to “dog” than to “car”
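A tiny sketch over a five-word toy vocabulary makes the problem visible: the dot product between any two different one-hot vectors is zero, so one-hot encoding carries no notion of similarity at all.

```python
vocab = ["dog", "cat", "mat", "king", "queen"]

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)  # [0, 1, 0, 0, 0]

# Dot product of distinct one-hot vectors is always 0:
print(sum(a * b for a, b in zip(cat, dog)))  # 0 — "cat" looks unrelated to "dog"
```

With a realistic vocabulary of 100,000+ words, each vector would also be 100,000 dimensions long, almost all of it zeros.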
Word2Vec introduced the idea that meaning comes from context.
Words that appear in similar contexts have similar representations.
“king” – “man” + “woman” ≈ “queen”
This demonstrated that embeddings can capture semantic relationships.
GloVe (Global Vectors) uses co-occurrence statistics across the entire corpus.
This results in richer representations.
FastText improves embeddings by considering character-level information.
Instead of treating words as atomic units, it breaks them into n-grams.
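The n-gram extraction itself is straightforward to sketch; the boundary markers < and > let the model distinguish prefixes and suffixes from word-internal fragments:

```python
def char_ngrams(word, n=3):
    # FastText-style: wrap the word in boundary markers,
    # then slide a window of length n across it.
    token = f"<{word}>"
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's embedding is then built from the embeddings of its n-grams, so even an unseen word like “wherever” shares fragments (“<wh”, “whe”, “her”) with words the model has seen.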
Fig 1.2 Word Embeddings in NLP
Traditional embeddings assign a single vector per word.
“bank” → same vector always
Modern models generate embeddings dynamically.
“I went to the bank” → financial meaning
“The river bank is wide” → geographical meaning
Models like transformers capture context by analyzing surrounding words.
Contextual embeddings enable:
- word-sense disambiguation (“bank” as institution vs. riverbank)
- transfer learning across tasks
- far more accurate understanding of whole sentences
This shift is one of the biggest breakthroughs in NLP.
Once words are represented as vectors, we can measure how similar they are.
Cosine similarity measures the angle between two vectors: the smaller the angle, the more similar the meanings.
Embeddings can solve relationships:
Paris : France :: Tokyo : Japan
This shows that models capture structure in language, not just words.
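Cosine similarity is simple to compute directly. The 3-dimensional “embeddings” below are hand-crafted for illustration (real learned embeddings have hundreds of dimensions), but the geometry is the same:

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: "cat" and "dog" point in similar directions, "car" does not.
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine(emb["cat"], emb["dog"]))  # close to 1: similar meaning
print(cosine(emb["cat"], emb["car"]))  # much lower: different meaning
```

The analogy trick works the same way: compute king − man + woman, then find the vocabulary vector with the highest cosine similarity to the result.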
Before deep learning, NLP relied on handcrafted rules.
Examples:
- regular-expression patterns for dates, emails, and phone numbers
- handcrafted grammars and keyword lists
Strengths:
- transparent and predictable behavior
- no training data required
Weaknesses:
- brittle: rules break on phrasing they were not written for
- expensive to build and maintain as language varies
Engineers manually design features:
Example:
“This movie is great” → positive sentiment
Features might include:
- counts of positive and negative words (“great”, “terrible”)
- punctuation and capitalization patterns
- word n-grams
Limitations:
- feature design is slow, manual, and task-specific
- features rarely transfer to new domains or languages
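A sketch of what such handcrafted feature extraction might look like. The sentiment lexicons here are tiny hand-picked assumptions, which is itself the point: someone has to choose and maintain every entry:

```python
# Hand-picked lexicons (illustrative assumptions, not a standard resource):
POSITIVE = {"great", "good", "love", "amazing"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}

def extract_features(text):
    # Strip trailing punctuation from each token before lookup.
    tokens = [t.strip("!?.,") for t in text.lower().split()]
    return {
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
        "has_exclamation": "!" in text,
    }

print(extract_features("This movie is great!"))
# {'n_positive': 1, 'n_negative': 0, 'has_exclamation': True}
```

These feature dictionaries would then be fed to a classical classifier; every new domain (product reviews, tweets, news) typically demands a new round of feature design.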
Modern NLP replaces manual design with learning.
Fig 1.3 Neural NLP
Instead of:
Text → Features → Model → Output
We now have:
Text → Model → Output
The model learns:
- which aspects of the text matter for the task
- how to represent words and sentences internally
Advantages:
- far less manual engineering
- better generalization across tasks and domains
Challenges:
- requires large datasets and significant compute
- learned representations are harder to interpret
Evaluation is critical but complex.
Accuracy:
Useful for classification.
Precision:
Measures correctness of positive predictions.
Recall:
Measures completeness.
F1 Score:
Balances precision and recall.
BLEU:
Used for translation.
ROUGE:
Used for summarization.
Perplexity:
Measures how uncertain a language model is about the next token; lower perplexity means the model finds the text less surprising.
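The classification metrics above follow directly from the confusion-matrix counts: true positives (tp), false positives (fp), and false negatives (fn). A minimal computation:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: of the predicted positives, how many were correct?
    precision = tp / (tp + fp)
    # Recall: of the actual positives, how many did we find?
    recall = tp / (tp + fn)
    # F1: harmonic mean, which punishes imbalance between the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)  # 0.8, ~0.667, ~0.727
```

Note how F1 (~0.727) sits below the plain average of precision and recall (~0.733): the harmonic mean penalizes the weaker of the two.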
Metrics don’t always reflect real-world performance.
A chatbot might score high on benchmarks but still feel unnatural.
Human evaluation remains essential.
Let’s connect everything into a single pipeline:
Raw text → Tokenization → Normalization → Embeddings → Model → Output
Each step plays a crucial role.
A mistake early in the pipeline can cascade into poor results later.
At its core, NLP is about representation.
The way we encode language determines:
- what patterns a model can learn
- how well it generalizes to new input
Better representation leads to better intelligence.
The field continues to evolve rapidly.
Multimodal Models:
Combining text, images, and audio.
Unified Architectures:
One model handling multiple tasks.
Low-Resource Learning:
Performing well with limited data.
Longer Context Understanding:
Handling entire documents instead of short snippets.
When you interact with AI, it feels natural. Almost human.
But underneath, there is a complex pipeline: tokenization, normalization, embeddings, and layers of learned transformations.
What feels like understanding is actually mathematical pattern recognition at scale.
And it all begins with how we preprocess and represent text.
If you’re building AI systems, remember: representation matters as much as the model.
Because in the end, AI doesn’t understand language the way humans do.
It understands the representation of language we give it.
If you’re ready to go beyond theory and start your journey in Generative AI today, now is the perfect time to begin.