
What if the smartest AI you’ve ever used… actually begins its journey by struggling to read?

It sounds counterintuitive, but it’s true.

Before an AI model can generate human-like responses, summarize documents, or translate languages, it faces a surprisingly difficult task: understanding raw text. And not in the way humans do. For machines, language is not inherently meaningful; it's just a sequence of symbols, patterns, and noise.

This is where the real story of Natural Language Processing (NLP) begins, not with intelligence, but with transformation.

This blog dives deep into that transformation: how raw text is broken down, cleaned, structured, and eventually turned into mathematical representations that machines can learn from. Along the way, we’ll explore how techniques evolved from rigid rule-based systems to flexible neural architectures, and why representation, not just modeling, is the true backbone of modern AI.

If you’re looking to start your journey in Generative AI and NLP, understanding this foundation is essential.

From Chaos to Structure: Why Text Preprocessing Matters

Imagine feeding the following sentence to a machine:

“OMG!!! This movie is soooo goooood 🔥 #MustWatch”

To a human, this clearly conveys excitement and positive sentiment.
To a machine, however, it looks like a chaotic mix of:

• Slang (“OMG”)
• Repeated punctuation (“!!!”)
• Elongated words (“soooo goooood”)
• An emoji (🔥)
• A hashtag (“#MustWatch”)

If we don’t process this text properly, even the most advanced model will struggle.

Text preprocessing is not just a preliminary step; it is the foundation of everything that follows.

Tokenization: Teaching Machines to Read

Tokenization is the process of splitting text into smaller units called tokens. These tokens form the building blocks for all NLP tasks.

But here’s where it gets interesting: there are multiple ways to tokenize text, and each comes with trade-offs.

Fig 1.1 Tokenization in NLP

Word-Level Tokenization: The Intuitive Start

At first glance, splitting text into words seems natural:

“I love AI” → [“I”, “love”, “AI”]

This approach works well for simple tasks and aligns with human understanding. However, it quickly runs into limitations:

• Out-of-vocabulary words: anything unseen during training becomes an unknown token
• Huge vocabularies: every inflection (“run”, “runs”, “running”) gets its own entry
• Poor handling of typos, slang, and rare words

This leads to inefficiencies and poor generalization.
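
To make this concrete, here is a minimal word-level tokenizer in Python (a sketch using a simple regex; production tokenizers handle far more edge cases):

    import re

    def word_tokenize(text):
        # Keep runs of word characters as tokens; split punctuation off separately.
        return re.findall(r"\w+|[^\w\s]", text)

    print(word_tokenize("I love AI"))       # ['I', 'love', 'AI']
    print(word_tokenize("OMG!!! So good"))  # ['OMG', '!', '!', '!', 'So', 'good']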

Character-Level Tokenization: Going Deeper

Instead of words, we can split text into individual characters:

“I love AI” → [‘I’, ‘ ’, ‘l’, ‘o’, ‘v’, ‘e’, …]

This approach eliminates the problem of unknown words entirely. Every possible input can be represented.

However, it introduces new challenges:

• Sequences become much longer, increasing computational cost
• Individual characters carry little meaning on their own
• Models must learn to compose words from scratch

While powerful, character-level models often require more data and deeper architectures.
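
In Python, character-level tokenization is trivially simple, which partly explains its appeal (a minimal sketch):

    text = "I love AI"
    tokens = list(text)  # every character, including spaces, becomes a token
    print(tokens)        # ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'A', 'I']
    print(len(tokens))   # 9 tokens for a 3-word sentence: sequences grow fast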

Subword Tokenization: The Sweet Spot

Modern NLP systems rely heavily on subword tokenization, which strikes a balance between word-level and character-level approaches.

Example:

“unbelievable” → [“un”, “believ”, “able”]

Popular algorithms include:

• Byte Pair Encoding (BPE)
• WordPiece
• SentencePiece (Unigram)

Why subword tokenization works so well:

• Rare and unseen words decompose into known subwords
• The vocabulary stays compact and manageable
• Frequent words remain whole, keeping sequences short

This is why most transformer-based models use subword tokenization: it offers both flexibility and efficiency.
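
You can see subword tokenization in action with the Hugging Face transformers library (a sketch, assuming the library is installed; the exact splits depend on each model's learned vocabulary):

    from transformers import AutoTokenizer

    # WordPiece tokenizer used by BERT; '##' marks a continuation piece.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    print(tokenizer.tokenize("unbelievable"))
    # e.g. ['un', '##believ', '##able'] -- rare words break into known pieces
    print(tokenizer.tokenize("the"))
    # ['the'] -- frequent words stay whole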

Cleaning the Data: Text Normalization

Once tokenized, text still contains inconsistencies that can confuse models. Text normalization aims to standardize input.

Common Techniques

Lowercasing

Reduces vocabulary size:

“AI” and “ai” → “ai”

Removing Punctuation

“Hello!!!” → “Hello”

Stopword Removal

“The cat is on the mat” → “cat mat”

Stemming

Reduces words to root forms:

“running”, “runs” → “run”

Lemmatization

More linguistically accurate:

“better” → “good”
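
All of these steps can be sketched with NLTK (assuming nltk is installed and its data packages downloaded):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    for pkg in ("stopwords", "wordnet", "omw-1.4"):
        nltk.download(pkg, quiet=True)

    tokens = [t.lower() for t in "The cat is running on the mat".split()]  # lowercasing
    stop = set(stopwords.words("english"))
    filtered = [t for t in tokens if t not in stop]        # stopword removal
    print(filtered)                                        # ['cat', 'running', 'mat']

    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in filtered])             # ['cat', 'run', 'mat']

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("better", pos="a"))         # 'good' (as an adjective)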

The Trade-Off

While normalization improves consistency, it can also remove important context.

Example:

“I am not happy” → “happy”

The meaning is completely reversed.

This is why modern neural approaches often minimize aggressive preprocessing: models can learn patterns directly from raw data.

Beyond English: Multilingual Text and Unicode

Language diversity introduces another layer of complexity.

Consider:

• English uses the Latin script, with spaces between words
• Chinese has no spaces separating words
• Arabic is written right to left
• Hindi uses the Devanagari script

Each language has its own script, structure, and encoding challenges.

Unicode: The Universal Standard

Unicode provides a consistent way to represent characters across languages.

Example:

‘é’ → U+00E9

Without Unicode, global NLP systems would be fragmented and inconsistent.
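
Python exposes Unicode directly; a quick sketch shows both the code point and why normalization matters (the same visible character can have two different encodings):

    import unicodedata

    print(f"U+{ord('é'):04X}")   # U+00E9

    # NFC stores 'é' as one code point; NFD as 'e' plus a combining accent.
    composed = unicodedata.normalize("NFC", "é")
    decomposed = unicodedata.normalize("NFD", "é")
    print(len(composed), len(decomposed))   # 1 2
    print(composed == decomposed)           # False -- normalize before comparing!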

Challenges in Multilingual NLP

Real-world text often mixes languages within a single sentence, a phenomenon called code-mixing:

“Kal meeting hai bro”

This Hindi-English blend is everyday language for millions of users, yet it breaks monolingual pipelines. Transliteration, scripts without word boundaries, and scarce data for low-resource languages add further difficulty.

Modern Solutions

• Language-agnostic subword tokenizers such as SentencePiece
• Shared vocabularies across languages
• Multilingual pretrained models like mBERT and XLM-R

These advancements allow a single model to understand multiple languages effectively.

Turning Words into Numbers: Word Embeddings

Once text is cleaned and tokenized, it must be converted into a numerical format.

Machines don’t understand words; they understand vectors.

One-Hot Encoding: The Beginning

Each word is represented as a vector with a single “1”:

“cat” → [0, 0, 1, 0, 0]

While simple, this approach has major drawbacks:

• Vectors are as long as the vocabulary and almost entirely zeros
• Every pair of words is equally distant, so no similarity is captured
• The representation carries no meaning at all
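
A tiny NumPy sketch makes these drawbacks visible (the five-word vocabulary is hypothetical):

    import numpy as np

    vocab = ["dog", "car", "cat", "tree", "run"]  # hypothetical vocabulary

    def one_hot(word):
        vec = np.zeros(len(vocab))
        vec[vocab.index(word)] = 1.0
        return vec

    cat, dog = one_hot("cat"), one_hot("dog")
    print(cat)               # [0. 0. 1. 0. 0.]
    print(np.dot(cat, dog))  # 0.0 -- 'cat' and 'dog' look completely unrelated

Real vocabularies contain tens of thousands of words, so these vectors also become enormous and almost entirely zero.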

Word2Vec: Learning Meaning from Context

Word2Vec introduced the idea that meaning comes from context.

Words that appear in similar contexts have similar representations.

Example:

“king” – “man” + “woman” ≈ “queen”

This demonstrated that embeddings can capture semantic relationships.
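
You can try this yourself with gensim's pretrained vectors (a sketch; the Google News model is a large download, roughly 1.6 GB):

    import gensim.downloader as api

    # Pretrained Word2Vec vectors trained on Google News.
    wv = api.load("word2vec-google-news-300")

    # king - man + woman ≈ ?
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # [('queen', ~0.71)]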

GloVe: Combining Local and Global Context

GloVe (Global Vectors) uses co-occurrence statistics across the entire corpus.

It captures:

• Local context: which words appear near each other
• Global statistics: how often words co-occur across the whole corpus

This results in richer representations.

FastText: Understanding Subwords

FastText improves embeddings by considering character-level information.

Instead of treating words as atomic units, it breaks them into n-grams.

This allows:

• Representations for rare and even unseen words, built from shared subwords
• Awareness of morphology (prefixes, suffixes, roots)
• Robustness to typos and spelling variants
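
The core idea fits in a few lines: FastText represents each word by its character n-grams plus the word itself, with '<' and '>' marking word boundaries (a simplified sketch):

    def char_ngrams(word, n=3):
        # Add boundary markers, then slide a window of size n across the word.
        marked = f"<{word}>"
        return [marked[i:i + n] for i in range(len(marked) - n + 1)]

    print(char_ngrams("where"))
    # ['<wh', 'whe', 'her', 'ere', 're>'] -- pieces shared with 'here', 'hers', ...

Because unseen words still share n-grams with known ones, FastText can build a reasonable vector for a word it has never seen.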

 Fig 1.2 Word Embeddings in NLP

Static vs Contextual Embeddings: A Paradigm Shift

Traditional embeddings assign a single vector per word.

Static Embeddings:

“bank” → same vector always

Problem:

The word “bank” has multiple senses (a financial institution, the side of a river), but a static embedding collapses them all into one vector.

Contextual Embeddings

Modern models generate embeddings dynamically.

“I went to the bank” → financial meaning 
“The river bank is wide” → geographical meaning  

Models like transformers capture context by analyzing surrounding words.
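
A sketch with a pretrained BERT model makes the difference measurable (assuming transformers and torch are installed):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bank_vector(sentence):
        # Return the contextual vector for the token 'bank' in this sentence.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index("bank")]

    v1 = bank_vector("I went to the bank to deposit money")
    v2 = bank_vector("The river bank is wide")
    print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
    # noticeably below 1.0: the same word gets different vectors in context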

Why This Changed Everything

Contextual embeddings enable:

• Word-sense disambiguation
• Stronger transfer learning across tasks
• State-of-the-art performance on nearly every NLP benchmark

This shift is one of the biggest breakthroughs in NLP.

Measuring Meaning: Semantic Similarity

Once words are represented as vectors, we can measure how similar they are.

Cosine Similarity

Measures the cosine of the angle between two vectors: 1 means the same direction, values near 0 mean unrelated.
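
In code it's a one-liner (a NumPy sketch):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = dot(a, b) / (||a|| * ||b||)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])
    print(cosine_similarity(a, b))  # 1.0 -- same direction, maximally similar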

Applications

• Semantic search
• Duplicate and paraphrase detection
• Document clustering and recommendation

Analogy Tasks

Embeddings can solve relationships:

Paris : France :: Tokyo : Japan

This shows that models capture structure in language, not just words.

Traditional NLP: The Rule-Based Era

Before deep learning, NLP relied on handcrafted rules.

Rule-Based Systems

Examples:

• Regular expressions for dates, emails, and phone numbers
• Hand-written grammar rules for part-of-speech tagging
• Keyword lexicons for sentiment

Strengths:

• Transparent and easy to debug
• Predictable behavior
• No training data required

Weaknesses:

• Brittle: small variations in phrasing break the rules
• Expensive to build and maintain
• Do not generalize to unseen patterns

Feature Engineering

Engineers manually design features:

Example:

“This movie is great” → positive sentiment

Features might include:

• Counts of positive words such as “great”
• Presence of particular words and phrases (n-grams)
• Punctuation and capitalization cues
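
A toy sketch of such hand-crafted features (the word lists are hypothetical, chosen purely for illustration):

    POSITIVE = {"great", "good", "excellent"}  # hypothetical lexicon
    NEGATIVE = {"bad", "awful", "terrible"}

    def extract_features(text):
        words = text.lower().split()
        return {
            "positive_count": sum(w in POSITIVE for w in words),
            "negative_count": sum(w in NEGATIVE for w in words),
            "has_exclamation": "!" in text,
            "length": len(words),
        }

    print(extract_features("This movie is great"))
    # {'positive_count': 1, 'negative_count': 0, 'has_exclamation': False, 'length': 4}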

Limitations:

• Labor-intensive and domain-specific
• Features must be redesigned for every new task
• Humans cannot anticipate every useful pattern

Neural NLP: Learning Everything from Data

Modern NLP replaces manual design with learning.

 Fig 1.3 Neural NLP

End-to-End Learning

Instead of:

Text → Features → Model → Output

We now have:

Text → Model → Output
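
As a sketch, an end-to-end classifier in PyTorch can be this compact (sizes are hypothetical; the embeddings and the classifier are trained jointly from raw token IDs):

    import torch
    import torch.nn as nn

    class TextClassifier(nn.Module):
        def __init__(self, vocab_size=10_000, embed_dim=64, num_classes=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)  # learned, not hand-crafted
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, token_ids):
            vectors = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
            pooled = vectors.mean(dim=1)         # average over the sequence
            return self.classifier(pooled)       # (batch, num_classes)

    model = TextClassifier()
    logits = model(torch.randint(0, 10_000, (1, 12)))  # one sentence, 12 token IDs
    print(logits.shape)                                # torch.Size([1, 2])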

The model learns:

• Which features of the text matter
• How to represent words and sentences internally
• How to map those representations to outputs

Advantages:

• No manual feature engineering
• Better performance at scale
• Representations transfer across tasks

Challenges:

• Requires large amounts of data and compute
• Harder to interpret than explicit rules

Evaluating NLP Models: Beyond Accuracy

Evaluation is critical but complex.

Accuracy:

Useful for classification.

Precision:

Measures correctness of positive predictions.

Recall:

Measures completeness.

F1 Score: 

Balances precision and recall.
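
All four are available in scikit-learn; a quick sketch on hypothetical labels:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1]  # hypothetical gold labels
    y_pred = [1, 0, 1, 0, 0, 0]  # hypothetical model predictions

    print(accuracy_score(y_true, y_pred))   # ≈ 0.67 -- 4 of 6 correct
    print(precision_score(y_true, y_pred))  # 1.0    -- every predicted positive was right
    print(recall_score(y_true, y_pred))     # 0.5    -- found only half the positives
    print(f1_score(y_true, y_pred))         # ≈ 0.67 -- harmonic mean of the two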

Task-Specific Metrics

BLEU: 

Used for translation.

ROUGE: 

Used for summarization.

Perplexity: 

Measures language model uncertainty.
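
Concretely, perplexity is the exponential of the average negative log-probability the model assigns to the true tokens (a sketch with hypothetical probabilities):

    import numpy as np

    # Probabilities a hypothetical model assigned to each actual next token.
    token_probs = np.array([0.25, 0.10, 0.50, 0.05])

    perplexity = np.exp(-np.mean(np.log(token_probs)))
    print(perplexity)  # ≈ 6.33 -- lower is better: the model is less "surprised"

A perfect model that assigned probability 1.0 to every true token would have perplexity 1.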

The Bigger Insight

Metrics don’t always reflect real-world performance.

A chatbot might score high on benchmarks but still feel unnatural.

Human evaluation remains essential.

The Complete NLP Pipeline

Let’s connect everything:

Raw Text → Tokenization → Normalization → Embeddings → Model → Evaluation → Output

Each step plays a crucial role.

A mistake early in the pipeline can cascade into poor results later.

Why Representation is Everything

At its core, NLP is about representation.

The way we encode language determines:

• What patterns a model can possibly learn
• How well it generalizes to new inputs
• The ceiling on downstream performance

Better representation leads to better intelligence.

The Future of NLP

The field continues to evolve rapidly.

Key Trends

Multimodal Models

Combining text, images, and audio.

Unified Architectures

One model handling multiple tasks.

Low-Resource Learning

Performing well with limited data.

Longer Context Understanding

Handling entire documents instead of short snippets.

Final Reflection

When you interact with AI, it feels natural. Almost human.

But underneath, there is a complex pipeline:

• Tokenization
• Normalization
• Embeddings
• Layers of learned numerical transformations

What feels like understanding is actually mathematical pattern recognition at scale.

And it all begins with how we preprocess and represent text.

Closing Thought

If you’re building AI systems, remember: the quality of your representations matters as much as the sophistication of your models.

Because in the end, AI doesn’t understand language the way humans do.
It understands the representation of language we give it.

If you’re ready to go beyond theory and start your journey in Generative AI today, now is the perfect time to begin.
