RAG Explained

How AI Finally Gets Smarter With Your Data

CodeKerdos.in | Gen-AI Blog Series | Week 3

Rohan had been up until 2 AM the night before the demo.

Three weeks of work. A fully functional AI chatbot, integrated into his company’s internal portal, built on top of GPT-4, with a clean UI and a snappy response time. His manager was impressed just from the screenshots. The CTO had been looped in. There were six people in the meeting room when Rohan opened his laptop and pulled up the chatbot on the projector screen.

“Ask it anything about our product,” he said, leaning back with the quiet confidence of someone who had tested this thing a hundred times.

His manager typed the first question: “What is the refund process for enterprise customers who signed contracts before March 2024?”

The bot thought for a second. Then it answered. Confidently. Fluently. In perfect, professional English.

And it was completely, embarrassingly wrong.

It had described a refund process that did not exist. It had mentioned a 30-day window that the company had never offered. It had cited a “standard enterprise clause” that Rohan had never seen in any internal document.

The CTO leaned forward. “Where is it getting this from?”

Rohan had no good answer. Because the truth was, the model was not getting it from anywhere. It was making it up. Fluently. Politely. And with complete confidence.

That is the fundamental problem with using a raw LLM for anything that requires knowledge of your specific business, your documents, your policies, or your data. The model was trained on the internet. It knows a lot of things. But it has never read your internal wiki, your product documentation, your legal agreements, or your support playbooks.

When you ask it something specific, it does not say “I don’t know.” It fills the gap with the most plausible-sounding answer it can generate. That behavior has a name: hallucination. And in a live demo in front of your CTO, it is one of the most painful things a developer can experience.

“So how do you give the model access to your actual data, without retraining it from scratch?”

That question is exactly what RAG was built to answer.

What is RAG and why does it exist?

RAG stands for Retrieval-Augmented Generation. The name tells you exactly what it does: before the model generates an answer, your application first retrieves relevant information from a source you control. That retrieved information is then passed into the prompt as context, so the model answers from your actual data rather than guessing from its training data.

This solves three problems that make raw LLMs impractical for most real-world business applications:

Knowledge cutoff: LLMs are trained up to a certain date. They know nothing about events, policy changes, or product updates that happened after that cutoff.
Private data: The model was not trained on your company's internal documents, your database, your product catalogue, or your support tickets. It cannot know what it never saw.
Hallucination: When the model does not know something, it often generates a confident-sounding answer anyway. RAG replaces that guess with grounded, verifiable information.

RAG does not require you to retrain or fine-tune the model. You are not touching the model weights at all. You are simply changing what information you hand to the model at the time it answers. That is a critical distinction, and it is what makes RAG so practical and so widely adopted.

The RAG pipeline: how it works step by step

A RAG system has two distinct phases: an indexing phase that runs once (or periodically when your data changes), and a query phase that runs every time a user asks a question.

Phase 1: Indexing your documents

This is the preparation step. You take all the documents you want the model to know about, and you process them into a form that allows fast, meaningful search later.

Load your documents: These can be PDFs, Word files, web pages, database records, support tickets, API responses, or any other text-based content.
Split them into chunks: You break each document into smaller pieces, typically around 300 to 500 words each. This is because the model has a context window limit, and you only want to pass the most relevant chunks, not entire documents.
Convert each chunk into an embedding: An embedding is a list of numbers, a vector, that represents the semantic meaning of that chunk of text. Two chunks that mean similar things will have vectors that are mathematically close to each other, even if they use completely different words.
Store the embeddings in a vector database: This is a special type of database optimized for storing and searching embeddings. Popular options include Pinecone, Weaviate, ChromaDB, and pgvector for PostgreSQL.

Phase 2: Answering a user query

This is the runtime phase. Every time a user asks a question, your application runs the following steps:

Embed the query: You convert the user’s question into an embedding vector using the same embedding model you used during indexing.
Search the vector database: You find the chunks whose embeddings are most similar to the query embedding. This is called a similarity search or nearest neighbor search. You typically retrieve the top 3 to 5 most relevant chunks.
Build the augmented prompt: You take those retrieved chunks and inject them into your prompt as context. You tell the model: here is relevant information from our documents, now answer the user’s question based on this.
Generate the answer: The model reads the retrieved context and the user’s question, and produces an answer grounded in the information you provided.

The key insight

The model is not searching the database. Your application is. The model only ever sees the retrieved text and the user’s question. It generates a response based on what you put in the prompt. RAG is about controlling what goes into that prompt, not about giving the model internet access.
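To make the division of labor concrete, here is a minimal in-memory sketch of both phases. Embedder and Llm are hypothetical interfaces invented for this illustration; in a real system they would wrap an embedding model and an LLM client, and the two lists would be replaced by a vector database.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory sketch of the two RAG phases. Embedder and Llm are hypothetical
// interfaces standing in for a real embedding model and a real LLM client.
final class RagPipeline {

    interface Embedder { double[] embed(String text); }
    interface Llm { String complete(String prompt); }

    private final Embedder embedder;
    private final Llm llm;
    private final List<String> chunks = new ArrayList<>();
    private final List<double[]> vectors = new ArrayList<>();

    RagPipeline(Embedder embedder, Llm llm) {
        this.embedder = embedder;
        this.llm = llm;
    }

    // Phase 1: embed each chunk and store text and vector side by side.
    void index(List<String> newChunks) {
        for (String chunk : newChunks) {
            chunks.add(chunk);
            vectors.add(embedder.embed(chunk));
        }
    }

    // Phase 2: the application retrieves; the model only sees the final prompt.
    String answer(String question, int topK) {
        double[] queryVector = embedder.embed(question);

        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            order.add(i);
        }
        // Highest cosine similarity first.
        order.sort(Comparator.comparingDouble(
                (Integer i) -> cosine(queryVector, vectors.get(i))).reversed());

        StringBuilder context = new StringBuilder();
        for (int i : order.subList(0, Math.min(topK, order.size()))) {
            context.append(chunks.get(i)).append("\n\n");
        }

        String prompt = "Answer using only the context below.\n\nContext:\n"
                + context + "Question: " + question;
        return llm.complete(prompt);
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

Notice that the Llm interface receives nothing but a string. Everything the model knows about your documents arrives through that prompt.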

Understanding embeddings: the heart of RAG

Embeddings are the concept most developers struggle with when they first encounter RAG. Let us take a moment to make it completely clear.

An embedding is a numerical representation of text. When you pass a sentence to an embedding model, it returns a list of numbers, usually hundreds or thousands of them, that capture the meaning of that sentence in a mathematical space.

Here is why that is powerful. Consider these two sentences:

Sentence A: "What is the return policy for enterprise customers?"
Sentence B: "How can a business client get a refund?"

These sentences use completely different words. A traditional keyword search would not match them. But an embedding model understands that they are semantically asking the same thing. Their embedding vectors will be very close to each other in the mathematical space.

This is the core superpower of vector search over keyword search. You are searching by meaning, not by exact words. A user can ask a question in ten different ways and still get the right documents back.

How similarity is measured

The most common way to measure how similar two vectors are is called cosine similarity. Without going into the math, you can think of it as measuring the angle between two arrows in a very high-dimensional space. The smaller the angle, the more similar the meaning. A cosine similarity of 1.0 means the two vectors point in exactly the same direction, which in practice indicates near-identical meaning. A score near 0 means the texts are unrelated, and the theoretical minimum is -1 for vectors pointing in opposite directions.
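If you do want to see the math, it is short. The sketch below computes cosine similarity for two plain double arrays. The class name and the choice to return 0 for a zero-length vector are conventions picked for this example, not part of any library.

```java
final class CosineSimilarity {

    // Cosine of the angle between two vectors of equal dimension.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention for this sketch: a zero vector matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up three-dimensional vectors; real embeddings have hundreds of dimensions.
        double[] returnPolicy = {0.9, 0.1, 0.0};
        double[] refundQuestion = {0.8, 0.2, 0.1};
        double[] officeHours = {0.0, 0.1, 0.9};
        System.out.println(cosine(returnPolicy, refundQuestion));
        System.out.println(cosine(returnPolicy, officeHours));
    }
}
```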

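Conceptually, your vector database compares the query vector against the chunks in your index every time a user asks a question, and returns the top matches ranked by score. Comparing against every single chunk would be slow at scale, so vector databases rely on approximate nearest neighbor indexes such as HNSW, which examine only a small fraction of the stored vectors. That is why the search stays fast even across millions of chunks, at the cost of occasionally missing a true nearest neighbor.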
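For intuition, the exhaustive version of this search fits in a few lines. The sketch below scores every stored vector against the query and keeps the best k. It is a teaching aid under the assumption that your index is small enough to scan; production vector databases avoid the full scan by using approximate indexes. The names BruteForceSearch and Scored are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BruteForceSearch {

    // Position of a chunk in the index plus its similarity to the query.
    record Scored(int index, double score) {}

    static List<Scored> topK(double[] query, List<double[]> chunkVectors, int k) {
        List<Scored> scored = new ArrayList<>();
        for (int i = 0; i < chunkVectors.size(); i++) {
            scored.add(new Scored(i, cosine(query, chunkVectors.get(i))));
        }
        // Sort by similarity, best match first.
        scored.sort(Comparator.comparingDouble(Scored::score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(k, scored.size())));
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

The cost of this loop grows linearly with the number of chunks, which is exactly the work that specialized indexes are designed to avoid.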

What a RAG prompt actually looks like

After retrieval, your application assembles a prompt and sends it to the LLM. Here is a simplified but realistic example of what that assembled prompt looks like:

System:
  You are a customer support assistant for CodeKerdos.in.
  Answer questions using ONLY the information provided in the context below.
  If the context does not contain enough information to answer the question,
  say: "I do not have specific information about that. Please contact our support team."
  Do not make up information that is not in the context.
 
Context (retrieved from your knowledge base):
  [Chunk 1]: Enterprise customers who signed contracts before March 2024 are covered
  under the legacy refund policy. Refund requests must be submitted within 45 days
  of the billing date via the enterprise support portal.
 
  [Chunk 2]: To initiate a refund, enterprise customers should contact their
  dedicated account manager or email enterprise-support@codekerdos.in with
  their contract ID and a brief description of the issue.
 
User Question:
  What is the refund process for enterprise customers who signed before March 2024?

Now the model has the actual answer right in front of it. Instead of guessing, it is reading and summarizing, so the response is far more likely to be accurate, specific, and grounded in your real documentation. With this setup, Rohan’s demo would very likely have gone differently.

Chunking strategy: the detail most tutorials skip

How you split your documents into chunks has a massive impact on the quality of your RAG system. Most tutorials gloss over this. In practice, poor chunking is one of the most common reasons RAG systems return irrelevant results.

Fixed-size chunking

The simplest approach is to split documents every N characters or every N words. It is fast and easy to implement. The problem is that it often splits sentences and paragraphs in the middle, which breaks context and makes chunks harder for the embedding model to understand meaningfully.
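A minimal sketch of character-based fixed-size chunking shows both the appeal and the problem. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedSizeChunker {

    // Cuts the text every chunkSize characters, ignoring sentence boundaries.
    static List<String> chunkByChars(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String policy = "Refunds must be requested within 45 days. Contact your account manager.";
        // With a size of 30, the first cut lands in the middle of a sentence.
        chunkByChars(policy, 30).forEach(chunk -> System.out.println("[" + chunk + "]"));
    }
}
```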

Sentence or paragraph-based chunking

A better approach is to split at natural boundaries: sentence endings, paragraph breaks, or section headers. This preserves the coherence of each chunk and generally produces better embeddings because the text is self-contained.
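One way to sketch paragraph-based chunking is to split on blank lines and then pack whole paragraphs into a chunk until a word budget is reached. The maxWords budget and the blank-line rule are assumptions of this example. Your documents may need different boundaries, such as section headers.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphChunker {

    // Packs whole paragraphs into chunks of at most maxWords words.
    // A single paragraph longer than maxWords is kept intact as its own chunk.
    static List<String> chunk(String text, int maxWords) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentWords = 0;

        // A blank line (two line breaks, optionally with whitespace between) ends a paragraph.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int words = trimmed.split("\\s+").length;
            if (currentWords > 0 && currentWords + words > maxWords) {
                chunks.add(current.toString());
                current.setLength(0);
                currentWords = 0;
            }
            if (currentWords > 0) {
                current.append("\n\n");
            }
            current.append(trimmed);
            currentWords += words;
        }
        if (currentWords > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```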

Overlapping chunks

A common technique is to add overlap between chunks, meaning the last few sentences of one chunk are repeated at the beginning of the next. This ensures that information at a boundary between two chunks is not lost. Typical overlap values are 10 to 20 percent of the chunk size.
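Overlap is easy to add to a word-based splitter: instead of advancing by the full chunk size, you advance by the chunk size minus the overlap. The sketch below assumes whitespace-separated words and uses illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlappingChunker {

    // Splits text into chunks of chunkWords words, where each chunk repeats
    // the last overlapWords words of the previous one.
    static List<String> chunk(String text, int chunkWords, int overlapWords) {
        if (chunkWords <= 0 || overlapWords < 0 || overlapWords >= chunkWords) {
            throw new IllegalArgumentException("Require 0 <= overlapWords < chunkWords");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkWords - overlapWords;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last chunk already reaches the end of the text
            }
        }
        return chunks;
    }
}
```

With a 400-word chunk, an overlap of 40 to 80 words corresponds to the 10 to 20 percent range mentioned above.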

Semantic chunking

The most advanced approach uses embeddings to determine where to split. You embed each sentence, and when the semantic similarity between consecutive sentences drops significantly, that is a natural break point. This is slower to compute but produces the highest quality chunks for complex documents.
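Here is a rough sketch of that idea. It assumes the sentences have already been split and embedded elsewhere, and it uses a fixed similarity threshold as the break rule. Both the threshold value and the names are assumptions of this example, and real implementations often use a relative drop rather than a fixed cutoff.

```java
import java.util.ArrayList;
import java.util.List;

final class SemanticChunker {

    // Starts a new chunk whenever the similarity between two consecutive
    // sentence embeddings falls below the threshold.
    static List<String> chunk(List<String> sentences, List<double[]> vectors, double threshold) {
        if (sentences.size() != vectors.size()) {
            throw new IllegalArgumentException("Need exactly one vector per sentence");
        }
        List<String> chunks = new ArrayList<>();
        if (sentences.isEmpty()) {
            return chunks;
        }
        StringBuilder current = new StringBuilder(sentences.get(0));
        for (int i = 1; i < sentences.size(); i++) {
            double similarity = cosine(vectors.get(i - 1), vectors.get(i));
            if (similarity < threshold) {
                chunks.add(current.toString());          // topic shift: close the chunk
                current = new StringBuilder(sentences.get(i));
            } else {
                current.append(' ').append(sentences.get(i));
            }
        }
        chunks.add(current.toString());
        return chunks;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```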

For most production applications, paragraph-based chunking with some overlap is a strong starting point. Experiment with your specific documents and measure retrieval quality before investing in more sophisticated strategies.

Choosing an embedding model

Not all embedding models are equal, and the choice matters more than most developers realize. Here are the main options you will encounter:

OpenAI text-embedding-3-small and text-embedding-3-large

OpenAI’s embedding models are among the most widely used. They are easy to call via API, produce high-quality embeddings, and integrate cleanly with other OpenAI tools. The small variant is fast and cost-effective for most use cases. The large variant produces higher-dimensional embeddings and performs better on complex retrieval tasks. The trade-off is cost and latency.

Sentence Transformers (open source)

The Sentence Transformers library from Hugging Face offers a wide range of open-source models that you can run locally or on your own infrastructure. Models like all-MiniLM-L6-v2 are small, fast, and free to run. If data privacy is a concern and you cannot send documents to an external API, these models are an excellent choice.

Google and Cohere embeddings

Google’s textembedding-gecko models on Vertex AI and Cohere’s Embed v3 are strong alternatives, particularly for multilingual use cases. If your documents are in Hindi, regional languages, or a mix of languages, it is worth benchmarking their multilingual variants against OpenAI’s models on your own content, because they can perform better on non-English text.

One important rule: always use the same embedding model to index your documents and to embed user queries. If you index with model A and query with model B, the vectors will be in different mathematical spaces and similarity scores will be meaningless.
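One inexpensive way to enforce this rule is to record which model produced each vector and reject anything that does not match. The sketch below is a hypothetical guard, not a feature of any particular vector database. The Embedding record and the model identifiers are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ModelCheckedIndex {

    // A vector tagged with the identifier of the model that produced it.
    record Embedding(String modelId, double[] vector) {}

    private final String modelId;
    private final int dimensions;
    private final List<Embedding> stored = new ArrayList<>();

    ModelCheckedIndex(String modelId, int dimensions) {
        this.modelId = modelId;
        this.dimensions = dimensions;
    }

    void add(Embedding embedding) {
        requireCompatible(embedding);
        stored.add(embedding);
    }

    // Call this for query embeddings too, before running a similarity search.
    void requireCompatible(Embedding embedding) {
        if (!modelId.equals(embedding.modelId())) {
            throw new IllegalArgumentException(
                    "Index was built with " + modelId + " but got " + embedding.modelId());
        }
        if (embedding.vector().length != dimensions) {
            throw new IllegalArgumentException(
                    "Expected " + dimensions + " dimensions but got " + embedding.vector().length);
        }
    }

    int size() {
        return stored.size();
    }
}
```

A dimension check alone is not enough, because two different models can produce vectors of the same length that still live in unrelated spaces.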

Vector databases: where your knowledge lives

A vector database is purpose-built to store, index, and search embedding vectors efficiently. It is the memory layer of your RAG system. Here are the main options:

Pinecone: A fully managed cloud-based vector database. Easy to set up, scales well, and has a generous free tier. Good choice for getting started quickly.
ChromaDB: Open source, lightweight, and easy to run locally. Excellent for development and smaller production workloads. Can be embedded directly in your application process.
Weaviate: Open source with a cloud option. Strong support for hybrid search, combining vector search with traditional keyword filters.
pgvector: A PostgreSQL extension that adds vector search capabilities to a database you already know how to operate. If your team is already on PostgreSQL, this is often the lowest-friction path.
Qdrant: A newer open-source option with a strong performance reputation and good support for filtering during search.

If you are just starting out and want to move fast, ChromaDB for local development and Pinecone for production is a common and well-proven combination. If your team is already running PostgreSQL, pgvector is worth a serious look because it eliminates a new infrastructure dependency.

Implementing RAG with Java and Spring Boot

If you are coming from a Java background, you might be wondering whether RAG is only for Python developers. It is not. The Java ecosystem has matured significantly, and you have good options.

Spring AI

Spring AI is the official Spring project for AI integration. It provides abstractions for LLMs, embedding models, and vector stores that feel familiar to any Spring Boot developer. It supports multiple LLM providers (OpenAI, Anthropic, Azure, Ollama) and multiple vector stores (Pinecone, pgvector, Chroma, and others) through a consistent API.

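<!-- Adding Spring AI to your Spring Boot project -->
<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>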

A simple RAG flow in Spring AI

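import java.util.List;
import java.util.stream.Collectors;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

// Written against the pre-1.0 milestone API that matches the starters above.
// Later Spring AI releases rename some of these calls (for example, a
// builder-style SearchRequest), so check the version you are using.
@Service
public class RagService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    // Spring injects the configured vector store and a ChatClient.Builder.
    public RagService(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClientBuilder.build();
    }

    public String answer(String userQuestion) {

        // Step 1: Retrieve the 4 chunks most similar to the question
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.query(userQuestion).withTopK(4)
        );

        // Step 2: Build context from retrieved chunks
        String context = docs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n"));

        // Step 3: Build the augmented prompt
        String prompt = """
            Answer the question using only the context below.
            If the answer is not in the context, say you do not know.

            Context:
            %s

            Question: %s
            """.formatted(context, userQuestion);

        // Step 4: Generate answer
        return chatClient.prompt(prompt).call().content();
    }
}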

This is simplified for clarity, but it captures the real structure of a RAG service. The vector store does the retrieval. Your code assembles the context. The chat client generates the answer. In a real application you would add error handling, logging, and caching, but the core logic is exactly this.

Common mistakes in RAG implementations

Retrieving too many chunks

More context is not always better. Passing 10 or 15 chunks into the prompt bloats your token count, increases cost, and can actually confuse the model with too much information. Start with the top 3 to 5 most relevant chunks and measure answer quality. Only increase if you are consistently missing relevant information.

Ignoring chunk quality during indexing

Garbage in, garbage out. If your source documents are poorly formatted, full of tables that do not parse cleanly, or contain navigation menus and boilerplate mixed with actual content, your chunks will be noisy and your embeddings will be poor. Clean your documents before indexing. Remove headers, footers, and non-content text. The quality of your retrieval is only as good as the quality of what you indexed.

Not re-indexing when documents change

Your vector database is a snapshot of your documents at the time you indexed them. If your policies change, your product documentation is updated, or new content is added, you need to re-index. Build a pipeline that handles incremental updates, not just a one-time bulk load.
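A simple way to detect what needs re-indexing is to store a content hash per document and compare it on every sync. The sketch below tracks hashes in memory and returns the IDs of new or changed documents. The class name, the document-ID scheme, and the in-memory map are assumptions of this example. In production you would persist the hashes and also delete the old chunks of changed or removed documents from the vector store. It uses HexFormat, which requires Java 17 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;

final class IncrementalIndexer {

    // Document ID -> SHA-256 of the content that was last indexed.
    private final Map<String, String> indexedHashes = new HashMap<>();

    // Returns the IDs of documents that are new or whose content changed.
    List<String> sync(Map<String, String> currentDocs) {
        List<String> toReindex = new ArrayList<>();
        for (Map.Entry<String, String> entry : currentDocs.entrySet()) {
            String hash = sha256(entry.getValue());
            if (!hash.equals(indexedHashes.get(entry.getKey()))) {
                toReindex.add(entry.getKey());
                indexedHashes.put(entry.getKey(), hash);
            }
        }
        // Forget documents that no longer exist in the source.
        indexedHashes.keySet().retainAll(currentDocs.keySet());
        return toReindex;
    }

    static String sha256(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```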

Skipping evaluation

The most common mistake is shipping a RAG system without measuring how well it actually retrieves. Build a small test set of questions with known correct answers. Run your retrieval pipeline and measure whether the right chunks are being returned. If the retrieval is off, fix chunking and embedding strategy before touching the generation layer.
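A retrieval test can be as small as the sketch below: for each question, check whether the chunk you expect appears among the top k results, and report the fraction of hits. The Retriever interface, the TestCase record, and the chunk IDs are hypothetical. You would adapt Retriever to call your own vector store.

```java
import java.util.List;

final class RetrievalEval {

    // A question paired with the ID of the chunk that should be retrieved for it.
    record TestCase(String question, String expectedChunkId) {}

    // Hypothetical adapter around your retrieval pipeline: returns chunk IDs, best first.
    interface Retriever {
        List<String> retrieve(String question, int k);
    }

    // Fraction of test cases whose expected chunk appears in the top k results.
    static double hitRate(List<TestCase> cases, Retriever retriever, int k) {
        if (cases.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (TestCase testCase : cases) {
            List<String> retrieved = retriever.retrieve(testCase.question(), k);
            if (retrieved.contains(testCase.expectedChunkId())) {
                hits++;
            }
        }
        return (double) hits / cases.size();
    }
}
```

Tracking this number while you change chunk size, overlap, or embedding model tells you whether a change actually improved retrieval, before the LLM is involved at all.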

What is next in this series?

Now that you understand how RAG works at a conceptual and architectural level, the next step is understanding the infrastructure that makes it possible at scale: vector databases.

In Week 4, we go deep on vector databases. We will cover how they work internally, how to choose the right one for your use case, and how to structure your data so that retrieval stays fast and accurate as your knowledge base grows. If RAG is the technique, the vector database is the engine that powers it.

Key Takeaway

RAG is the difference between a chatbot that sounds smart and a chatbot that actually is smart about your business. It does not require retraining the model. It requires giving the model the right information at the right time. Get your chunking right, choose your embedding model carefully, and build your retrieval pipeline with the same discipline you bring to the rest of your application stack.

Follow the full series at codekerdos.in

New post every week. From RAG foundations to building production-grade AI features with Java and Spring Boot. Join our WhatsApp community for live Q&A and hands-on exercises.