Vector Databases

The Memory Layer Every AI Application Needs

CodeKerdos.in | Gen-AI Blog Series | Week 4

Priya had built something she was genuinely proud of.

A RAG-based internal chatbot for her company’s HR team. Employees could ask it anything: leave policies, payroll cycles, appraisal timelines, onboarding checklists. The bot would search through 400 internal documents and come back with accurate, grounded answers in under two seconds. During the pilot with 20 employees, it worked beautifully. Her manager called it a game-changer. The rollout to all 1,200 employees was scheduled for the following Monday.

By Tuesday morning, her phone would not stop buzzing.

The response time had gone from 2 seconds to 14. Some queries timed out entirely. The bot was returning chunks from completely unrelated departments, mixing up finance policies with engineering leave rules. One employee had asked about the maternity leave policy and received a paragraph about server maintenance schedules.

Priya sat at her desk and stared at the logs. The retrieval was broken. But the code had not changed. The documents had not changed. The only thing that had changed was the number of users and the volume of queries hitting her system.

What she had built during the pilot was technically a RAG system. But the vector database she was using, an in-memory ChromaDB instance running inside her application process, was never designed to handle concurrent queries at that scale. It did not support the kind of filtering she needed. It had no indexing optimized for her query patterns. And when 200 employees hit it at the same time on Tuesday morning, it fell apart.

“I had built the AI layer properly. What I had never thought through was the data layer underneath it.”

That gap, between understanding how RAG works conceptually and understanding how to choose and configure the right vector database for production, is where a lot of real projects get stuck. This week, we close that gap.

What makes a vector database different from a regular database?

Every developer knows what a relational database does. You store structured data in rows and columns, and you query it with SQL. Lookups are fast because the database uses B-tree indexes that allow it to find exact matches or range comparisons in logarithmic time.

A vector database does something fundamentally different. Instead of storing structured rows, it stores high-dimensional vectors: lists of floating-point numbers that represent the meaning of a piece of text, an image, a product, or any other content that has been passed through an embedding model. And instead of looking for exact matches, it searches for approximate nearest neighbors, the vectors in the database that are most similar to a given query vector.

That is a completely different kind of problem. You cannot solve it with a B-tree. A naive approach of computing the distance between a query vector and every single stored vector would work for 1,000 documents. At 1 million documents it becomes unbearably slow. Vector databases exist specifically to make this search fast at scale, using specialized indexing algorithms designed for high-dimensional similarity search.

The core difference in one line

A relational database answers: does this row match exactly? A vector database answers: which stored items are closest in meaning to this query?
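To make "closest in meaning" concrete: most vector databases score closeness with a measure such as cosine similarity. The sketch below is a minimal illustration of that measure in plain Java. The class and method names are made up for this example and do not come from any particular library.

```java
final class VectorMath {
    private VectorMath() {}

    /** Cosine similarity: 1.0 means same direction, 0.0 means orthogonal (unrelated). */
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here: a zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two embeddings that point in nearly the same direction score close to 1.0, which is what "similar meaning" translates to numerically.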

How vector indexes actually work

You do not need to implement these algorithms yourself, but understanding them at a high level will help you make smarter configuration decisions and debug performance problems when they appear.

Flat search: the baseline

The simplest approach is to store all vectors and, at query time, compute the distance between the query vector and every single stored vector. This is called a flat or brute-force search. It is perfectly accurate: you are guaranteed to find the true nearest neighbors every time.

The problem is speed. If you have 10 million vectors and each has 1,536 dimensions (the output size of OpenAI’s text-embedding-3-small model), every query means 10 million distance computations over 1,536 numbers each. Flat search is fine for small datasets in development. It is impractical for anything resembling production scale.
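Here is roughly what flat search looks like in code. This is a minimal sketch with hypothetical names (FlatSearch, Hit), using cosine similarity as the score. It is meant to show why the cost grows linearly with the number of stored vectors, not to serve as a production implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class FlatSearch {
    /** Position of a stored vector and its similarity to the query. */
    record Hit(int index, double score) {}

    private FlatSearch() {}

    /** Scores every stored vector against the query and returns the k best, highest first. */
    static List<Hit> topK(List<float[]> stored, float[] query, int k) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            hits.add(new Hit(i, cosine(stored.get(i), query)));
        }
        hits.sort((x, y) -> Double.compare(y.score(), x.score()));
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // zero vectors have no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Every call touches every stored vector, so doubling the dataset doubles the work per query. That linear cost is exactly what the indexes in the next sections are designed to avoid.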

IVF: Inverted File Index

IVF is one of the earliest solutions to the scale problem. During indexing, the algorithm clusters all your vectors into groups, called cells or Voronoi regions, using a technique similar to k-means clustering. Each vector is assigned to the cluster whose centroid it is closest to.

At query time, instead of searching all vectors, the system first identifies which clusters are closest to the query vector, and then searches only within those clusters. This dramatically reduces the number of comparisons. The trade-off is that it is approximate: the true nearest neighbor might be in a neighboring cluster that you did not search. You control how many clusters to probe, which is a tunable balance between speed and accuracy.
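The probe trade-off is easier to see in a toy version. The sketch below assumes the centroids have already been computed (a real IVF index would train them with k-means) and uses a hypothetical IvfIndex class with squared Euclidean distance. It is an illustration of the idea, not how any specific database implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class IvfIndex {
    private final float[][] centroids;
    private final List<List<float[]>> cells = new ArrayList<>();

    /** Assumption: centroids are precomputed; real systems learn them with k-means. */
    IvfIndex(float[][] centroids) {
        this.centroids = centroids;
        for (int i = 0; i < centroids.length; i++) {
            cells.add(new ArrayList<>());
        }
    }

    /** Stores the vector in the cell of its nearest centroid. */
    void add(float[] vector) {
        cells.get(rankCentroids(vector)[0]).add(vector);
    }

    /** Searches only the nProbe cells whose centroids are nearest to the query. */
    float[] search(float[] query, int nProbe) {
        Integer[] order = rankCentroids(query);
        float[] best = null;
        double bestDistance = Double.MAX_VALUE;
        for (int p = 0; p < Math.min(nProbe, order.length); p++) {
            for (float[] candidate : cells.get(order[p])) {
                double d = squaredDistance(candidate, query);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = candidate;
                }
            }
        }
        return best; // null when every probed cell is empty
    }

    /** Returns centroid indexes sorted from nearest to farthest. */
    private Integer[] rankCentroids(float[] v) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(c -> squaredDistance(centroids[c], v)));
        return order;
    }

    private static double squaredDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

With centroids at (0, 0) and (10, 0), stored vectors (4.9, 0) and (9, 0), and a query at (5.2, 0), probing one cell returns (9, 0) because the query's nearest centroid is (10, 0). Probing two cells finds the true nearest neighbor, (4.9, 0). That is the speed-versus-accuracy dial in miniature.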

HNSW: Hierarchical Navigable Small World

HNSW is currently the dominant algorithm in most modern vector databases, and for good reason. It builds a graph structure where each vector is a node connected to its approximate nearest neighbors. The graph is organized in multiple layers: the upper layers have long-range connections for fast navigation across the space, and the lower layers have dense local connections for precision.

At query time, the search starts at the top layer and navigates greedily toward the query vector, descending through layers as it narrows in on the answer. This gives HNSW excellent query speed, high recall (it rarely misses the true nearest neighbor), and good performance even on very high-dimensional vectors.

The trade-off is memory. HNSW stores the graph in RAM, which means your entire index needs to fit in memory for optimal performance. For very large datasets, this can be a significant infrastructure cost.
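The heart of HNSW is a greedy walk over a neighbor graph. The sketch below shows that walk on a single layer only, with a hypothetical GreedyGraphSearch class and a hand-built graph. Real HNSW repeats this on each layer from top to bottom and tracks a list of candidates rather than a single best node, so treat this as a simplified illustration.

```java
import java.util.List;
import java.util.Map;

final class GreedyGraphSearch {
    private GreedyGraphSearch() {}

    /**
     * Starts at entryPoint and keeps moving to whichever neighbor is closer
     * to the query, stopping when no neighbor improves on the current node.
     * Returns the index of the node where the walk ends.
     */
    static int search(float[][] vectors,
                      Map<Integer, List<Integer>> neighbors,
                      int entryPoint,
                      float[] query) {
        int current = entryPoint;
        double currentDistance = squaredDistance(vectors[current], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int candidate : neighbors.getOrDefault(current, List.of())) {
                double d = squaredDistance(vectors[candidate], query);
                if (d < currentDistance) {
                    current = candidate;
                    currentDistance = d;
                    improved = true;
                }
            }
        }
        return current;
    }

    private static double squaredDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

The walk only ever looks at the neighbors of the nodes it visits, which is why it is fast. It is also why the graph has to be held in memory: each hop is a random access into the index.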

What this means for you in practice

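Most managed vector databases handle index selection and configuration automatically. When you do need to tune, the knobs follow directly from the trade-offs above: for IVF, the number of clusters probed per query trades speed against recall; for HNSW, the number of connections per node and the search effort at build and query time trade memory and latency against recall.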

The vector databases you will actually encounter

The ecosystem has grown quickly and the options can be overwhelming. Here is an honest breakdown of the main players, what they are good at, and where they fall short.

Pinecone

Pinecone is a fully managed, cloud-native vector database. You do not manage any infrastructure. You create an index, upload vectors, and query. It scales automatically, supports metadata filtering natively, and has a well-documented API that integrates cleanly with every major LLM framework.

The free tier is generous enough for prototyping and small production workloads. Pricing scales with the number of vectors and queries, which can become significant at high scale. The main limitation is that you have no control over the underlying infrastructure, which is a concern for teams with strict data residency requirements.

Best for: teams that want to move fast, do not want to manage infrastructure, and are building on cloud-native stacks.

ChromaDB

ChromaDB is open source and extremely easy to get started with. You can run it in-process (embedded directly in your application) or as a standalone server. It has a clean Python and JavaScript API and requires almost no configuration to get a working vector store running locally.

The limitation Priya ran into is real: ChromaDB’s in-process mode is not designed for concurrent access or large-scale production workloads. In server mode it performs significantly better, but it still lacks some of the advanced filtering and hybrid search capabilities of more mature options.

Best for: local development, prototyping, and small-scale applications where simplicity matters more than scale.

Weaviate

Weaviate is an open-source vector database with a cloud-managed option. Its standout feature is native hybrid search: it can combine vector similarity search with traditional keyword (BM25) search in a single query, which is genuinely useful for applications where users search using specific terms alongside semantic queries.

Weaviate also supports a schema-based approach to storing objects, which gives you more structure than a simple key-value store. It is a strong choice when your retrieval needs to combine meaning and keywords, such as in a product search or legal document system.

Best for: applications that need hybrid search, schema-enforced data, or teams comfortable with open-source infrastructure.

pgvector

pgvector is a PostgreSQL extension that adds vector storage and similarity search directly into the relational database you already know. You add a vector column to any PostgreSQL table, store your embeddings there, and query them using SQL with special similarity operators.

The appeal is enormous for teams already running PostgreSQL. You get vectors in the same database as your application data, with ACID transactions, familiar tooling, standard backups, and no new infrastructure to manage. You can join vector search results with relational data in a single query, which opens up filtering and enrichment patterns that are awkward in standalone vector databases.

The limitation is that PostgreSQL was not built as a vector database, and at very high scale (tens of millions of vectors with high query throughput), purpose-built databases will outperform it. But for a large majority of production use cases, pgvector is more than capable and significantly simpler to operate.

Best for: teams running PostgreSQL who want the lowest-friction path to adding vector search without a new infrastructure dependency.

Qdrant

Qdrant is an open-source vector database written in Rust, which gives it strong performance characteristics and low memory overhead. It has an excellent filtering system that applies metadata conditions as part of the vector search itself rather than as a separate pre- or post-processing step, which is critical for multi-tenant applications where different users should only see their own data.

Qdrant also supports payload indexing, which means you can create traditional indexes on metadata fields to make filtered searches dramatically faster. It offers both a cloud-managed option and a self-hosted path with good documentation.

Best for: teams that need advanced filtering, are comfortable self-hosting, or are building multi-tenant RAG systems.

Milvus

Milvus is a distributed, open-source vector database designed for very large scale: billions of vectors, high query throughput, and horizontal scaling across multiple nodes. It is the most powerful option in terms of raw scale, but also the most complex to deploy and operate.

Unless you are at a scale where simpler options have demonstrably failed, Milvus is probably more infrastructure than you need. It is worth knowing it exists for when you get there.

Metadata filtering: the feature that makes RAG production-ready

Pure vector search returns the most semantically similar chunks from across your entire index. In many real applications, that is not what you want. You want the most semantically similar chunks within a specific scope.

Consider Priya’s HR chatbot. An employee in the engineering department asks about leave policies. The ideal result is not just any chunk about leave policies: it is the chunk about leave policies specifically for engineering employees. Without filtering, the bot might return a leave policy for contract staff, or for a different regional office, or from a document that was valid two years ago but has since been superseded.

Metadata filtering solves this. When you index a document, you attach metadata to each chunk: things like department, document type, date, category, region, access level, or any other attribute relevant to your domain. When a user queries, your application adds filter conditions that restrict the search to chunks with matching metadata.

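// Querying with metadata filters in Pinecone (Java, illustrative: exact builder
// and method names depend on the SDK version; the filter itself is a protobuf Struct)
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
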
QueryRequest queryRequest = QueryRequest.builder()
    .vector(queryEmbedding)
    .topK(5)
    .filter(Struct.newBuilder()
        .putFields("department",
            Value.newBuilder().setStringValue("engineering").build())
        .putFields("document_type",
            Value.newBuilder().setStringValue("policy").build())
        .putFields("valid",
            Value.newBuilder().setBoolValue(true).build())
        .build())
    .build();

That query will only search among chunks tagged as current engineering policy documents. The result quality improves dramatically, not because the semantic search got smarter, but because you scoped the search to the right subset of your data.
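Conceptually, a filtered query does two things: it narrows the candidate set by metadata, then ranks what is left by similarity. The sketch below shows that idea in memory with hypothetical names (FilteredSearch, Chunk). It assumes embeddings are already normalized, so a plain dot product can stand in for cosine similarity. Real databases combine both steps inside the index rather than scanning a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FilteredSearch {
    /** A stored chunk: its text, its embedding, and its metadata tags. */
    record Chunk(String text, float[] embedding, Map<String, String> metadata) {}

    private FilteredSearch() {}

    /** Keeps only chunks whose metadata contains every filter entry, then ranks by similarity. */
    static List<Chunk> search(List<Chunk> chunks, float[] query,
                              Map<String, String> filter, int k) {
        List<Chunk> candidates = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (chunk.metadata().entrySet().containsAll(filter.entrySet())) {
                candidates.add(chunk);
            }
        }
        candidates.sort((x, y) ->
            Double.compare(dot(y.embedding(), query), dot(x.embedding(), query)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    /** Assumption: vectors are unit length, so the dot product equals cosine similarity. */
    private static double dot(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }
}
```

Notice that a finance chunk with an embedding identical to the query can never be returned for an engineering filter. The scoping happens before similarity is even considered.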

Metadata design is one of the most underrated skills in building RAG systems. Spend time thinking about what attributes your users will implicitly or explicitly filter by, and index those attributes alongside your vectors from the beginning. Retrofitting metadata into an existing index is painful.

Hybrid search: when semantics alone is not enough

Pure vector search is powerful, but it has a known weakness. It sometimes misses results that contain specific keywords or exact terms that the user is looking for, especially for things like product codes, legal clause numbers, API endpoint names, or any other highly specific identifiers that do not have obvious semantic neighbors.

Imagine a developer asking: “What does the error code E4029 mean?” A semantic search for that question might return chunks about error handling in general. What the developer actually needs is the specific document that mentions E4029 by name. A traditional keyword search would find it instantly.

Hybrid search combines both approaches. It runs a vector similarity search and a keyword search in parallel, then merges the results using a ranking algorithm. The most common merging technique is called Reciprocal Rank Fusion, which gives a combined score based on each result’s rank in both lists.

The result is a system that handles both types of queries well: vague, meaning-based questions get good semantic results, and specific, keyword-based queries surface the right exact matches.
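Reciprocal Rank Fusion is simple enough to sketch in a few lines. Each result earns 1 / (k + rank) from every list it appears in, and the scores are summed. The class name below is hypothetical, and the constant k = 60 used in the usage note is a commonly quoted default rather than something this article prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReciprocalRankFusion {
    private ReciprocalRankFusion() {}

    /**
     * Merges several ranked lists of document ids into one.
     * Ranks are 1-based: the first item in a list contributes 1 / (k + 1).
     */
    static List<String> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return fused;
    }
}
```

For a vector ranking of [A, B, C] and a keyword ranking of [C, A, D] with k = 60, the fused order is A, C, B, D. Documents that appear in both lists rise to the top, and nothing depends on the raw scores of the two searches being comparable.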

Weaviate has native hybrid search. Qdrant supports it. pgvector can achieve it by combining a vector similarity query with a PostgreSQL full-text search in the same SQL statement. If your application handles diverse query types, hybrid search is worth the added complexity.

pgvector with Spring Boot: a practical walkthrough

Since most CodeKerdos readers work with Java and Spring Boot, let us walk through setting up pgvector as your vector store with Spring AI.

Step 1: Enable pgvector on your PostgreSQL instance

-- Run this in your PostgreSQL database
CREATE EXTENSION IF NOT EXISTS vector;
 
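-- Create the table for your document chunks
-- (Note: by default Spring AI's PgVectorStore manages its own table, named vector_store,
--  and keeps chunk metadata in a JSON column; use a hand-written schema like this one
--  when you query the table directly with SQL)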
CREATE TABLE document_chunks (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content     TEXT NOT NULL,
    embedding   VECTOR(1536),
    department  TEXT,
    doc_type    TEXT,
    source_file TEXT,
    created_at  TIMESTAMP DEFAULT now()
);
 
-- Create an HNSW index for fast similarity search
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Step 2: Configure Spring AI in your application

# application.yml
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/yourdb
    username: ${DB_USER}
    password: ${DB_PASS}
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small
    # vectorstore sits under the same "ai" key: YAML does not allow a key to appear twice
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536

Step 3: Indexing documents

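import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class DocumentIndexingService {

    private final VectorStore vectorStore;
    // DocumentLoader is your own component that reads a file into plain text
    private final DocumentLoader documentLoader;

    public DocumentIndexingService(VectorStore vectorStore, DocumentLoader documentLoader) {
        this.vectorStore = vectorStore;
        this.documentLoader = documentLoader;
    }

    public void indexDocument(String filePath, String department, String docType) {
        // Load and parse your document
        String rawText = documentLoader.load(filePath);

        // Split into chunks of about 500 tokens (Spring AI has built-in text splitters).
        // Arguments: chunk size, min chunk size in chars, min length to embed,
        // max chunks, keep separators. All but the first are the library defaults.
        TokenTextSplitter splitter = new TokenTextSplitter(500, 350, 5, 10000, true);
        List<Document> chunks = splitter.apply(List.of(new Document(rawText)));

        // Attach metadata to each chunk
        chunks.forEach(chunk -> {
            chunk.getMetadata().put("department", department);
            chunk.getMetadata().put("doc_type", docType);
            chunk.getMetadata().put("source_file", filePath);
        });

        // Store: Spring AI embeds automatically and saves to pgvector
        vectorStore.add(chunks);
    }
}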

Step 4: Querying with filters

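import java.util.List;
import java.util.stream.Collectors;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vectorstore.filter.Filter;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;
import org.springframework.stereotype.Service;

// Written against the Spring AI milestone API. Newer releases use
// SearchRequest.builder() and Document.getText() instead of the calls below.
@Service
public class RagQueryService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public RagQueryService(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClientBuilder.build();
    }

    public String answer(String question, String department) {

        // Build a filter expression for metadata
        FilterExpressionBuilder filter = new FilterExpressionBuilder();
        Filter.Expression deptFilter = filter.eq("department", department).build();

        // Retrieve top 4 relevant chunks for this department
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(4)
                .withFilterExpression(deptFilter)
        );

        String context = docs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n"));

        String prompt = """
            Answer using only the context below.
            If unsure, say you do not have that information.

            Context: %s

            Question: %s
            """.formatted(context, question);

        return chatClient.prompt().user(prompt).call().content();
    }
}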

With that setup, Priya’s chatbot would have returned engineering-specific policies to engineering employees and finance policies to the finance team, with no cross-contamination. The metadata filter is doing the scoping that the vector search alone could never do.

Production considerations you cannot ignore


Latency

Vector search is fast but not free. For applications where every millisecond matters, keep your index in memory (HNSW is in-memory by default in most databases), co-locate your vector database with your application servers to minimize network latency, and use connection pooling if you are querying a standalone vector database server.

Cache frequent queries where appropriate. If your application has a predictable set of common questions, caching the retrieved chunks for those queries can eliminate the retrieval step entirely for a significant portion of your traffic.
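One lightweight way to do this is a small least-recently-used cache in front of the retrieval call. The sketch below is a minimal in-process version built on LinkedHashMap. The RetrievalCache name, the string-based chunk type, and the lowercase-and-trim key normalization are all assumptions made for illustration. A shared cache such as Redis would be the more likely choice once you run several application instances.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class RetrievalCache {
    private final Map<String, List<String>> cache;

    RetrievalCache(int capacity) {
        // accessOrder = true makes the map evict the least recently used entry
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns cached chunks for the question, calling the retriever only on a miss. */
    synchronized List<String> get(String question, Function<String, List<String>> retriever) {
        String key = question.trim().toLowerCase(Locale.ROOT);
        List<String> cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        List<String> chunks = retriever.apply(question);
        cache.put(key, chunks);
        return chunks;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Remember to clear or expire cached entries when the underlying documents change, otherwise the cache will happily serve chunks from a superseded policy.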

Index size and memory

A rough rule of thumb for HNSW: plan for roughly 4 to 8 bytes per dimension per vector, plus overhead for the graph connections. For 1 million vectors at 1,536 dimensions, that works out to roughly 6 to 12 GB of RAM for the vectors alone, before graph overhead and the rest of your working set. Plan your infrastructure accordingly before you hit those numbers in production.
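The arithmetic behind a sizing estimate like this is easy to check yourself. The sketch below is a back-of-the-envelope calculator with a hypothetical class name. It assumes about 2 × m neighbor links per vector on the base layer at 4 bytes per link, and it ignores the upper layers and any database-specific overhead, so treat the output as a lower bound.

```java
final class HnswMemoryEstimate {
    private HnswMemoryEstimate() {}

    /**
     * Rough index size in bytes: raw vector storage plus base-layer graph links.
     * Assumption: 2 * m links per vector, 4 bytes per link; upper layers ignored.
     */
    static long estimateBytes(long vectorCount, int dimensions, int bytesPerDimension, int m) {
        long vectorBytes = vectorCount * dimensions * bytesPerDimension;
        long graphBytes = vectorCount * 2L * m * 4L;
        return vectorBytes + graphBytes;
    }
}
```

For 1 million vectors at 1,536 dimensions stored as 4-byte floats with m = 16, this gives 6,272,000,000 bytes, or roughly 6.3 GB. The vectors dominate; the graph links add only about 2 percent at this dimensionality.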

Keeping your index up to date

Build an incremental indexing pipeline from day one. When a document is updated, delete the old chunks by their source file metadata and re-index the new version. When a document is deleted, clean up its vectors. Stale vectors in your index are a silent accuracy killer that is hard to diagnose.
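The delete-then-reindex pattern can be sketched without any database at all. In the example below, an in-memory list stands in for the vector store and the class name is hypothetical. Change detection uses List.hashCode() purely for brevity. A real pipeline would more likely compare a cryptographic hash of the file or its last-modified timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IncrementalIndexer {
    /** A stored chunk tagged with the file it came from. */
    record Chunk(String sourceFile, String text) {}

    private final List<Chunk> index = new ArrayList<>();            // stand-in for the vector store
    private final Map<String, Integer> contentHashes = new HashMap<>();

    /** Re-indexes the document only if its content changed. Returns true when it did. */
    boolean upsert(String sourceFile, List<String> chunkTexts) {
        int hash = chunkTexts.hashCode();
        Integer previous = contentHashes.get(sourceFile);
        if (previous != null && previous == hash) {
            return false; // unchanged: nothing to do
        }
        delete(sourceFile); // remove stale chunks before adding the new version
        for (String text : chunkTexts) {
            index.add(new Chunk(sourceFile, text));
        }
        contentHashes.put(sourceFile, hash);
        return true;
    }

    /** Removes every chunk that came from the given source file. */
    void delete(String sourceFile) {
        index.removeIf(chunk -> chunk.sourceFile().equals(sourceFile));
        contentHashes.remove(sourceFile);
    }

    int size() {
        return index.size();
    }
}
```

The important detail is the order inside upsert: old chunks are removed before new ones are added, so an updated document never leaves a mix of old and new chunks behind.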

Multi-tenancy

If your application serves multiple customers or departments, never mix their vectors in the same namespace without filtering. Use separate namespaces (Pinecone), collections (ChromaDB, Qdrant), or schema-level separation, so that one tenant’s data cannot appear in another tenant’s search results. This is both a correctness issue and a data privacy issue.

How to choose the right vector database for your project

The decision comes down to a few key questions:

  1. Are you already running PostgreSQL? If yes, start with pgvector. The operational simplicity is worth more than marginal performance differences at most scales.
  2. Do you need to move fast without managing infrastructure? Use Pinecone. The free tier gets you to production. Upgrade when you hit limits.
  3. Do you need hybrid search out of the box? Weaviate is the most mature option here, with native BM25 and vector fusion.
  4. Are you building multi-tenant with complex filtering requirements? Qdrant’s payload indexing and filtering system is the strongest in this category.
  5. Are you at a scale that has broken everything else? Look at Milvus. But exhaust simpler options first.

A practical recommendation for most CodeKerdos readers

Start with pgvector if you are on PostgreSQL, or ChromaDB in server mode for development. When you are ready for production and your team does not want to manage infrastructure, migrate to Pinecone. Add metadata filtering from the beginning. Add hybrid search only if you have confirmed that pure vector search is missing important results for your specific use case.

What is coming in Week 5?

You now understand RAG and you understand the data layer that powers it. The next natural question is: what if the model needs to do more than just retrieve and answer? What if it needs to decide which tool to use, take a sequence of actions, check its own work, and try again if something went wrong?

That is the territory of AI agents, and it is where Gen AI starts to feel genuinely powerful in ways that go beyond chatbots. In Week 5, we cover LangChain: the framework that makes it practical to build agents and multi-step AI pipelines without reinventing every piece from scratch. We will look at how chains and agents work, where LangChain fits into a Java developer’s world, and when it is worth using versus building your own orchestration.

Key Takeaway

A vector database is not just a storage detail. It is the foundation your entire RAG system stands on. Choose it based on your scale, your filtering requirements, and your operational constraints. Design your metadata schema carefully from day one. And never mistake a development-grade setup for a production-ready one: those are different problems, and they need different solutions.

Follow the full series at codekerdos.in

New post every week. Practical Gen-AI content built specifically for Java and Spring Boot developers. Join our WhatsApp community for early access, Q&A sessions, and hands-on coding exercises.
