Interview Prep

RAG & Vector Database Interview Questions for 2025


Why RAG Interviews Are Technical Deep-Dives

RAG is now a core production pattern, not an advanced topic. AI Engineer interviews at serious companies go deep: they expect you to know not just what RAG is, but how to tune it, debug it, evaluate it, and build it at scale. These questions reflect what is actually asked in technical interviews at AI-native startups and on the AI teams of large multinationals.

RAG Architecture Questions

Q: Walk me through the complete RAG pipeline.

Sample answer: The pipeline has two phases: offline indexing and online inference. Offline: documents are loaded from source (files, databases, URLs), split into chunks using a chunking strategy, each chunk is encoded into a dense vector embedding using an embedding model, and the vectors plus metadata are stored in a vector database. Online: the user query is encoded using the same embedding model, the vector store is queried for top-K nearest neighbours, optionally re-ranked by a cross-encoder, the top chunks are formatted into a context window, and the LLM generates an answer grounded in that context. The response is returned to the user along with optional source citations.
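The two phases above can be sketched in a few dozen lines. This is a toy, dependency-free illustration: `embed()` is a stand-in for a real embedding model (in production you would call the same sentence-transformer or API model for both indexing and queries), and a plain list stands in for the vector database.

```python
import hashlib
import math

DIM = 256

def embed(text: str) -> list[float]:
    """Toy embedding: hash each word into a fixed-size count vector, then normalise."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# --- Offline indexing: chunk -> embed -> store (list stands in for a vector DB)
chunks = [
    "HNSW builds a multi-layer graph for fast approximate search.",
    "pgvector adds vector similarity search to Postgres.",
    "RAGAS measures faithfulness and answer relevancy.",
]
index = [(embed(c), c) for c in chunks]

# --- Online inference: embed query (same model!) -> top-K -> grounded prompt
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
```

The one structural point interviewers look for: the same embedding model must be used at index time and at query time, otherwise the query and chunk vectors live in different spaces.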

Q: How do you handle documents that are larger than the context window?

Sample answer: Several strategies. First, chunking — split documents into smaller pieces so only relevant chunks are retrieved. Second, the Map-Reduce pattern — process each chunk independently and combine summaries (useful for summarisation tasks, not Q&A). Third, iterative retrieval — retrieve an initial set of chunks, generate an intermediate answer, then retrieve again based on that answer to fill gaps (used in Corrective RAG and Self-RAG). Fourth, hierarchical indexing — index document summaries separately from full content; retrieve summary first to identify relevant documents, then retrieve specific chunks from those documents.
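The Map-Reduce pattern from the answer above is easy to sketch. `call_llm()` here is a stub standing in for a real model call; the structure is the point: the map step over chunks is independent (and therefore parallelisable), and the reduce step makes one final pass over the partial outputs.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call your model provider here.
    return f"[summary of: {prompt[:40]}...]"

def map_reduce_summarise(document: str, chunk_size: int = 200) -> str:
    # Map: summarise each chunk independently (parallelisable).
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partials = [call_llm(f"Summarise:\n{c}") for c in chunks]
    # Reduce: combine the partial summaries in one final call.
    combined = "\n".join(partials)
    return call_llm(f"Combine these summaries into one:\n{combined}")
```

If the combined partials themselves exceed the context window, the reduce step is applied recursively in batches.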

Q: What is the parent document retriever pattern?

Sample answer: The parent document retriever indexes small chunks for retrieval precision (so similarity search finds the exact relevant passage) but returns the larger parent chunk to the LLM for generation context (so the model has sufficient surrounding text to understand and use the retrieved passage). It solves a classic trade-off: small chunks retrieve precisely but lack context; large chunks have context but are noisy for retrieval. In LangChain, this is implemented as a ParentDocumentRetriever.
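A minimal sketch of the idea, outside any framework: search over small child chunks, but return the larger parent chunk each hit came from. Retrieval here is naive keyword overlap purely to keep the example self-contained; a real system would use embeddings as in the pipeline question above.

```python
def split(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents: list[str], parent_size: int = 400, child_size: int = 100):
    """Index (child, parent) pairs: children are searched, parents are returned."""
    child_to_parent = []
    for doc in documents:
        for parent in split(doc, parent_size):
            for child in split(parent, child_size):
                child_to_parent.append((child, parent))
    return child_to_parent

def retrieve_parents(index, query: str, k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        index,
        key=lambda pair: len(q_words & set(pair[0].lower().split())),
        reverse=True,
    )
    parents, seen = [], set()
    for child, parent in scored:  # dedupe: several children share one parent
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
        if len(parents) == k:
            break
    return parents
```

Note the dedup step: several high-scoring children often point at the same parent, so without it you would waste context-window budget on duplicates.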

Vector Database Questions

Q: What is the difference between Pinecone, Weaviate, and pgvector? How do you choose?

Sample answer: Pinecone is a fully managed vector database — easiest to get started, scales automatically, but is a paid external service and your data is not self-hosted. Weaviate is open-source and can be self-hosted or used as a cloud service; it supports hybrid search natively and has a rich schema/filtering system. pgvector is a Postgres extension — if you are already on Postgres, it adds vector similarity search without a new service. The trade-off is that Postgres is not purpose-built for ANN search and may be slower at very large scale. For startups, pgvector (operationally simple) or Pinecone (fully managed) make sense. For enterprises with data residency requirements, self-hosted Weaviate or Qdrant. At very large scale (hundreds of millions of vectors), purpose-built systems like Pinecone or Milvus outperform pgvector.

Q: What is HNSW and why does it matter for vector search?

Sample answer: Hierarchical Navigable Small World (HNSW) is an approximate nearest neighbour algorithm that builds a multi-layer graph structure enabling very fast similarity search at the cost of some accuracy (it is approximate, not exact) and higher memory usage. Most production vector databases use HNSW because it achieves sub-millisecond search over millions of vectors — exact search would be too slow. The key parameters are ef_construction (index build quality, higher = more accurate but slower to build) and ef (search quality at query time, higher = more accurate but slower queries). Understanding this helps you tune the accuracy/speed trade-off for your specific use case.
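What ef actually controls can be shown with a toy, single-layer navigable-graph search. Real HNSW adds a hierarchy of layers and a smarter insert procedure; this sketch builds the neighbour graph by brute force (which a real index avoids) and keeps HNSW-style bidirectional links. ef bounds the candidate list during search: larger ef explores more of the graph, so recall rises and queries slow down.

```python
import heapq
import random

random.seed(0)
DIM, N, M = 8, 200, 6
points = [[random.random() for _ in range(DIM)] for _ in range(N)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Build: link each node to its M nearest neighbours, bidirectionally.
graph = {i: set() for i in range(N)}
for i in range(N):
    for j in sorted(range(N), key=lambda j: dist(points[i], points[j]))[1:M + 1]:
        graph[i].add(j)
        graph[j].add(i)

def search(query, ef: int, k: int = 1) -> list[int]:
    """Greedy best-first graph search; ef bounds the result list we maintain."""
    entry = 0
    visited = {entry}
    d0 = dist(query, points[entry])
    candidates = [(d0, entry)]      # min-heap of nodes to expand
    best = [(-d0, entry)]           # max-heap (negated) of results kept
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                   # nothing left closer than our worst result
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                nd = dist(query, points[nb])
                if len(best) < ef or nd < -best[0][0]:
                    heapq.heappush(candidates, (nd, nb))
                    heapq.heappush(best, (-nd, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return [n for _, n in sorted((-d, n) for d, n in best)][:k]
```

Comparing recall against brute-force exact search at ef=2 versus ef=64 on this toy graph makes the accuracy/speed trade-off concrete; in a real library the same knob appears as ef (query time) and ef_construction (build time).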

Evaluation Questions

Q: What RAGAS metrics do you use and what does each measure?

Sample answer: RAGAS provides four key metrics. Faithfulness measures whether the generated answer is supported by the retrieved context — it catches hallucinations where the model adds facts not present in the retrieved documents. Answer relevancy measures whether the answer actually addresses the question asked — it catches off-topic responses even when they are factually grounded. Context precision measures whether the retrieved contexts are relevant to the question — high recall with low precision means you are retrieving noise. Context recall measures whether all the information needed to answer the question was actually retrieved — low recall means retrieval is missing key passages. In production, I track faithfulness and answer relevancy as the primary quality metrics, and context precision/recall for diagnosing retrieval problems.
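RAGAS computes these metrics with an LLM judge; the crude lexical sketch below is not RAGAS, but it illustrates the shape of faithfulness: what fraction of the answer's statements are supported by the retrieved context. The 0.5 word-overlap threshold is an arbitrary illustrative choice.

```python
def toy_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of answer sentences whose content mostly appears in some context."""
    context_words = set()
    for c in contexts:
        context_words |= set(c.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for s in sentences:
        words = set(s.lower().split())
        # A sentence counts as "supported" if >= half its words occur in context.
        if words and len(words & context_words) / len(words) >= 0.5:
            supported += 1
    return supported / len(sentences)
```

The real metric decomposes the answer into claims and asks a judge model whether each claim is entailed by the context, but the aggregation (supported claims / total claims) has the same form.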

Q: How do you debug a RAG system that is giving wrong answers?

Sample answer: I diagnose by isolating the retrieval and generation stages separately. First, check retrieval: log the retrieved chunks for failing queries and manually inspect whether the right content was retrieved. If not retrieved — embedding model may not suit the domain, chunks are too large/small, or the query needs reformulation (HyDE — generate a hypothetical answer and use that for retrieval instead of the raw query). If retrieved correctly but wrong answer — the issue is in generation: context is too noisy (add re-ranking), the prompt is not instructing the model to stay within context (strengthen grounding instruction), or the answer requires combining multiple retrieved chunks (use Map-Reduce or multi-hop retrieval). Adding structured logging at every stage — retrieved chunks, similarity scores, re-ranker scores, generation input/output — is essential for systematic debugging.
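The structured logging described above can be as simple as one trace record per query. This sketch assumes caller-supplied `retrieve`, `rerank`, and `generate` functions (placeholders for your actual pipeline stages) and captures their inputs and outputs in one inspectable dict.

```python
import json
import time

def run_with_trace(query: str, retrieve, rerank, generate) -> tuple[str, dict]:
    """Run the pipeline, capturing every stage's output for offline debugging."""
    trace = {"query": query, "ts": time.time()}
    hits = retrieve(query)                        # [(chunk, similarity), ...]
    trace["retrieved"] = [{"chunk": c, "score": s} for c, s in hits]
    reranked = rerank(query, hits)                # [(chunk, rerank_score), ...]
    trace["reranked"] = [{"chunk": c, "score": s} for c, s in reranked]
    prompt = "Context:\n" + "\n".join(c for c, _ in reranked) + f"\n\nQ: {query}"
    trace["prompt"] = prompt
    answer = generate(prompt)
    trace["answer"] = answer
    print(json.dumps(trace, indent=2))            # ship to your log store instead
    return answer, trace
```

With traces like this, the diagnosis in the answer above becomes mechanical: filter failing queries, check trace["retrieved"] first, and only look at the prompt and generation if the right chunks were present.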

#rag-interview-questions #vector-db-interview #rag #vector-database
