Street Learner

Author

Last Updated: a year ago

Retrieval Augmented Generation (RAG): A Deep, End-to-End Guide with LangChain

Introduction: Why RAG Exists and Why You Need It

Large Language Models (LLMs) like GPT-4 are powerful, but they suffer from three fundamental limitations:

They do not know your private or latest data – an LLM cannot answer questions about your PDFs, internal documents, or databases unless that information is explicitly provided at runtime.
They hallucinate – when an LLM is unsure, it may confidently generate incorrect information.
They lack traceability – answers are not grounded in verifiable sources.

Retrieval-Augmented Generation (RAG) is the architectural pattern designed to address these problems.

At a high level, RAG combines:

Retrieval: Finding relevant information from your own data
Generation: Using an LLM to generate answers strictly based on that retrieved information

Instead of asking the LLM to "know everything," RAG looks things up first and hands the results to the LLM, which then answers from them.

This blog explains RAG from the ground up, connects every concept logically, and demonstrates each step using real LangChain code.

What Is RAG? Conceptual Overview

RAG is not a single function or library call. It is a pipeline made of three mandatory stages:

Indexing – Preparing your data so it can be searched efficiently
Retrieval – Selecting the most relevant pieces of that data for a query
Generation – Producing an answer using only the retrieved context

If any of these stages is weak or missing, the entire system fails.

RAG Stage 1: Indexing (Preparing Knowledge for Retrieval)

Indexing is the most critical and most misunderstood part of RAG. This stage determines what the model can possibly know.

Indexing itself is composed of five sub-steps:

Loading data
Cleaning and normalizing data
Splitting data into chunks
Embedding the chunks
Storing the embeddings in a vector database

Let’s walk through each one carefully.

1. Loading Data: Where Knowledge Comes From

LLMs cannot directly read files. We must explicitly load content and convert it into text.

LangChain provides document loaders for common formats like PDF and DOCX.

from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader

loader_pdf = PyPDFLoader("Introduction_to_Data_and_Data_Science.pdf")
pages_pdf = loader_pdf.load()

loader_docx = Docx2txtLoader("Introduction_to_Data_and_Data_Science.docx")
pages_docx = loader_docx.load()

Each loaded item becomes a Document object containing:

page_content: the extracted text
metadata: information like page number, source file, or headings

This metadata becomes extremely important later for traceability and citations.
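To make the role of metadata concrete, here is a minimal sketch that uses a plain dataclass as a stand-in for the Document shape. The `SimpleDocument` class, the `cite` helper, and the sample values are all made up for illustration and are not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(doc: SimpleDocument) -> str:
    """Build a human-readable source reference from the metadata."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


doc = SimpleDocument(
    page_content="Data science combines statistics and programming.",
    metadata={"source": "example.pdf", "page": 3},  # illustrative values
)
print(cite(doc))
```

Because the metadata travels with the text through splitting and retrieval, a citation like this can still be produced at answer time.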

2. Cleaning and Normalizing Text

Raw documents often contain:

Extra whitespace
Line breaks
Page headers and footers

These artifacts reduce embedding quality. Before splitting, we normalize the text.

import copy

pages_clean = copy.deepcopy(pages_pdf)
for page in pages_clean:
    page.page_content = ' '.join(page.page_content.split())

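Collapsing whitespace keeps formatting noise such as stray line breaks out of the embeddings. Note that it does not remove page headers and footers, which need a separate filtering step.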
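Repeated headers and footers can be handled in several ways. One possible approach, sketched below in plain Python, is to drop any line that shows up on most pages before collapsing whitespace. The function name `strip_repeated_lines`, the `min_fraction` threshold, and the sample pages are assumptions for illustration. It has to run on the raw page text, since it relies on the line breaks that normalization removes.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines repeated on most pages, then collapse whitespace on each page."""
    line_counts = Counter()
    for text in pages:
        # A set counts each distinct line once per page
        line_counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    repeated = set()
    if len(pages) >= 3:  # with fewer pages, "repeated" is not meaningful
        threshold = min_fraction * len(pages)
        repeated = {ln for ln, n in line_counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [ln for ln in text.splitlines() if ln.strip() not in repeated]
        cleaned.append(" ".join(" ".join(kept).split()))
    return cleaned


pages = [
    "Intro to Data Science\nData is   everywhere.\nPage footer",
    "Intro to Data Science\nAnalysis looks\nat the past.\nPage footer",
    "Intro to Data Science\nAnalytics looks ahead.\nPage footer",
]
print(strip_repeated_lines(pages))
```

A frequency threshold like this is a heuristic: it will also remove genuine content that happens to repeat on most pages, so it is worth checking the output on a sample of your own documents.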

3. Text Splitting: Why Chunking Is Mandatory

LLMs and embedding models have context limits. You cannot embed or retrieve an entire book as one piece.

Chunking solves three problems:

Keeps embeddings semantically focused
Improves retrieval accuracy
Prevents token overflow during generation

Character-Based Splitting

from langchain_text_splitters.character import CharacterTextSplitter

char_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=500,
    chunk_overlap=50
)

chunks = char_splitter.split_documents(pages_docx)

Each chunk:

Is small enough to embed efficiently
Overlaps slightly to avoid losing context at boundaries
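To see what overlap does at chunk boundaries, here is a simplified sliding-window splitter in plain Python. It is not how CharacterTextSplitter works internally (that class splits on the separator first and then merges pieces); the `sliding_chunks` function is a hypothetical helper that only illustrates the chunk_size and chunk_overlap idea.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.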

Structure-Aware Splitting (Markdown Headers)

For structured documents, splitting by headers preserves semantic hierarchy.

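from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Course Title"), ("##", "Lecture Title")]
)

# split_text takes a raw string, so pass the page text rather than the Document.
# This assumes the loaded text actually contains Markdown-style headers.
md_chunks = md_splitter.split_text(pages_docx[0].page_content)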

This ensures chunks remain tied to logical sections, not arbitrary lengths.
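The idea behind header-based splitting can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the `split_by_headers` function and the sample text are hypothetical, and edge cases such as headers inside code blocks are ignored.

```python
def split_by_headers(markdown, headers):
    """Split on the given header markers, tagging each section with its enclosing headers."""
    # Longest markers first so "##" is tested before "#"
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    sections, current_meta, buffer = [], {}, []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            sections.append({"content": content, "metadata": dict(current_meta)})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in ordered:
            if stripped.startswith(marker + " "):
                flush()
                # A new header invalidates any deeper-level headers seen earlier
                for deeper_marker, deeper_name in headers:
                    if len(deeper_marker) > len(marker):
                        current_meta.pop(deeper_name, None)
                current_meta[name] = stripped[len(marker):].strip()
                break
        else:
            buffer.append(line)  # ordinary text line
    flush()
    return sections


sample = """# Data Science 101
## Lecture 1
Data is everywhere.
## Lecture 2
Analytics looks ahead."""

sections = split_by_headers(sample, [("#", "Course Title"), ("##", "Lecture Title")])
for section in sections:
    print(section)
```

Every section carries the titles of the headers above it as metadata, which is what later allows an answer to cite the lecture it came from.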

4. Embeddings: Converting Text into Vectors

Embeddings translate human language into numerical vectors such that semantic similarity becomes mathematical distance.

from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
vector = embedding_model.embed_query(chunks[0].page_content)

Key properties:

Similar meanings → closer vectors
Different meanings → distant vectors

This is the foundation of semantic search.
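A common way to measure how close two embeddings are is cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins; real embeddings come from a model and have hundreds or thousands of dimensions, and the numbers here are invented purely to illustrate the comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
spreadsheet = [0.0, 0.2, 0.9]

print("dog vs puppy:      ", round(cosine_similarity(dog, puppy), 3))
print("dog vs spreadsheet:", round(cosine_similarity(dog, spreadsheet), 3))
```

The first score is much higher than the second, which is the numerical form of "similar meanings, closer vectors".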

5. Vector Stores: Persistent Semantic Memory

Embeddings only become useful for retrieval once they are stored and indexed for search.

A vector database allows:

Fast similarity search
Persistence across sessions
Metadata filtering

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./intro-to-ds-lectures"
)

You can also add, update, or delete documents dynamically.

from langchain_core.documents import Document

new_doc = Document(
    page_content="Analysis is retrospective, analytics is predictive.",
    metadata={"Lecture Title": "Analysis vs Analytics"}
)
vectorstore.add_documents([new_doc])
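To illustrate what a vector store does conceptually, here is a toy in-memory version in plain Python. The `ToyVectorStore` class, its method names, and the two-dimensional vectors are hypothetical; real stores such as Chroma add persistence and faster indexes, and their APIs differ from this sketch.

```python
import math


class ToyVectorStore:
    """In-memory illustration only: brute-force cosine search with a metadata filter."""

    def __init__(self):
        self._rows = {}  # doc_id -> (vector, text, metadata)

    def add(self, doc_id, vector, text, metadata=None):
        self._rows[doc_id] = (vector, text, metadata or {})

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def search(self, query_vector, k=2, where=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        candidates = [
            (cosine(query_vector, vec), text)
            for vec, text, meta in self._rows.values()
            if where is None or all(meta.get(key) == val for key, val in where.items())
        ]
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add("a", [1.0, 0.0], "Python and R are common tools.", {"Lecture Title": "Tools"})
store.add("b", [0.9, 0.1], "SQL is used for querying.", {"Lecture Title": "Tools"})
store.add("c", [0.0, 1.0], "Analysis is retrospective.", {"Lecture Title": "Analysis vs Analytics"})

print(store.search([1.0, 0.0], k=2))
print(store.search([1.0, 0.0], k=2, where={"Lecture Title": "Analysis vs Analytics"}))
```

The second call shows metadata filtering: only chunks from the named lecture are considered, regardless of how similar the others are.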

At this point, indexing is complete.

RAG Stage 2: Retrieval (Finding the Right Knowledge)

Retrieval determines what context the LLM is allowed to see.

Similarity Search

docs = vectorstore.similarity_search(
    "What tools do data scientists use?",
    k=2
)

This returns the two chunks (k=2) whose embeddings are closest to the embedding of the query.

Max Marginal Relevance (MMR)

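MMR balances relevance and diversity: each new chunk is chosen for its similarity to the query and penalized for its similarity to chunks already selected. The lambda_mult parameter sets the trade-off, from 0 (maximum diversity) to 1 (pure relevance).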

docs = vectorstore.max_marginal_relevance_search(
    "What tools do data scientists use?",
    k=2,
    lambda_mult=0.5
)

This reduces redundant chunks and improves coverage.
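The selection rule behind MMR can be sketched in plain Python as a greedy loop. This is a simplified illustration with toy two-dimensional vectors, not LangChain's code; the `mmr` and `cosine` helpers and the sample vectors are assumptions for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
docs = [
    [0.98, 0.05],  # most relevant
    [1.0, 0.0],    # near-duplicate of the first
    [0.5, 0.9],    # less relevant, but different
]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, docs, k=2, lambda_mult=0.5))  # relevance and diversity
```

With lambda_mult=1.0 the two near-duplicates are returned; with 0.5 the second pick switches to the document that adds something new.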

Retriever Abstraction

retriever = vectorstore.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 3}
)
query = "What tools do data scientists use?"
retrieved_docs = retriever.invoke(query)

Retrievers make retrieval reusable and composable inside chains.

RAG Stage 3: Generation (Answering with Grounded Context)

Generation combines retrieved documents with a carefully designed prompt.

from langchain_core.prompts import PromptTemplate

TEMPLATE_RAG = '''
Answer the question using ONLY the context below.

Question:
{question}

Context:
{context}

Cite the lecture titles at the end.
'''

prompt_rag = PromptTemplate.from_template(TEMPLATE_RAG)
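Since the prompt asks the model to cite lecture titles, the titles have to be visible in the context. The plain-Python sketch below shows one way retrieved chunks could be rendered into the context slot, using `str.format` in place of a prompt template. The `format_context` helper and the sample chunks are hypothetical.

```python
TEMPLATE = (
    "Answer the question using ONLY the context below.\n\n"
    "Question:\n{question}\n\n"
    "Context:\n{context}\n\n"
    "Cite the lecture titles at the end."
)


def format_context(chunks):
    """Join (title, text) pairs into one string, keeping each title next to its text."""
    return "\n\n".join(f"[{title}]\n{text}" for title, text in chunks)


# Illustrative retrieved chunks, not real retrieval output
retrieved = [
    ("Tools", "Data scientists commonly use Python, R and SQL."),
    ("Analysis vs Analytics", "Analysis is retrospective, analytics is predictive."),
]

prompt_text = TEMPLATE.format(
    question="What tools do data scientists use?",
    context=format_context(retrieved),
)
print(prompt_text)
```

Printing the final prompt like this is a quick way to check exactly what the model will see before any API call is made.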

Building the RAG Chain

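from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Chat model used in the chain below; the model name is an assumption, substitute your own
chat = ChatOpenAI(model="gpt-4", temperature=0)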

rag_chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt_rag
    | chat
    | StrOutputParser()
)

Executing the RAG Pipeline

response = rag_chain.invoke("What software do data scientists use?")
print(response)

The chain now:

Retrieves relevant chunks
Injects them into the prompt
Generates an answer grounded in your data
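The three steps above can be mimicked without any external service, which helps when reasoning about how data flows through the chain. In this plain-Python sketch, word overlap stands in for vector search and `fake_llm` stands in for the chat model; every name and the tiny corpus are made up for illustration.

```python
import re

CORPUS = [
    "Data scientists commonly work with Python, R and SQL.",
    "Analysis is retrospective, analytics is predictive.",
    "Business intelligence reports on past performance.",
]


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, k=1):
    """Rank corpus entries by words shared with the question (crude stand-in for vector search)."""
    q = tokenize(question)
    ranked = sorted(CORPUS, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(inputs):
    context = "\n".join(inputs["context"])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt):
    """Placeholder for a chat model call; it only reports what it was given."""
    return {"content": f"(model output for a prompt of {len(prompt)} characters)"}


def parse(message):
    return message["content"]


def rag_chain(question):
    # Fan-out: the same question feeds both the retriever and the prompt
    inputs = {"context": retrieve(question), "question": question}
    return parse(fake_llm(build_prompt(inputs)))


question = "What do data scientists work with?"
print(build_prompt({"context": retrieve(question), "question": question}))
print(rag_chain(question))
```

Each function corresponds to one link in the piped chain: the dictionary step, the prompt, the model, and the output parser.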

Why This Architecture Works

RAG succeeds because it:

Separates knowledge from reasoning
Reduces hallucinations by grounding answers in retrieved text
Scales to large private datasets
Provides traceability and trust

This is the foundation of modern AI systems used in:

Chatbots over PDFs
Internal knowledge assistants
Customer support automation
Research copilots

Final Thoughts

RAG is not optional if you are building serious LLM applications.

Understanding each component deeply (loading, splitting, embedding, storing, retrieving, and generating) is the difference between a demo and a production-grade system.

Once you master this pipeline, you can confidently build AI systems that are accurate, explainable, and scalable.