Question-Answer Models with BERT: A Complete Guide
Question Answering (QA) models have become a crucial part of AI-driven applications. They allow machines to read text and provide accurate answers to questions in natural language. Among the most widely used models for QA are BERT (Bidirectional Encoder Representations from Transformers) and its derivatives like RoBERTa and DistilBERT.
In this guide, we’ll explore BERT’s architecture, its embeddings, and QA pipelines, with detailed explanations of the code along the way.
1. BERT vs GPT: Understanding the Difference
Before diving into QA models, it’s essential to understand the difference between BERT and GPT:
BERT is a bidirectional transformer encoder. Its self-attention lets every token attend to the tokens on both its left and its right at once, so each word is interpreted in its full context. BERT is well suited to tasks like question answering, named entity recognition, and sentence classification.
GPT is a unidirectional, autoregressive model, primarily used for text generation. It predicts the next token in a sequence from the tokens that precede it, which makes it a good fit for chatbots, story generation, and completion tasks.
Key takeaway: BERT excels in understanding context for tasks like QA, while GPT excels in generating coherent text.
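To make the difference concrete, here is a minimal sketch of the attention pattern each model family uses. The helper name attention_mask is hypothetical (it is not part of any library), and real models apply these masks inside their attention layers rather than building them like this.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# BERT-style: every token sees every other token (bidirectional)
bert_style = attention_mask(4, causal=False)
# GPT-style: each token sees only itself and the tokens before it (causal)
gpt_style = attention_mask(4, causal=True)

for name, mask in (("BERT-style", bert_style), ("GPT-style", gpt_style)):
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

The lower-triangular GPT-style grid is what makes next-token prediction possible, while the full BERT-style grid is what lets a token in the middle of a passage use the words on both sides of it.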
2. BERT Architecture and Embeddings
BERT is based on the Transformer architecture, which relies on self-attention mechanisms to model relationships between all words in a sentence simultaneously. Key components include:
Input Embeddings: Combines token embeddings, position embeddings, and segment embeddings.
Encoder Layers: Multiple transformer encoder layers (12 for BERT-base, 24 for BERT-large) process these embeddings bidirectionally.
Output Layers: Can be fine-tuned for specific tasks like QA, classification, or NER.
BERT embeddings capture rich contextual information. The same token can have different embeddings depending on surrounding words, which makes BERT extremely powerful for understanding natural language.
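As a rough illustration of the input embedding step, the sketch below sums a token, a position, and a segment vector for each input token. The table sizes, the random values, and the embed helper are all assumptions made for the example; real BERT uses learned tables (768 dimensions for BERT-base) and applies layer normalization and dropout after the sum.

```python
import random

random.seed(0)
HIDDEN = 4  # toy size for readability; BERT-base uses 768

def make_table(rows):
    """Build a toy embedding table filled with random values."""
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]

token_table = make_table(10)    # toy vocabulary of 10 token IDs
position_table = make_table(8)  # supports sequences of up to 8 tokens
segment_table = make_table(2)   # segment 0 and segment 1

def embed(input_ids, token_type_ids):
    """Sum token, position, and segment embeddings for each input token."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vectors.append([
            t + p + s
            for t, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])
        ])
    return vectors

# Token ID 5 appears twice, at different positions and in different segments
vectors = embed([1, 5, 2, 5], [0, 0, 1, 1])
print(vectors[1])
print(vectors[3])
```

Even before the encoder layers run, the two occurrences of token 5 start from different vectors because their position and segment differ. The context-dependent embeddings described above come from the encoder layers that process these inputs.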
3. Question Answering with BERT
BERT can be fine-tuned on datasets like SQuAD (Stanford Question Answering Dataset) to create QA models. Let’s walk through an example:
from transformers import BertForQuestionAnswering, BertTokenizer
import torch
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
Explanation:
BertForQuestionAnswering is the BERT architecture with a span-prediction head on top; the checkpoint loaded here has already been fine-tuned on SQuAD.
The SQuAD dataset provides context passages and questions with labeled answers, allowing BERT to learn start and end token positions for answers.
BertTokenizer converts text into token IDs and segment IDs (token_type_ids), which index into the model’s embedding tables.
Reasoning: Using a pre-trained QA model reduces training time and allows for accurate question answering out-of-the-box.
3.1 Embeddings and Tokenization
question = "When was the first dvd released?"
answer_document = "The first DVD ... was released in Japan."
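# Encode the question and context as one pair: [CLS] question [SEP] context [SEP]
encoding = tokenizer.encode_plus(text=question, text_pair=answer_document)
inputs = encoding['input_ids']  # token IDs
sentence_embedding = encoding['token_type_ids']  # segment IDs: 0 = question, 1 = context
tokens = tokenizer.convert_ids_to_tokens(inputs)  # readable tokens, used later to rebuild the answer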
Explanation:
encode_plus tokenizes the question and context together.
input_ids represent each token as a number the model can process.
token_type_ids distinguish question tokens (0) from context tokens (1).
Using the same tokenizer as the model is crucial because the model’s embeddings correspond exactly to these token IDs.
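The layout that the tokenizer produces for a question/context pair can be sketched in plain Python. The build_pair helper below is hypothetical and skips the actual WordPiece splitting and ID lookup; it only shows where the special tokens go and how the segment IDs are assigned.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT expects it."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids

pair_tokens, pair_type_ids = build_pair(["when", "?"], ["in", "1997"])
for token, segment in zip(pair_tokens, pair_type_ids):
    print(f"{token:>6} -> segment {segment}")
```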
BERT outputs start and end logits: one unnormalized score per token for being the start or the end of the answer. A softmax turns these scores into probabilities.
torch.argmax selects the highest-scoring start and end positions.
The answer is reconstructed by combining the tokens in that range.
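The selection steps above can be sketched without loading the model. The tokens and logit values below are made up for illustration (they are not real model output), and the argmax helper is a plain-Python stand-in for torch.argmax.

```python
def argmax(scores):
    """Index of the highest score (plain-Python stand-in for torch.argmax)."""
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up tokens and logits, one score per token, for illustration only
tokens = ["[CLS]", "when", "was", "the", "first", "dvd", "released", "?", "[SEP]",
          "released", "on", "march", "24", ",", "1997", "[SEP]"]
start_logits = [-5.0, -6.1, -6.3, -6.0, -5.8, -5.5, -6.2, -6.4, -5.0,
                -2.0, -1.5, 7.2, 1.1, -3.0, 0.4, -5.0]
end_logits = [-5.0, -6.0, -6.2, -6.1, -5.9, -5.2, -6.0, -6.3, -5.0,
              -3.1, -3.4, 0.9, 1.8, -2.5, 7.6, -5.0]

start = argmax(start_logits)
end = argmax(end_logits)

# The two positions are picked independently, so guard against end < start
answer = " ".join(tokens[start:end + 1]) if end >= start else ""
print(answer)
```

Because the start and end are chosen independently, the end can occasionally land before the start. Returning an empty answer in that case is one simple choice; scoring start/end pairs jointly is another.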
Example Result:
question: "When was the first dvd released?"
answer: "march 24 , 1997"
Reasoning: By predicting start and end positions, BERT can extract answers directly from context, unlike generative models that may hallucinate answers.
3.3 Handling Subword Tokens
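BERT uses WordPiece tokenization, which splits words it does not know as a whole into subword pieces. Continuation pieces carry a ## prefix, as in ##ing.
# `answer` is the space-joined token string for the predicted span
corrected_answer = ''
for word in answer.split():
    if word.startswith('##'):
        # Continuation piece: glue it onto the previous word
        corrected_answer += word[2:]
    else:
        corrected_answer += ' ' + word
# Drop the leading space added before the first word
corrected_answer = corrected_answer.strip()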
Reasoning: Recombining subwords ensures the answer is human-readable and grammatically correct.
4. Visualizing Start and End Scores
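import matplotlib.pyplot as plt
import seaborn as sns
# `output` is the result of the model's forward pass on the encoded question and context
s_scores = output.start_logits.detach().numpy().flatten()
e_scores = output.end_logits.detach().numpy().flatten()
# Plot the start score at each token position (the end scores can be plotted the same way)
sns.barplot(x=list(range(len(s_scores))), y=s_scores)
plt.show()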
start_logits and end_logits can be visualized to understand which tokens the model considers most relevant.
Visualization helps debug and explain model predictions, especially for longer passages.
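Raw logits can be hard to read on a chart because they are unbounded. One option, sketched below under the assumption that you want a 0-to-1 scale, is to pass them through a softmax first. The tokens and logit values are made up for illustration, and the text bars are only a stand-in for a real plot.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up tokens and start logits, for illustration only
demo_tokens = ["released", "on", "march", "24"]
demo_start_logits = [-2.0, -1.5, 7.2, 1.1]

start_probs = softmax(demo_start_logits)
for token, prob in zip(demo_tokens, start_probs):
    bar = "#" * round(prob * 40)  # crude text bar proportional to the probability
    print(f"{token:>10} {prob:.3f} {bar}")
```

The ranking of tokens is the same before and after the softmax, so this changes only how the scores are displayed, not which token is selected.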
Example Result:
question: "What make of cars are available?"
answer: "ford , toyota , honda , chevrolet , and bmw"
Reasoning: Custom QA pipelines can be built by simply providing domain-specific context, making BERT adaptable to business FAQs.
6. RoBERTa and DistilBERT
RoBERTa (Robustly Optimized BERT Pretraining Approach) improves on BERT by training on more data, for longer, with dynamic masking. It often outperforms BERT on NLP benchmarks.
DistilBERT is a lighter, faster version of BERT. Its authors report that it retains about 97% of BERT’s language understanding performance while being 40% smaller and 60% faster. It’s ideal for production environments where speed matters.
from transformers import RobertaTokenizer, RobertaModel
from transformers import DistilBertTokenizer, DistilBertModel
Reasoning: Using these variants allows you to balance performance and computational efficiency.
7. Why Same Model Embeddings Matter
Token embeddings must match the model’s pre-trained vocabulary and embedding space.
Using a different tokenizer can cause misaligned IDs, leading to inaccurate predictions.
For QA, embeddings are crucial because start and end logits depend on token positions.
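A toy example shows what goes wrong when the tokenizer and model disagree. The two vocabularies and the encode and decode helpers below are hypothetical and far smaller than a real WordPiece vocabulary, but the failure mode is the same.

```python
# Two toy vocabularies that assign different IDs to the same words
vocab_a = {"[UNK]": 0, "when": 1, "was": 2, "dvd": 3, "released": 4}
vocab_b = {"[UNK]": 0, "released": 1, "dvd": 2, "when": 3, "was": 4}

def encode(words, vocab):
    """Map words to IDs, falling back to [UNK] for unknown words."""
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

def decode(ids, vocab):
    """Map IDs back to words using the given vocabulary."""
    inverse = {i: w for w, i in vocab.items()}
    return [inverse[i] for i in ids]

words = ["when", "was", "dvd", "released"]

# Tokenize with vocabulary B, then interpret the IDs as a model trained on A would
ids_from_b = encode(words, vocab_b)
seen_by_model_a = decode(ids_from_b, vocab_a)

print(ids_from_b)
print(seen_by_model_a)
```

The model trained on vocabulary A looks up embeddings by ID, so it effectively reads a different sentence from the one that was typed. Loading the tokenizer and the model from the same checkpoint name avoids this.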
Conclusion
BERT and its variants have revolutionized question answering in NLP:
BERT is bidirectional and ideal for extracting answers from text.
RoBERTa improves robustness; DistilBERT improves efficiency.
QA models rely on tokenization, embeddings, and start/end logits to predict answers accurately.
Using the same tokenizer and embeddings ensures correctness and consistency.
By leveraging pre-trained BERT models, developers can build FAQ bots, search engines, and AI assistants that understand and answer questions in natural language with high accuracy.