Street Learner

Author

Last Updated: a year ago

Question-Answer Models with BERT: A Complete Guide

Question Answering (QA) models have become a crucial part of AI-driven applications. They allow machines to read text and provide accurate answers to questions in natural language. Among the most widely used models for QA are BERT (Bidirectional Encoder Representations from Transformers) and its derivatives like RoBERTa and DistilBERT.

This guide covers BERT’s architecture, its embeddings, and a working QA pipeline, with detailed explanations of the code along the way.

1. BERT vs GPT: Understanding the Difference

Before diving into QA models, it’s essential to understand the difference between BERT and GPT:

BERT is a bidirectional transformer encoder. Rather than reading in a single direction, its self-attention lets every token attend to the tokens on both its left and its right at once, which gives it a deep view of context. BERT is ideal for tasks like question answering, named entity recognition, and sentence classification.
GPT is a unidirectional, autoregressive model, primarily used for text generation tasks. GPT predicts the next word in a sequence and is excellent for chatbots, story generation, and completion tasks.

Key takeaway: BERT excels in understanding context for tasks like QA, while GPT excels in generating coherent text.
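To make the bidirectional versus unidirectional distinction concrete, here is a minimal sketch of the two attention patterns. The helper name attention_mask and the three-token sentence are made up for illustration; real models build these masks internally.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "cat", "sat"]  # illustrative sentence
bidirectional = attention_mask(len(tokens), causal=False)  # BERT-style: full context
left_to_right = attention_mask(len(tokens), causal=True)   # GPT-style: past tokens only

for name, mask in [("bidirectional", bidirectional), ("causal", left_to_right)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>4}: {row}")
```

In the causal mask the first token sees only itself, while in the bidirectional mask every token sees the whole sentence.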

2. BERT Architecture and Embeddings

BERT is based on the Transformer architecture, which relies on self-attention mechanisms to model relationships between all words in a sentence simultaneously. Key components include:

Input Embeddings: The element-wise sum of token (WordPiece) embeddings, position embeddings, and segment embeddings.
Encoder Layers: Multiple transformer encoder layers (12 for BERT-base, 24 for BERT-large) process these embeddings bidirectionally.
Output Layers: Can be fine-tuned for specific tasks like QA, classification, or NER.
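As a rough sketch of the input embedding step, the snippet below sums a token vector, a position vector, and a segment vector for each position. The lookup tables, their values, and the hidden size of 4 are all invented for illustration (BERT-base uses 768 dimensions and learned tables).

```python
HIDDEN = 4  # toy size, an assumption for readability; BERT-base uses 768

# Hypothetical lookup tables with made-up values
token_emb = {"[CLS]": [0.1] * HIDDEN, "hello": [0.5] * HIDDEN, "[SEP]": [0.9] * HIDDEN}
position_emb = [[0.01 * p] * HIDDEN for p in range(8)]
segment_emb = [[0.0] * HIDDEN, [1.0] * HIDDEN]

def embed(tokens, segment_ids):
    """Sum token, position, and segment vectors element-wise for each position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        out.append([
            t + p + s
            for t, p, s in zip(token_emb[tok], position_emb[pos], segment_emb[seg])
        ])
    return out

vectors = embed(["[CLS]", "hello", "[SEP]"], [0, 0, 0])
first, second = embed(["hello", "hello"], [0, 0])
print(first != second)  # same token, different position, different input vector
```

In the real model this sum is followed by layer normalization and dropout before it reaches the first encoder layer.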

BERT embeddings capture rich contextual information. The same token can have different embeddings depending on surrounding words, which makes BERT extremely powerful for understanding natural language.
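The context dependence can be illustrated with a stripped-down self-attention step. This is a sketch under strong assumptions: the 2-d vectors are invented, and the learned query, key, and value projections of a real transformer are replaced by the identity, so it shows only the mixing mechanism.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections (a simplification)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out

# Hypothetical static (context-free) embeddings
static = {"river": [1.0, 0.0], "money": [0.0, 1.0], "bank": [0.5, 0.5]}

bank_near_river = self_attention([static["river"], static["bank"]])[1]
bank_near_money = self_attention([static["money"], static["bank"]])[1]
print(bank_near_river, bank_near_money)
```

The token "bank" starts from the same static vector in both sentences but ends up with a different output vector, because each output is a weighted mix of its neighbours.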

3. Question Answering with BERT

BERT can be fine-tuned on datasets like SQuAD (Stanford Question Answering Dataset) to create QA models. Let’s walk through an example that loads a checkpoint already fine-tuned on SQuAD:

from transformers import BertForQuestionAnswering, BertTokenizer
import torch

model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

Explanation:

BertForQuestionAnswering is the BERT model class with a span-prediction head on top; the checkpoint loaded here has already been fine-tuned for QA on SQuAD.
The SQuAD dataset provides context passages and questions with labeled answers, allowing BERT to learn start and end token positions for answers.
BertTokenizer converts text into token IDs and segment IDs (token_type_ids), which index into the model’s embedding tables.

Reasoning: Using a pre-trained QA model reduces training time and allows for accurate question answering out-of-the-box.

3.1 Embeddings and Tokenization

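question = "When was the first dvd released?"
answer_document = "The first DVD ... was released in Japan."
# Encode question and context as one sequence: [CLS] question [SEP] context [SEP]
encoding = tokenizer.encode_plus(text=question, text_pair=answer_document)
inputs = encoding['input_ids']
# Despite the name, these are segment IDs (0 or 1), not embedding vectors
sentence_embedding = encoding['token_type_ids']
# Human-readable tokens, used later to rebuild the answer text
tokens = tokenizer.convert_ids_to_tokens(inputs)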

Explanation:

encode_plus tokenizes the question and context together.
input_ids represent each token as a number the model can process.
token_type_ids distinguish question tokens (0) from context tokens (1).
Using the same tokenizer as the model is crucial because the model’s embeddings correspond exactly to these token IDs.
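To see what the tokenizer produces for a question and context pair, here is a hand-rolled sketch of the layout. The function build_pair is hypothetical and works on pre-split tokens; the real tokenizer also handles lowercasing, WordPiece splitting, and ID lookup.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a pair as [CLS] question [SEP] context [SEP] with matching segment IDs."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP]
    n_first = len(question_tokens) + 2
    # Segment 1 covers the context and the final [SEP]
    token_type_ids = [0] * n_first + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids

pair_tokens, pair_types = build_pair(["when", "?"], ["in", "1997"])
for tok, seg in zip(pair_tokens, pair_types):
    print(seg, tok)
```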

3.2 Predicting Answers

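# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
# Highest-scoring start and end positions, chosen independently of each other
start_index = int(torch.argmax(output.start_logits))
end_index = int(torch.argmax(output.end_logits))
# Join the tokens in the predicted span (empty if end_index < start_index)
answer = ' '.join(tokens[start_index:end_index+1])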

Explanation:

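BERT outputs start and end logits: one unnormalized score per token for being the start or the end of the answer. Applying softmax to them yields probabilities.
torch.argmax selects the highest-scoring start and end positions. The two are chosen independently, so the end can occasionally land before the start, in which case the extracted answer is empty.
The answer is reconstructed by joining the tokens in that range.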

Example Result:

question: "When was the first dvd released?"
answer: "march 24 , 1997"

Reasoning: By predicting start and end positions, BERT can extract answers directly from context, unlike generative models that may hallucinate answers.
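Taking the two argmax positions independently is the simplest decoding rule. A slightly more careful variant, sketched below, scores every valid span (start no later than end, bounded length) by the sum of its start and end logits. The function best_span, the max_answer_len default, and the logits are illustrative assumptions, not part of the code above.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) with the highest start+end logit and start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits where independent argmax picks start=2 but end=1 (an invalid span)
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [0.3, 4.0, 0.1, 2.5]

span = best_span(start_logits, end_logits)
start_probs = softmax(start_logits)
print(span, round(max(start_probs), 3))
```

With these numbers the joint search settles on the span from position 2 to position 3 instead of returning an empty answer.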

3.3 Handling Subword Tokens

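BERT uses WordPiece tokenization: a word that is not in the vocabulary is broken into subword pieces, and every piece after the first carries a ## prefix (for example, ##ing). The extracted answer may therefore contain such fragments, which need to be glued back together.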

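corrected_answer = ''
for word in answer.split():
    if word.startswith('##'):
        # Continuation piece: attach to the previous word without a space
        corrected_answer += word[2:]
    else:
        corrected_answer += ' ' + word
# Drop the leading space added before the first word
corrected_answer = corrected_answer.strip()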

Reasoning: Recombining subwords restores whole words, which makes the extracted answer human-readable.
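The same idea can be packaged as a small function that works on the token list directly and also tidies the space the tokenizer leaves before punctuation. This is a sketch; merge_wordpieces is a hypothetical helper name and the punctuation rule is a simple heuristic.

```python
import re

def merge_wordpieces(tokens):
    """Join WordPiece tokens into text, attaching '##' pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    text = " ".join(words)
    # Heuristic cleanup: remove the space before common punctuation marks
    return re.sub(r"\s+([,.;:!?])", r"\1", text)

print(merge_wordpieces(["em", "##bed", "##ding", "##s", "matter"]))
print(merge_wordpieces(["march", "24", ",", "1997"]))
```

If the tokenizer object is at hand, its convert_tokens_to_string method performs a similar merge of ## pieces.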

4. Visualizing Start and End Scores

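import matplotlib.pyplot as plt
import seaborn as sns

s_scores = output.start_logits.detach().numpy().flatten()
e_scores = output.end_logits.detach().numpy().flatten()

# Bar chart of start scores by token position (repeat with e_scores for end scores)
ax = sns.barplot(x=list(range(len(tokens))), y=s_scores)
ax.set_xticklabels(tokens, rotation=90)
plt.show()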

start_logits and end_logits can be visualized to understand which tokens the model considers most relevant.
Visualization helps debug and explain model predictions, especially for longer passages.
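When a chart is more than needed, a quick text summary of the highest-scoring tokens serves the same debugging purpose. The sketch below uses a hypothetical helper, top_tokens, and made-up scores; in practice the scores would be s_scores or e_scores.

```python
def top_tokens(tokens, scores, k=3):
    """Return the k (token, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative tokens and start scores (not real model output)
demo_tokens = ["[CLS]", "when", "[SEP]", "march", "24", "[SEP]"]
demo_start_scores = [-5.0, -3.2, -4.1, 6.3, 1.1, -4.0]

top_two = top_tokens(demo_tokens, demo_start_scores, k=2)
for token, score in top_two:
    print(f"{token:>8}  {score:6.2f}")
```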

5. Example: QA over Custom Context

def faq_bot(question):
    # sunset_motors_context must be defined elsewhere as a string holding the FAQ text
    context = sunset_motors_context
    input_ids = tokenizer.encode(question, context)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # Everything up to and including the first [SEP] is the question (segment 0)
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    segment_ids = [0]*(sep_idx+1) + [1]*(len(input_ids)-(sep_idx+1))
    with torch.no_grad():
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))
    answer = ' '.join(tokens[answer_start:answer_end+1])
    return answer

Example QA Results:

“Where is the dealership located?” → crestwood
“What make of cars are available?” → ford , toyota , honda , chevrolet , and bmw

Reasoning: Custom QA pipelines can be built by simply providing domain-specific context, making BERT adaptable to business FAQs.
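One practical limit when supplying your own context: BERT accepts at most 512 tokens per input, question included. A common workaround is to split a long context into overlapping windows and run the model on each. The sketch below shows only the splitting step; sliding_windows, window, and step are hypothetical names and values, not settings taken from the code above.

```python
def sliding_windows(context_tokens, window=384, step=128):
    """Split tokens into overlapping windows whose starts are `step` tokens apart."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    windows = []
    start = 0
    while True:
        windows.append((start, context_tokens[start:start + window]))
        if start + window >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows

# Toy example: 10 tokens, windows of 4 tokens, advancing 2 tokens at a time
toy_context = [f"t{i}" for i in range(10)]
chunks = sliding_windows(toy_context, window=4, step=2)
for offset, chunk in chunks:
    print(offset, chunk)
```

Each window would then be paired with the question, and the highest-scoring span across windows kept as the answer; the stored offset maps a span back to its position in the full context.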

6. RoBERTa and DistilBERT

RoBERTa (Robustly Optimized BERT Pretraining Approach) improves on BERT by training longer on more data with larger batches, using dynamic masking, and dropping the next-sentence prediction objective. It often outperforms BERT on NLP benchmarks.
DistilBERT is a lighter, faster version of BERT: according to its authors, it retains about 97% of BERT’s language-understanding performance while being 40% smaller and 60% faster. It’s ideal for production environments where speed matters.

from transformers import RobertaTokenizer, RobertaModel
from transformers import DistilBertTokenizer, DistilBertModel

Reasoning: Using these variants allows you to balance performance and computational efficiency.
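To put the size trade-off in numbers, the arithmetic below uses the commonly cited approximate parameter counts of about 110 million for BERT-base and about 66 million for DistilBERT. Both figures are rounded, so treat the result as a ballpark.

```python
# Approximate, rounded parameter counts (assumed from commonly cited figures)
bert_base_params = 110_000_000
distilbert_params = 66_000_000

reduction = 1 - distilbert_params / bert_base_params
print(f"DistilBERT has roughly {reduction:.0%} fewer parameters than BERT-base")
```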

7. Why Same Model Embeddings Matter

Token embeddings must match the model’s pre-trained vocabulary and embedding space.
Using a different tokenizer can cause misaligned IDs, leading to inaccurate predictions.
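For QA this matters doubly: the start and end logits are indexed by token position, so a predicted span can only be mapped back to text through the same tokenization that produced the input.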
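A toy example shows how mismatched vocabularies corrupt the input. Both vocabularies here are invented; real tokenizers differ in more ways (casing, subword rules, special tokens), but the failure mode is the same.

```python
# Two hypothetical vocabularies that assign different IDs to the same tokens
vocab_a = {"[UNK]": 0, "the": 1, "first": 2, "dvd": 3}
vocab_b = {"[UNK]": 0, "dvd": 1, "the": 2, "first": 3}

def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids, vocab):
    """Map IDs back to tokens using the given vocabulary."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return [inverse[i] for i in ids]

sentence = ["the", "first", "dvd"]
ids = encode(sentence, vocab_a)

print(decode(ids, vocab_a))  # round-trips correctly
print(decode(ids, vocab_b))  # what a model built on vocab_b would "see"
```

A model whose embedding table was trained with vocab_b would look up the wrong vectors for every ID produced with vocab_a.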

Conclusion

BERT and its variants have revolutionized question answering in NLP:

BERT is bidirectional and ideal for extracting answers from text.
RoBERTa improves robustness; DistilBERT improves efficiency.
QA models rely on tokenization, embeddings, and start/end logits to predict answers accurately.
Pairing a model with the tokenizer it was trained with keeps token IDs and embeddings aligned, which is a precondition for correct predictions.

By leveraging pre-trained BERT models, developers can build FAQ bots, search engines, and AI assistants that understand and answer questions in natural language with high accuracy.
