Street Learner

Author

Last Updated: a year ago

Text Classification with XLNet: A Comprehensive Guide

Text classification is one of the foundational tasks in Natural Language Processing (NLP). Whether it’s detecting emotions in social media posts, categorizing customer reviews, or identifying spam emails, text classification models are indispensable. Among state-of-the-art models, XLNet has emerged as a powerful alternative to traditional BERT and GPT models.

In this blog, we will cover:

The difference between GPT, BERT, and XLNet
XLNet architecture and embeddings
Step-by-step text classification with XLNet
Detailed explanations of code examples
Why using the same model embeddings is crucial

1. GPT vs BERT vs XLNet

Understanding how these models differ is key for choosing the right model for your NLP task:

GPT (Generative Pre-trained Transformer) is unidirectional and autoregressive. It predicts the next word in a sequence and excels in text generation, chatbots, and story completion.
BERT (Bidirectional Encoder Representations from Transformers) reads text bidirectionally, understanding context from both left and right. It is ideal for classification, question answering, and named entity recognition.
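XLNet combines the strengths of GPT and BERT. It is autoregressive like GPT, but it is trained with permutation language modeling: the words stay in their original positions, while the order in which they are predicted (the factorization order) is sampled at random. Across many sampled orders, each token learns to use context from both its left and its right, without BERT's [MASK] tokens. XLNet also borrows segment-level recurrence and relative positional encoding from Transformer-XL, which helps it capture long-range dependencies in tasks like text classification.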
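
The permutation idea can be sketched in a few lines of plain Python. This is a toy illustration rather than XLNet's actual training code: the function name `prediction_contexts` and the example sentence are hypothetical. It samples one factorization order and lists which tokens are visible when each token is predicted.

```python
import random

def prediction_contexts(tokens, seed=0):
    """For one sampled factorization order, return (target, visible_context) pairs.

    The tokens keep their original positions; only the order in which
    they are predicted is shuffled.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # one random permutation of the positions

    contexts = []
    for step, target_pos in enumerate(order):
        # A target may only see tokens that come earlier in the sampled order.
        visible_positions = sorted(order[:step])
        visible = [tokens[p] for p in visible_positions]
        contexts.append((tokens[target_pos], visible))
    return order, contexts

tokens = ["the", "movie", "was", "great"]
order, contexts = prediction_contexts(tokens)
for target, visible in contexts:
    print(f"predict {target!r} given {visible}")
```

The first token in the sampled order is predicted with no context and the last one sees every other token, so averaging over many orders exposes each token to context on both sides.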

2. XLNet Architecture and Embeddings

XLNet is a Transformer-based model like BERT but introduces permutation language modeling instead of masked language modeling. Key components include:

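Input Embeddings: Token embeddings from a SentencePiece vocabulary, combined with relative positional encodings and relative segment encodings (rather than the absolute position and segment embeddings BERT uses).
Transformer Layers: A stack of self-attention layers with two-stream attention, which lets the model predict a token from its position without seeing its content, under randomly sampled factorization orders.
Sequence Summary Layer: For classification, XLNet includes a layer that summarizes the sequence into a single vector used for predictions. In the Hugging Face implementation this is, by default, the hidden state of the last token, which is where XLNet places its classification token.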

Why embeddings matter: Each token ID indexes a row of the pre-trained embedding matrix. Using the tokenizer that was trained with the model ensures every token ID points to the row the model learned for it, which is essential for accurate predictions.

3. Data Preprocessing

import pandas as pd
from cleantext import clean
import re

data_train = pd.read_csv('./emotions_data/emotion-labels-train.csv') 
data_test = pd.read_csv('./emotions_data/emotion-labels-test.csv')
data_val = pd.read_csv('./emotions_data/emotion-labels-val.csv')
data = pd.concat([data_train, data_test, data_val], ignore_index=True)

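# Strip emojis (note: clean() also lowercases and normalizes the text by default)
data['text_clean'] = data['text'].apply(lambda x: clean(x, no_emoji=True))
# Remove @mentions; the raw string avoids an invalid-escape warning for \S
data['text_clean'] = data['text_clean'].apply(lambda x: re.sub(r'@\S+', '', x))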

Explanation:

Combines training, testing, and validation datasets into one DataFrame.
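Cleans text by removing emojis and @mentions; with its default settings, clean() also lowercases the text and normalizes Unicode and whitespace.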
Text cleaning ensures the model learns from actual content, not noisy characters.
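
To see what the mention-removal step does in isolation, here is a minimal sketch using only the standard library. The helper name `strip_mentions` and the sample tweet are hypothetical, and the whitespace collapse is an extra step added here for readability, not part of the pipeline above.

```python
import re

# Same pattern as the preprocessing step: "@" followed by non-whitespace characters.
MENTION_PATTERN = re.compile(r"@\S+")

def strip_mentions(text):
    """Remove @mentions, then collapse the whitespace they leave behind."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return " ".join(without_mentions.split())

sample = "@airline my flight was delayed again, thanks @support_team !"
cleaned = strip_mentions(sample)
print(cleaned)  # my flight was delayed again, thanks !
```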

4. Balancing and Encoding Labels

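from sklearn.preprocessing import LabelEncoder

# Keep the fitted encoder so integer IDs can be mapped back to emotion names later
label_encoder = LabelEncoder()
data['label_int'] = label_encoder.fit_transform(data['label'])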

Explanation:

Converts emotion labels into numerical IDs, which are required for model training.
The number of distinct IDs defines the size of the classification head (num_labels) for multi-class classification.
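
LabelEncoder assigns IDs to the class names in sorted order, which is why the id2label mapping used later lists the emotions alphabetically. The sketch below reproduces that mapping in plain Python on a small, hypothetical list of labels and counts the examples per class, which is a quick way to check how balanced the data is.

```python
from collections import Counter

# Hypothetical sample of labels; the real values come from data['label'].
labels = ["joy", "anger", "sadness", "fear", "joy", "anger", "joy"]

# Sorted unique class names, mirroring the order LabelEncoder uses.
classes = sorted(set(labels))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

label_ints = [label2id[label] for label in labels]
class_counts = Counter(labels)

print(id2label)      # {0: 'anger', 1: 'fear', 2: 'joy', 3: 'sadness'}
print(label_ints)    # [2, 0, 3, 1, 2, 0, 2]
print(class_counts)  # examples per class
```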

5. Creating Train and Test Splits

from sklearn.model_selection import train_test_split

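# 80% train+val / 20% test, then 90% / 10% of the remainder: roughly 72% / 8% / 20% overall.
# random_state (an arbitrary seed) makes the split reproducible; stratify keeps class ratios equal.
train_split, test_split = train_test_split(data, train_size=0.8, random_state=42, stratify=data['label_int'])
train_split, val_split = train_test_split(train_split, train_size=0.9, random_state=42, stratify=train_split['label_int'])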

Reasoning:

Splits data into training (about 72%), validation (about 8%), and testing (20%) sets.
Validation set allows monitoring of overfitting during training.
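
The overall proportions follow from applying the two splits in sequence. A small sketch of the arithmetic, assuming a hypothetical dataset of 1,000 rows and assuming the training share is rounded down at each step:

```python
import math

def split_sizes(n_samples, first_train_size=0.8, second_train_size=0.9):
    """Return (train, val, test) sizes for two consecutive train/test splits."""
    n_train_val = math.floor(first_train_size * n_samples)  # first split: 80%
    n_test = n_samples - n_train_val                         # remaining 20%
    n_train = math.floor(second_train_size * n_train_val)    # 90% of the 80%
    n_val = n_train_val - n_train                            # 10% of the 80%
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)  # 720 80 200
```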

6. Tokenization and Embeddings

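from datasets import Dataset, DatasetDict
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

def to_dataset(df):
    # Assumed construction: wrap a pandas split from step 5, naming the target column "label" for Trainer
    df = df[["text_clean", "label_int"]].rename(columns={"label_int": "label"})
    return Dataset.from_pandas(df, preserve_index=False)
dataset_dict = DatasetDict({"train": to_dataset(train_split), "val": to_dataset(val_split), "test": to_dataset(test_split)})

def tokenize_function(examples):
    # Tokenize the cleaned text (not the raw "text" column), fixed at 128 tokens
    return tokenizer(examples["text_clean"], padding="max_length", max_length=128, truncation=True)
tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)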

Explanation:

Converts text into token IDs, attention masks, and segment IDs.
Padding ensures sequences have the same length, allowing batch processing.
Truncation cuts any sequence longer than max_length (128 tokens here), so every example fits the fixed input size.
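
Padding and truncation can be illustrated without the tokenizer. The sketch below is a simplification: `pad_or_truncate`, the token IDs, and `pad_id=0` are hypothetical, and special tokens are ignored. It pads on the left, since that is the XLNet tokenizer's default padding side, and builds the matching attention mask.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, pad_left=True):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    token_ids = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(token_ids)     # how many padding slots are needed
    padding = [pad_id] * n_pad
    real_mask = [1] * len(token_ids)        # 1 = real token, attended to
    pad_mask = [0] * n_pad                  # 0 = padding, ignored by attention
    if pad_left:
        return padding + token_ids, pad_mask + real_mask
    return token_ids + padding, real_mask + pad_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)  # [0, 0, 11, 12, 13] [0, 0, 1, 1, 1]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```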

Why the same model's embeddings matter: XLNet's pre-trained embedding matrix is indexed by the token IDs its own tokenizer produces. A different tokenizer assigns different IDs to the same text, so tokens would be looked up in the wrong rows, causing poor performance.
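
A toy example makes the misalignment concrete. Both vocabularies and the "embedding" rows below are hypothetical stand-ins; the point is only that the same words map to different rows when the vocabulary does not match the one the model was trained with.

```python
# Toy vocabularies: the same words, but assigned different IDs (hypothetical).
model_vocab = {"<unk>": 0, "happy": 1, "sad": 2, "angry": 3}
other_vocab = {"<unk>": 0, "angry": 1, "happy": 2, "sad": 3}

# Row i stands for the vector the model learned for ID i under model_vocab.
embedding_rows = ["vec(<unk>)", "vec(happy)", "vec(sad)", "vec(angry)"]

def embed(words, vocab):
    """Map words to IDs with the given vocabulary, then look up embedding rows."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in words]
    return [embedding_rows[i] for i in ids]

words = ["happy", "sad"]
matched = embed(words, model_vocab)
mismatched = embed(words, other_vocab)

print(matched)     # ['vec(happy)', 'vec(sad)']
print(mismatched)  # ['vec(sad)', 'vec(angry)']
```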

7. Fine-Tuning XLNet for Text Classification

from transformers import XLNetForSequenceClassification, Trainer, TrainingArguments
import numpy as np
import evaluate

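# Label mapping in the alphabetical order produced by LabelEncoder
id2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'sadness'}
label2id = {label: idx for idx, label in id2label.items()}
NUM_LABELS = len(id2label)

model = XLNetForSequenceClassification.from_pretrained(
    'xlnet-base-cased',
    num_labels=NUM_LABELS,
    id2label=id2label,
    label2id=label2id
)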

metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

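# eval_strategy needs a recent transformers release; older versions name it evaluation_strategy
training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", num_train_epochs=3)

# Assumes tokenized_datasets has "train" and "val" splits
small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["val"]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)
trainer.train()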

Explanation:

XLNetForSequenceClassification fine-tunes XLNet for emotion classification.
The model outputs one logit per emotion; a softmax converts these into class probabilities, and the argmax gives the predicted class.
Trainer handles training, evaluation, and metrics.
compute_metrics calculates accuracy to monitor performance.

Reasoning: Fine-tuning adapts pre-trained embeddings to the specific dataset, improving classification accuracy.
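
The step from logits to a predicted emotion can be sketched in plain Python. The logits below are hypothetical values for a single example; the label mapping is the one passed to the model above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'sadness'}
logits = [0.2, -0.1, 1.5, 0.4]  # hypothetical model output for one text

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
predicted_label = id2label[predicted_id]

print([round(p, 3) for p in probs])
print(predicted_label)  # joy
```

Because softmax preserves the ordering of the logits, compute_metrics can take the argmax of the logits directly without computing probabilities first.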

8. Evaluating the Model

trainer.evaluate()

Explanation:

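Outputs the evaluation loss and accuracy on the dataset passed to the Trainer as eval_dataset (the validation set here).
Helps verify whether the model generalizes; for a final estimate on unseen data, evaluate on the held-out test split as well.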
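
For reference, the accuracy reported here is simply the fraction of examples whose predicted class matches the true class. A minimal sketch with hypothetical predictions and labels:

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

references = [0, 1, 2, 3, 2, 0]   # hypothetical true label IDs
predictions = [0, 1, 2, 2, 2, 1]  # hypothetical predicted label IDs

score = accuracy(predictions, references)
print(f"accuracy = {score:.3f}")  # accuracy = 0.667
```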

9. Using XLNet for Prediction

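import random
from transformers import pipeline

# Assumes the fine-tuned model is the one held by the Trainer
clf = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)
# Pick a validation example by position; .iloc avoids a KeyError on the shuffled index
rand_int = random.randrange(len(val_split))
answer = clf(val_split['text_clean'].iloc[rand_int], top_k=None)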

Explanation:

The pipeline API provides a simple interface for predictions.
Returns probabilities for each emotion class, allowing easy interpretation.

Example Prediction:

Text: "you dont have to feel grateful to be grateful..."
Output: [{'label': 'sadness', 'score': 0.37}, {'label': 'anger', 'score': 0.23}, {'label': 'fear', 'score': 0.22}, {'label': 'joy', 'score': 0.16}]

Reasoning: Probabilistic outputs allow developers to handle uncertainty in predictions.
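
One simple way to act on that uncertainty is to accept a prediction only when its top score clears a threshold. The sketch below applies this to the example output above; the function name `decide` and the threshold values are hypothetical choices, not part of the pipeline API.

```python
def decide(scores, threshold=0.5):
    """Return the top label, or None when the model is not confident enough."""
    best = max(scores, key=lambda item: item["score"])
    if best["score"] < threshold:
        return None  # too uncertain: defer to a fallback or a human reviewer
    return best["label"]

output = [
    {'label': 'sadness', 'score': 0.37},
    {'label': 'anger', 'score': 0.23},
    {'label': 'fear', 'score': 0.22},
    {'label': 'joy', 'score': 0.16},
]

print(decide(output))                 # None
print(decide(output, threshold=0.3))  # sadness
```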

10. Why XLNet Embeddings Are Crucial

XLNet embeddings are pre-trained on large corpora, capturing semantic and syntactic relationships.
Using the same tokenizer and embeddings ensures tokens match the model’s understanding, which is essential for tasks like text classification.
Fine-tuning leverages these embeddings to specialize in domain-specific sentiment or emotion detection.

Conclusion

XLNet is a strong pre-trained model for text classification. Compared to GPT and BERT:

GPT is best for text generation
BERT is ideal for bidirectional understanding
XLNet excels in permutation-based modeling, capturing long-range dependencies

By combining clean data preprocessing, tokenization, embeddings, and fine-tuning, developers can build accurate and efficient text classification systems. Using the tokenizer that matches the model keeps inputs aligned with its pre-trained knowledge, whereas a mismatched tokenizer looks up the wrong embeddings and degrades performance.

XLNet, with its robust architecture, is ideal for emotion detection, sentiment analysis, and other classification tasks in real-world NLP applications.
