Street Learner

Author

Last Updated: a year ago

Hugging Face Transformers: A Complete Guide with Examples

In recent years, Hugging Face Transformers has emerged as one of the most powerful libraries for Natural Language Processing (NLP) and AI applications. From sentiment analysis to named entity recognition and zero-shot classification, Hugging Face provides pre-trained models that make it easy to perform complex NLP tasks with minimal code.

In this guide, we’ll explain everything step by step, covering key examples, code, and the reasoning behind using Hugging Face Transformers.

1. Introduction to Hugging Face Transformers

Hugging Face Transformers is a Python library that provides pre-trained models for NLP tasks. These models can be used for text classification, sentiment analysis, question answering, translation, summarization, and much more. The library supports both PyTorch and TensorFlow, making it versatile for researchers and developers.

The major benefits of using Hugging Face include:

  • Access to state-of-the-art models without training from scratch.
  • Easy-to-use pipelines for common NLP tasks.
  • Support for tokenizers and embeddings, crucial for transforming text into model-readable formats.
  • Compatibility with PyTorch and TensorFlow.

2. Sentiment Analysis Example

Sentiment analysis is a common NLP task where the goal is to determine if a text expresses a positive, negative, or neutral sentiment.

from transformers import pipeline

sentiment_classifier = pipeline("sentiment-analysis")
result = sentiment_classifier("I'm so excited to be learning about large language models")
print(result)

Explanation:

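  • pipeline("sentiment-analysis") with no model argument loads the library’s default sentiment model (distilbert-base-uncased-finetuned-sst-2-english), a DistilBERT model fine-tuned on SST-2, a dataset of movie-review sentences. The library logs a warning recommending that you name the model explicitly in production code.
  • The pipeline converts text into token IDs, passes them through the model, and outputs a label with a confidence score. This model predicts only POSITIVE or NEGATIVE; it has no neutral class.
  • Using a pipeline simplifies the workflow; you don’t need to manually preprocess the text or handle model outputs.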

Reasoning: Pre-trained models save time and compute, and they often reach strong accuracy without any training from scratch.
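
To make the "label with a confidence score" step concrete, here is a minimal sketch of the post-processing, assuming the pipeline applies a softmax to the model’s raw scores (logits) and reports the top class. The logits and the label mapping below are made up for illustration; a real run gets them from the model and its config.

```python
import math

# Hypothetical label mapping, mirroring the two labels named above.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label=ID2LABEL):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits standing in for a real model's output.
example = postprocess([-4.0, 4.0])
print(example)
```

The score is the probability the model assigns to its chosen class, so it reflects the model’s own certainty, not a guarantee of correctness.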

3. Named Entity Recognition (NER)

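NER identifies named entities in text, such as people, organizations, and locations. Which categories are available (dates, for example) depends on the labels the model was trained on.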

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")
print(ner("Her name is Anna and she works in New York City for Morgan Stanley"))

Explanation:

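  • dslim/bert-base-NER is a BERT model fine-tuned for token classification, specifically NER, on the CoNLL-2003 dataset.
  • The model tags each token with one of four entity types: person (PER), organization (ORG), location (LOC), or miscellaneous (MISC). It does not have a date label.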

Reasoning: Using NER pipelines allows developers to extract structured information from unstructured text easily. Pre-trained models are preferred since training NER from scratch requires a large annotated dataset.
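
Because the model labels individual tokens, a multi-word entity such as "New York City" comes back as several tagged pieces. The sketch below shows one way to merge them, assuming the common B-/I- tagging scheme (B- starts an entity, I- continues it, O means no entity). The tags are written by hand for illustration and are not actual model output.

```python
def group_entities(tokens, tags):
    """Merge token-level B-/I- tags into (entity_type, text) spans."""
    spans = []  # each item is [entity_type, [words...]]
    previous_type = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            previous_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        # A B- tag, or a change of entity type, opens a new span.
        if prefix == "B" or entity_type != previous_type:
            spans.append([entity_type, [token]])
        else:
            spans[-1][1].append(token)
        previous_type = entity_type
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Anna", "works", "in", "New", "York", "City", "for", "Morgan", "Stanley"]
# Hand-written tags for illustration only.
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-ORG", "I-ORG"]
grouped = group_entities(tokens, tags)
print(grouped)
```

The pipeline can do a similar merge itself if you pass aggregation_strategy="simple" when creating it.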

4. Zero-Shot Classification

Zero-shot classification allows you to classify text without explicit training on the target labels.

from transformers import pipeline

zeroshot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
result = zeroshot_classifier(sequence_to_classify, candidate_labels)
print(result)

Explanation:

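  • facebook/bart-large-mnli is a BART model fine-tuned on MultiNLI (MNLI), a natural language inference (NLI) dataset.
  • The pipeline turns each candidate label into a hypothesis (by default "This example is {label}.") and asks the model whether the input text entails it. The entailment scores are then normalized across the labels.
  • This is called zero-shot because the model was never trained on these specific categories; it only reuses what it learned about entailment.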

Reasoning: Zero-shot classification is extremely useful when you don’t have a labeled dataset for your specific task.
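
The scoring step can be sketched without a model. This is a minimal illustration, assuming single-label mode, where the entailment logits for all candidate labels are passed through a softmax together. The logits here are invented; a real run would get one entailment logit per hypothesis from the NLI model.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Wrap each candidate label in a sentence the NLI model can judge."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidate_labels = ["travel", "cooking", "dancing"]
hypotheses = build_hypotheses(candidate_labels)
# Made-up entailment logits, one per hypothesis.
ranking = rank_labels(candidate_labels, [3.2, -1.5, -0.8])
print(hypotheses)
print(ranking)
```

Because the scores are normalized across the candidates, adding or removing a label changes every other label’s score.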

5. Pre-trained Tokenizers

Before feeding text to a model, it must be converted into token IDs.

from transformers import AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentence = "I'm so excited to be learning about large language models"
# The call returns a dict-like object holding input_ids, token_type_ids, and attention_mask
encoding = tokenizer(sentence)
print(encoding)

Explanation:

  • The tokenizer splits text into tokens and maps each one to an integer ID from the model’s vocabulary.
  • The output also includes attention_mask, which marks real tokens (1) versus padding (0), and token_type_ids, which tells the first sequence from the second when two are passed together.
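
To show what happens inside a tokenizer, here is a toy version of greedy longest-match subword splitting in the style of BERT’s WordPiece. The vocabulary, the encode helper, and the max_length default are all hypothetical and exist only for this illustration; the real tokenizer handles punctuation, casing rules, and truncation that this sketch ignores.

```python
# Tiny hypothetical vocabulary; real BERT vocabularies hold tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "learn": 4, "##ing": 5, "large": 6, "model": 7, "##s": 8}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the ## prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, max_length=8, vocab=VOCAB):
    """Add special tokens, pad to max_length (no truncation), and map to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    padding = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * padding
    tokens = tokens + ["[PAD]"] * padding
    return {"tokens": tokens,
            "input_ids": [vocab[t] for t in tokens],
            "attention_mask": attention_mask}

encoded = encode("learning large models")
print(encoded)
```

Note how "learning" becomes two pieces and how the final attention_mask entry is 0 for the padding position.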

Why the Same Model Tokenizer is Needed: Each token ID indexes a row in the model’s embedding matrix, so the IDs must come from the vocabulary the model was trained with. A different tokenizer assigns different IDs to the same text, which makes the model look up the wrong embeddings and degrades performance.

Tokenization Example with XLNet

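# Reuses AutoTokenizer and sentence from the previous example
model2 = "xlnet-base-cased"
tokenizer2 = AutoTokenizer.from_pretrained(model2)
input_ids = tokenizer2(sentence)
tokens = tokenizer2.tokenize(sentence)
print(tokens)

Explanation:

  • XLNet uses a SentencePiece tokenizer, which marks the start of each word with ▁, whereas BERT uses WordPiece, which marks word-internal pieces with ##.
  • The special tokens differ too: BERT wraps the input as [CLS] ... [SEP], while XLNet appends <sep> and <cls> at the end.
  • In both cases the tokenizer handles subword splitting, special tokens, and numerical encoding.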

Reasoning: Matching the correct tokenizer ensures the model interprets text correctly, respecting its pre-trained embeddings.

6. Hugging Face with PyTorch

You can directly use Hugging Face models with PyTorch:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentence = "I'm so excited to be learning about large language models"

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids_pt = tokenizer(sentence, return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**input_ids_pt).logits

# logits has shape (1, num_labels); take the highest-scoring class
predicted_class_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class_id])

Explanation:

  • The tokenizer returns PyTorch tensors when called with return_tensors="pt".
  • The model outputs logits for each class.
  • Using argmax, we determine the predicted class (POSITIVE or NEGATIVE).

Reasoning: Direct use with PyTorch allows custom training, fine-tuning, or integration with more complex pipelines.

7. Saving and Loading Models

# Reuses the tokenizer and model loaded in the previous section
model_directory = "my_saved_models"
tokenizer.save_pretrained(model_directory)
model.save_pretrained(model_directory)

# Load again from the local directory instead of the Hub
from transformers import AutoTokenizer, AutoModelForSequenceClassification
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)

Explanation:

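  • save_pretrained writes the model weights and configuration, and for the tokenizer its vocabulary files, to a local directory. from_pretrained accepts that directory path in place of a Hub model name.
  • Loading from disk avoids downloading again and pins the exact weights you tested with, which helps reproducibility.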

Reasoning: Saving models is critical in production environments where reliability and efficiency matter.

8. Why Use the Same Model and Tokenizer?

  • The tokenizer determines the input IDs.
  • The model’s embedding layer maps each ID to a vector it learned during pre-training.
  • A mismatched tokenizer produces IDs that point at the wrong vectors (or at IDs outside the model’s vocabulary), breaking the model’s ability to understand text.

Always use the tokenizer paired with the pre-trained model to maintain embedding consistency.
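
A toy example makes the mismatch visible. The two vocabularies and the embedding table below are invented for illustration: the "model" was trained with VOCAB_A, so row i of its embedding table belongs to VOCAB_A’s token i.

```python
# Two hypothetical vocabularies that assign different IDs to the same words.
VOCAB_A = {"[UNK]": 0, "large": 1, "language": 2, "models": 3}
VOCAB_B = {"[UNK]": 0, "models": 1, "large": 2, "cooking": 3}

# Toy embedding table aligned with VOCAB_A (one 2-d vector per ID).
EMBEDDINGS_A = [[0.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

def encode(text, vocab):
    """Map each whitespace-separated word to its ID, or to [UNK] if missing."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

def what_the_model_sees(ids, vocab):
    """Read IDs back through the vocabulary the model was trained with."""
    id_to_token = {i: token for token, i in vocab.items()}
    return [id_to_token[i] for i in ids]

text = "large language models"
matched_ids = encode(text, VOCAB_A)     # the model's own tokenizer
mismatched_ids = encode(text, VOCAB_B)  # some other tokenizer

matched_view = what_the_model_sees(matched_ids, VOCAB_A)
mismatched_view = what_the_model_sees(mismatched_ids, VOCAB_A)
mismatched_vectors = [EMBEDDINGS_A[i] for i in mismatched_ids]

print(matched_view)
print(mismatched_view)
```

With the wrong tokenizer, the model effectively reads a different sentence: the IDs are valid numbers, but they point at embeddings for other tokens.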

Conclusion

Hugging Face Transformers makes NLP tasks accessible with very little code. Key takeaways:

  • Sentiment Analysis for understanding opinions.
  • NER for extracting structured entities.
  • Zero-Shot Classification for flexible categorization.
  • Matching Tokenizers ensure text maps to the embeddings the model was trained with.
  • PyTorch Integration allows fine-tuning and customization.
  • Saving models locally avoids repeated downloads and keeps deployments reproducible.

By understanding each component of Hugging Face, you can build robust NLP pipelines for research, business, or AI-powered applications.
