Hugging Face Transformers: A Complete Guide with Examples
In recent years, Hugging Face Transformers has emerged as one of the most powerful libraries for Natural Language Processing (NLP) and AI applications. From sentiment analysis to named entity recognition and zero-shot classification, Hugging Face provides pre-trained models that make it easy to perform complex NLP tasks with minimal code.
In this guide, we’ll explain everything step by step, covering key examples, code, and the reasoning behind using Hugging Face Transformers.
1. Introduction to Hugging Face Transformers
Hugging Face Transformers is a Python library that provides pre-trained models for NLP tasks. These models can be used for text classification, sentiment analysis, question answering, translation, summarization, and much more. The library supports both PyTorch and TensorFlow, making it versatile for researchers and developers.
The major benefits of using Hugging Face include:
Access to state-of-the-art models without training from scratch.
Easy-to-use pipelines for common NLP tasks.
Support for tokenizers and embeddings, crucial for transforming text into model-readable formats.
Compatibility with PyTorch and TensorFlow.
2. Sentiment Analysis Example
Sentiment analysis is a common NLP task where the goal is to determine whether a text expresses a positive, negative, or neutral sentiment. The default model used in this example distinguishes only two classes, positive and negative.
from transformers import pipeline
sentiment_classifier = pipeline("sentiment-analysis")
result = sentiment_classifier("I'm so excited to be learning about large language models")
print(result)
Explanation:
pipeline("sentiment-analysis") automatically loads a default sentiment model (distilbert-base-uncased-finetuned-sst-2-english), a DistilBERT model fine-tuned on SST-2, a dataset of movie-review sentences.
The pipeline converts text into token IDs, passes them through the model, and outputs a label (POSITIVE or NEGATIVE) with a confidence score.
Using a pipeline simplifies the workflow; you don’t need to manually preprocess the text or handle model outputs.
Reasoning: Pre-trained models save time and resources, and they provide high accuracy without the need for training from scratch.
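To make the last step of the pipeline more concrete, here is a minimal sketch of how raw model scores (logits) can be turned into a label and a confidence score. It does not call the library. The logit values are made up, and the label order (index 0 = NEGATIVE, index 1 = POSITIVE) and the helper names softmax and postprocess are assumptions chosen for illustration.

```python
import math

# Hypothetical raw scores (logits) for one sentence; real values come from the model.
logits = [-4.2, 4.5]

# Label order assumed for this sketch: index 0 = NEGATIVE, index 1 = POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and return it in a pipeline-like shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


result = postprocess(logits, id2label)
print(result)
```

The score is simply the probability of the winning class, which is why it always lies between 0 and 1.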
3. Named Entity Recognition (NER)
NER identifies named entities in text, such as people, organizations, locations, and dates.
from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER")
entities = ner("Her name is Anna and she works in New York City for Morgan Stanley")
print(entities)
Explanation:
dslim/bert-base-NER is a BERT model fine-tuned for token classification, specifically NER, on the CoNLL-2003 dataset.
The model labels each token with one of four entity types: person (PER), organization (ORG), location (LOC), or miscellaneous (MISC). It does not tag dates.
Reasoning: Using NER pipelines allows developers to extract structured information from unstructured text easily. Pre-trained models are preferred since training NER from scratch requires a large annotated dataset.
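Token classification models usually tag each token with a B- (beginning) or I- (inside) prefix, so multi-word names such as "New York City" arrive as several tagged pieces. The sketch below shows one simple way to merge such tags into whole entities. The tags are written by hand for illustration and are not real model output, and group_entities is a hypothetical helper. The library's NER pipeline offers an aggregation_strategy argument (for example "simple") that performs a comparable grouping for you.

```python
def group_entities(tagged_tokens):
    """Merge (word, tag) pairs with B-/I- tags into (text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for word, tag in tagged_tokens:
        if tag.startswith("I-") and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        elif tag.startswith(("B-", "I-")):
            # A new entity starts here (a stray I- tag is treated like B-).
            flush()
            current_words, current_type = [word], tag[2:]
        else:
            # "O" means the token is outside any entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities


# Hand-written tags for the example sentence (illustrative, not model output).
tagged = [
    ("Her", "O"), ("name", "O"), ("is", "O"), ("Anna", "B-PER"),
    ("and", "O"), ("she", "O"), ("works", "O"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"), ("City", "I-LOC"),
    ("for", "O"), ("Morgan", "B-ORG"), ("Stanley", "I-ORG"),
]

grouped = group_entities(tagged)
print(grouped)
```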
4. Zero-Shot Classification
Zero-shot classification allows you to classify text without explicit training on the target labels.
from transformers import pipeline
zeroshot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
result = zeroshot_classifier(sequence_to_classify, candidate_labels)
print(result)
Explanation:
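facebook/bart-large-mnli is a BART model fine-tuned on the MultiNLI dataset for natural language inference (NLI).
Each candidate label is inserted into a hypothesis sentence (by default "This example is {}."), and the model scores how strongly the text entails that hypothesis. The entailment scores are then normalized across the candidate labels.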
This is called zero-shot because it can classify text without explicit supervised training for these categories.
Reasoning: Zero-shot classification is extremely useful when you don’t have a labeled dataset for your specific task.
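The scoring idea behind NLI-based zero-shot classification can be sketched without loading a model. Each label is turned into a hypothesis sentence, the model produces an entailment score for every text and hypothesis pair, and a softmax across the labels yields the final ranking. In the sketch below the entailment logits are invented numbers standing in for model output, and the variable names are illustrative. The template string matches the pipeline's default hypothesis template.

```python
import math

candidate_labels = ["travel", "cooking", "dancing"]

# One hypothesis sentence per candidate label.
hypothesis_template = "This example is {}."
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (a real model would compute these).
entailment_logits = [3.1, -2.0, -1.4]


def softmax(values):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = softmax(entailment_logits)

# Sort labels from most to least likely, mirroring the shape of the pipeline output.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
zero_shot_result = {
    "labels": [label for label, _ in ranked],
    "scores": [score for _, score in ranked],
}

print(hypotheses)
print(zero_shot_result)
```

Because the scores are normalized across the labels, they sum to 1 and should be read as relative preferences among the candidates you supplied, not as absolute probabilities.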
5. Pre-trained Tokenizers
Before feeding text to a model, it must be converted into token IDs.
from transformers import AutoTokenizer
model = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model)
sentence = "I'm so excited to be learning about large language models"
# The tokenizer returns a dictionary-like object with input_ids, token_type_ids, and attention_mask
encoded = tokenizer(sentence)
print(encoded)
Explanation:
The tokenizer converts text into integer IDs, which the model understands.
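Tokenizers also return an attention_mask, which marks real tokens with 1 and padding tokens with 0, and token_type_ids, which tell the model whether a token belongs to the first or the second sequence when two sequences are passed together.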
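The role of attention_mask is easiest to see when sequences of different lengths are batched together. The following sketch pads the shorter sequence and builds the matching mask by hand. The token IDs are placeholders, the padding ID of 0 is an assumption, and pad_batch is a hypothetical helper, not a library function.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # fill with padding IDs
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized sentences of different lengths (placeholder IDs).
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the real tokenizer, passing padding=True when encoding a list of sentences produces padded input_ids and the corresponding attention_mask in one call.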
Why the Same Model Tokenizer is Needed:
Token IDs are indices into the model’s embedding matrix, which was learned together with one specific vocabulary. A tokenizer with a different vocabulary assigns different IDs to the same text, so the model looks up the wrong embeddings and performance degrades.
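A toy example makes the mismatch visible. Below, two made-up vocabularies assign different IDs to the same words, while the embedding table was built for the first vocabulary only. Everything here (the vocabularies, the string stand-ins for vectors, and the encode helper) is invented for illustration and is far simpler than a real subword tokenizer.

```python
# Vocabulary the (toy) model was trained with.
vocab_a = {"[UNK]": 0, "i": 1, "love": 2, "nlp": 3}

# A different vocabulary, as another model's tokenizer might have.
vocab_b = {"[UNK]": 0, "nlp": 1, "i": 2, "hate": 3}

# Embedding table learned with vocab_a: the row index is the token ID.
# Strings stand in for the actual vectors.
embedding_table = ["vec([UNK])", "vec(i)", "vec(love)", "vec(nlp)"]


def encode(text, vocab):
    """Whitespace tokenization with an unknown-token fallback."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


text = "I love NLP"
ids_matching = encode(text, vocab_a)
ids_mismatched = encode(text, vocab_b)

# The model always looks IDs up in its own embedding table.
looked_up_matching = [embedding_table[i] for i in ids_matching]
looked_up_mismatched = [embedding_table[i] for i in ids_mismatched]

print(ids_matching, looked_up_matching)
print(ids_mismatched, looked_up_mismatched)
```

With the matching vocabulary each word retrieves its own vector. With the mismatched one, "I" retrieves the vector for "love" and "NLP" retrieves the vector for "i", so the model receives a scrambled input even though no error is raised.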