Introduction to Multimodal AI with Hugging Face

Instructor: John Whitworth

4 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

6 hours to complete

Flexible schedule

Learn at your own pace

What you'll learn

Use vision-language models for image understanding and document extraction.
Build audio transcription, image generation, and agentic VLM/MCP workflows.
Apply multimodal safety filtering for responsible AI deployment.

Details to know

Shareable certificate

Add to your LinkedIn profile

There are 4 modules in this course

By the end of this course, you will be able to:

• Explain how CLIP aligns image and text in a shared embedding space, use VLMs to perform visual question answering, image captioning, and document understanding, and navigate the Hub for multimodal models.
• Build a pipeline that transcribes audio with Whisper and generates images with Diffusers, and describe how LoRA fine-tuning and multimodal RAG extend VLM capabilities.
• Build an agentic workflow using smolagents with VLM support and MCP tool integration to automate multi-step tasks requiring vision and reasoning.
• Apply ShieldGemma 2 to filter inputs and outputs of a VLM pipeline, test against adversarial inputs, and document failure modes for responsible deployment.

AI that can only read text is already behind. This intermediate course assumes you’re comfortable with the Hugging Face Transformers library and basic Gradio development. It opens with a practical challenge: 2,000 products with photos but no descriptions, and a stack of invoice PDFs that need structured data extraction.

You’ll learn how CLIP aligns images and text in a shared space, then use modern vision-language models to caption products, answer questions about charts, and pull fields from invoices. Go wider: transcribe customer calls with Whisper, generate images from text briefs with Diffusers, and learn when to fine-tune a model versus when to give it better context through retrieval. Build agent workflows that can see screenshots, reason about what’s on screen, and connect to external tools through the Model Context Protocol (MCP) to act on what they find.

The course closes with a deployment readiness review: your CTO wants to launch the AI pipeline next week, and you need to decide whether it’s safe to ship, with safety filtering, adversarial testing, and documented failure modes backing your recommendation.

Most AI models see one thing at a time — text or images, never both. Vision-language models change that, and the key insight starts with CLIP: images and text can live in the same embedding space. This module builds your multimodal mental model from CLIP to modern VLMs, then puts them to work on real tasks: visual question answering, image captioning, and document AI.
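The shared-embedding idea can be illustrated with a small sketch. This is a toy under stated assumptions, not CLIP itself: the three-dimensional vectors are hand-made stand-ins for what an image encoder and a text encoder would produce, and the function names (cosine_similarity, rank_captions) are hypothetical. It shows only the matching step, ranking candidate captions by cosine similarity to an image vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Return (caption, score) pairs sorted from best to worst match."""
    scores = {
        caption: cosine_similarity(image_vec, vec)
        for caption, vec in caption_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Assumption: made-up embeddings standing in for real encoder outputs.
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a red running shoe": [0.8, 0.2, 0.1],
    "a stainless steel kettle": [0.1, 0.9, 0.3],
    "an invoice document": [0.0, 0.2, 0.9],
}

ranking = rank_captions(image_vec, caption_vecs)
best_caption = ranking[0][0]
```

With real encoders the vectors would have hundreds of dimensions, but the comparison works the same way: the caption whose vector points in the direction closest to the image vector is treated as the best match.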

What's included

4 videos, 1 reading, 1 assignment, 1 ungraded lab

4 videos (total 26 minutes)

Welcome: From Text-Only to Multimodal AI (4 minutes)
How CLIP Aligns Images and Text in a Shared Space (7 minutes)
Visual Question Answering and Image Captioning with VLMs (8 minutes)
Document AI: OCR, Layout Parsing, and Structured Extraction (8 minutes)

1 reading (total 4 minutes)

Multimodal Models and Hub Navigation Reference (4 minutes)

1 assignment (total 30 minutes)

Practice Assignment: Multimodal Foundations and VLMs (30 minutes)

1 ungraded lab (total 18 minutes)

Caption Products and Extract Invoice Data for BrightCart (18 minutes)

Multimodal AI isn’t limited to vision: audio transcription and image generation are equally practical capabilities that Hugging Face makes accessible through Whisper and Diffusers. This module covers both, then introduces the strategic decision every practitioner faces: when to fine-tune a model with LoRA versus when to use retrieval-augmented generation to give the model better context.
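One reason LoRA is often weighed against retrieval is its low training cost, which a short calculation makes concrete. The sketch below assumes the standard LoRA formulation, where a frozen weight matrix of shape d_out x d_in is adapted by two trainable low-rank matrices of shapes d_out x r and r x d_in. The layer size and rank are illustrative values, not figures from the course.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Assumption: an illustrative 4096 x 4096 projection layer with rank 8.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)  # 16,777,216
lora = lora_params(d_out, d_in, rank)     # 65,536
ratio = lora / full                       # 1/256 of the full count
```

For this single layer the adapter trains under half a percent of the parameters that full fine-tuning would touch. Retrieval, by contrast, changes no weights at all and instead adds context at inference time.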

What's included

3 videos, 1 reading, 1 assignment, 1 ungraded lab

3 videos (total 20 minutes)

Transcribing Audio with Whisper (7 minutes)
Generating Images from Text with Diffusers (7 minutes)
When to Fine-Tune vs. When to Retrieve: LoRA and Multimodal RAG (7 minutes)

1 reading (total 4 minutes)

Audio, Diffusers, and Adaptation Strategies Reference (4 minutes)

1 assignment (total 30 minutes)

Practice Assignment: Audio, Generation, and Adaptation Strategies (30 minutes)

1 ungraded lab (total 20 minutes)

Transcribe Calls and Generate Visual Summaries for BrightCart (20 minutes)

Running a single model is useful. An agent goes further: it is a system in which a model can see, reason, pick tools, act, and iterate. This module teaches you to build agentic workflows with Hugging Face smolagents, connect agents to external tools via the Model Context Protocol (MCP), and give agents vision capabilities so they can reason over screenshots and visual inputs.
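The reason, act, iterate cycle can be sketched without any framework. The code below is a toy under stated assumptions and does not use the smolagents or MCP APIs: scripted_policy is a hypothetical stand-in for the model's decision step, and the two tools are trivial placeholder functions. It shows only the control loop, in which the policy picks a tool, the loop runs it, and the observation is fed back until the policy returns a final answer.

```python
def count_words(text):
    """Toy tool: number of whitespace-separated words."""
    return len(text.split())


def to_upper(text):
    """Toy tool: uppercase the text."""
    return text.upper()


# Assumption: a plain dict stands in for a real tool registry.
TOOLS = {"count_words": count_words, "to_upper": to_upper}


def scripted_policy(task, observations):
    """Hypothetical stand-in for a model choosing the next action."""
    if not observations:
        return {"tool": "count_words", "input": task}
    return {"final_answer": f"The task has {observations[-1]} words."}


def run_agent(task, policy, tools, max_steps=5):
    """Loop: ask the policy for an action, run the tool, record the result."""
    observations = []
    for _ in range(max_steps):
        action = policy(task, observations)
        if "final_answer" in action:
            return action["final_answer"]
        tool = tools[action["tool"]]
        observations.append(tool(action["input"]))
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent("describe the product photo", scripted_policy, TOOLS)
```

The max_steps cap is a design choice worth keeping in any real agent, since a model-driven policy may otherwise keep calling tools without ever producing a final answer.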

What's included

3 videos, 1 reading, 1 assignment, 1 ungraded lab

3 videos (total 27 minutes)

Building Your First Agent with smolagents (9 minutes)
Connecting Agents to External Tools via MCP (9 minutes)
Vision-Powered Agents: Screenshot, Reason, Act, Iterate (9 minutes)

1 reading (total 3 minutes)

Smolagents, MCP, and Agent Design Patterns Reference (3 minutes)

1 assignment (total 30 minutes)

Practice Assignment: Agents, MCP, and Tool Use (30 minutes)

1 ungraded lab (total 18 minutes)

Build an Agent That Automates BrightCart’s Catalog Workflow (18 minutes)

A multimodal system that works in a notebook can still fail catastrophically in production — generating harmful images, misreading sensitive documents, or amplifying biases across modalities. This module teaches you to wrap VLM pipelines with safety filtering, test against adversarial inputs, and document failure modes before anyone else finds them.
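Wrapping a pipeline with safety filtering usually means checking both what goes in and what comes out. The sketch below is a toy under stated assumptions: is_unsafe is a keyword check standing in for a real safety classifier such as ShieldGemma 2, fake_model is a placeholder for the VLM call, and every name here is hypothetical. It shows only the wrapper pattern.

```python
# Assumption: a tiny blocklist stands in for a learned safety classifier.
BLOCKED_TERMS = {"weapon", "explicit"}


def is_unsafe(text):
    """Toy classifier: flags text containing any blocked term."""
    words = set(text.lower().split())
    return bool(words & BLOCKED_TERMS)


def fake_model(prompt):
    """Placeholder for the real model call."""
    return f"Caption for: {prompt}"


def safe_generate(prompt, model=fake_model, classifier=is_unsafe):
    """Run the model only if the input passes, and release only safe output."""
    if classifier(prompt):
        return {"status": "blocked_input", "output": None}
    output = model(prompt)
    if classifier(output):
        return {"status": "blocked_output", "output": None}
    return {"status": "ok", "output": output}


result = safe_generate("a blue ceramic mug")
```

Returning a status alongside the output, rather than raising or silently dropping the request, makes it easier to log which filter fired, which is useful when documenting failure modes from adversarial tests.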

What's included

4 videos, 2 readings, 2 assignments, 1 ungraded lab

4 videos (total 24 minutes)

Multimodal Safety Risks: What Can Go Wrong and Why (6 minutes)
Filtering with ShieldGemma 2: Input and Output Safety (8 minutes)
Testing Against Adversarial Inputs and Documenting Failure Modes (8 minutes)
What You Can See, Build, and Ship Safely (2 minutes)

2 readings (total 8 minutes)

Multimodal Safety and Responsible Deployment Reference (4 minutes)
Applying Your Multimodal AI and Agent Skills (4 minutes)

2 assignments (total 60 minutes)

Final Assessment: Introduction to Multimodal AI with Hugging Face (30 minutes)
Practice Assignment: Responsible Deployment (30 minutes)

1 ungraded lab (total 18 minutes)

Wrap BrightCart’s VLM Pipeline with Safety Filtering (18 minutes)

Instructor

John Whitworth

Hugging Face

30 courses, 3,545 learners

Offered by

Hugging Face
