Hugging Face

Introduction to Multimodal AI with Hugging Face

Instructor: John Whitworth

Gain insight into a topic and learn the fundamentals.
Intermediate level

6 hours to complete
Flexible schedule
Learn at your own pace

What you'll learn

  • Use vision-language models for image understanding and document extraction.

  • Build audio transcription, image generation, and agentic VLM/MCP workflows.

  • Apply multimodal safety filtering for responsible AI deployment.

Details to know

Shareable certificate

Recently updated!

June 2026

Assessments

5 assignments

Taught in English

There are 4 modules in this course

Most AI models see one thing at a time — text or images, never both. Vision-language models change that, and the key insight starts with CLIP: images and text can live in the same embedding space. This module builds your multimodal mental model from CLIP to modern VLMs, then puts them to work on real tasks: visual question answering, image captioning, and document AI.

What's included

4 videos, 1 reading, 1 assignment, 1 ungraded lab
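
The shared embedding space idea behind CLIP, described in the module above, can be illustrated with a minimal sketch. It assumes tiny hand-made vectors in place of real model embeddings, and the names `TEXT_EMBEDDINGS` and `best_caption` are hypothetical; it is not the CLIP implementation or a Hugging Face API. The point is only that once images and text live in one space, matching an image to a caption reduces to a similarity comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional text embeddings. A real model would
# produce vectors with hundreds of dimensions.
TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a scanned invoice": [0.0, 0.1, 0.9],
}


def best_caption(image_embedding, text_embeddings):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, text_embeddings[caption]
        ),
    )


# Hypothetical image embedding that lies near the "dog" caption.
image_embedding = [0.8, 0.2, 0.1]
print(best_caption(image_embedding, TEXT_EMBEDDINGS))
```

Swapping the toy vectors for embeddings produced by an actual vision-language model would leave the comparison step unchanged.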

Multimodal AI isn’t limited to vision: audio transcription and image generation are equally practical capabilities that Hugging Face makes accessible through Whisper and Diffusers. This module covers both, then introduces the strategic decision every practitioner faces: when to fine-tune a model with LoRA versus when to use retrieval-augmented generation to give the model better context.

What's included

3 videos, 1 reading, 1 assignment, 1 ungraded lab
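
One input to the LoRA side of the fine-tuning decision mentioned above is how few parameters a low-rank adapter trains. The sketch below is plain arithmetic, assuming a single weight matrix of shape `d_out x d_in` adapted with two low-rank factors of rank `rank`. The layer size and rank are illustrative assumptions, not values taken from the course.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` update B @ A, where
    B has shape (d_out, rank) and A has shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Illustrative (assumed) sizes for one projection layer.
d_out, d_in, rank = 4096, 4096, 8

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank {rank}): {lora:,} parameters")
print(f"reduction factor: {full // lora}x")
```

Under these assumed sizes the adapter trains 256 times fewer parameters for this layer. Whether that beats supplying better context through retrieval depends on the task, which is the trade-off the module addresses.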

Running a single model is useful. Building a system where a model can see, reason, pick tools, act, and iterate means building an agent. This module teaches you to build agentic workflows with Hugging Face smolagents, connect agents to external tools via MCP (Model Context Protocol), and give agents vision capabilities so they can reason over screenshots and visual inputs.

What's included

3 videos, 1 reading, 1 assignment, 1 ungraded lab
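
The reason, pick tools, act, iterate loop from the agents module above can be sketched without any framework. This is a toy under stated assumptions: `scripted_policy` is a hypothetical stand-in for a model's decision step, and `TOOLS` and `run_agent` are illustrative names. None of it is the smolagents or MCP API.

```python
def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


# Hypothetical tool registry: tool name -> callable.
TOOLS = {"add": add, "multiply": multiply}


def scripted_policy(history):
    """Stand-in for a model. Picks the next action from past results.

    Returns (tool_name, args), or ("final", (answer,)) when done.
    The script computes (2 + 3) * 4.
    """
    if len(history) == 0:
        return "add", (2, 3)
    if len(history) == 1:
        return "multiply", (history[-1], 4)
    return "final", (history[-1],)


def run_agent(policy, tools, max_steps=5):
    """Loop: ask the policy for an action, run the tool, record the result."""
    history = []
    for _ in range(max_steps):
        action, args = policy(history)
        if action == "final":
            return args[0]
        if action not in tools:
            raise ValueError(f"unknown tool: {action}")
        history.append(tools[action](*args))
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent(scripted_policy, TOOLS))
```

The `max_steps` cap and the unknown-tool check are deliberate: an agent loop needs a stopping condition and a guard against actions outside its registry.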

A multimodal system that works in a notebook can still fail catastrophically in production — generating harmful images, misreading sensitive documents, or amplifying biases across modalities. This module teaches you to wrap VLM pipelines with safety filtering, test against adversarial inputs, and document failure modes before anyone else finds them.

What's included

4 videos, 2 readings, 2 assignments, 1 ungraded lab
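
Wrapping a pipeline with safety filtering, as the module above describes, can be pictured as a check before the model call and a check after it. The sketch below is a toy under stated assumptions: `fake_vlm` is a hypothetical stand-in for a real model, and the blocked-term list and the identifier pattern are illustrative placeholders, not a recommended policy or a library API.

```python
import re

# Illustrative (assumed) input filter: refuse prompts containing these terms.
BLOCKED_TERMS = ("weapon",)

# Illustrative (assumed) output filter: redact strings shaped like 123-45-6789.
ID_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def fake_vlm(prompt):
    """Hypothetical stand-in for a model call. Echoes the prompt as 'extracted' text."""
    return f"Extracted text: {prompt}"


def safe_pipeline(model, prompt):
    """Run `model` on `prompt` with an input check and an output redaction."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "[refused: prompt matched the input filter]"
    output = model(prompt)
    return ID_LIKE.sub("[REDACTED]", output)


print(safe_pipeline(fake_vlm, "Customer ID 123-45-6789"))
print(safe_pipeline(fake_vlm, "How do I build a weapon?"))
```

Keyword and pattern filters like these are easy to bypass, which is why the module pairs filtering with adversarial testing and documented failure modes.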

Instructor

John Whitworth
Hugging Face
30 courses, 3,545 learners

Offered by

Hugging Face
