By the end of this course, you will be able to:
• Explain how CLIP aligns images and text in a shared embedding space, use VLMs to perform visual question answering, image captioning, and document understanding, and navigate the Hub for multimodal models.
• Build a pipeline that transcribes audio with Whisper and generates images with Diffusers, and describe how LoRA fine-tuning and multimodal RAG extend VLM capabilities.
• Build an agentic workflow using smolagents with VLM support and MCP tool integration to automate multi-step tasks requiring vision and reasoning.
• Apply ShieldGemma 2 to filter inputs and outputs of a VLM pipeline, test against adversarial inputs, and document failure modes for responsible deployment.

AI that can only read text is already behind. This intermediate course assumes you’re comfortable with the Hugging Face Transformers library and basic Gradio development. It opens with a practical challenge: 2,000 products with photos but no descriptions, and a stack of invoice PDFs that need structured data extraction. You’ll learn how CLIP aligns images and text in a shared embedding space, then use modern vision-language models (VLMs) to caption products, answer questions about charts, and pull fields from invoices.

Then go wider: transcribe customer calls with Whisper, generate images from text briefs with Diffusers, and learn when to fine-tune a model versus when to give it better context through retrieval. Build agent workflows that can see screenshots, reason about what’s on screen, and connect to external tools through the Model Context Protocol (MCP) to act on what they find.

The course closes with a deployment readiness review: your CTO wants to launch the AI pipeline next week, and you need to decide whether it’s safe to ship, with safety filtering, adversarial testing, and documented failure modes backing your recommendation.













