All terms · Model Architecture

Multimodal

An AI model that can process and understand multiple types of input: text, images, video, and audio.

A multimodal model integrates several input modalities. Early models such as GPT-3 were text-only; modern models are multimodal: ChatGPT can read text and images, Claude can analyze PDFs, images, and text, and Gemini understands text, images, and video. Some emerging models handle audio as well.

Multimodal models are more versatile: you can upload a screenshot and ask "what does this button do?", paste a photo of a handwritten note and ask for a transcription, or supply an image and ask "describe what you see." Under the hood, each modality has its own encoder that converts the input into vectors the model can process; a shared representation layer then combines them.
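The encoder-plus-shared-representation idea can be sketched in a few lines of Python. This is a toy illustration only: real models use learned transformer and vision encoders, whereas here random projections stand in for trained weights, and the dimensions, function names, and "fusion by stacking" step are all illustrative assumptions, not any specific model's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64  # size of the shared representation space (illustrative)

# Hypothetical per-modality encoders. Each maps raw input into vectors
# in the same shared embedding space, so the model can process both.

def encode_text(token_ids, vocab_size=1000):
    """Toy text encoder: embedding lookup, then mean-pool to one vector."""
    table = rng.normal(size=(vocab_size, EMBED_DIM))
    return table[token_ids].mean(axis=0)

def encode_image(pixels):
    """Toy image encoder: flatten pixels and project into the shared space."""
    flat = pixels.reshape(-1)
    proj = rng.normal(size=(flat.size, EMBED_DIM)) / np.sqrt(flat.size)
    return flat @ proj

# One input per modality
text_vec = encode_text(np.array([5, 42, 7]))        # 3 tokens of "text"
image_vec = encode_image(rng.random((8, 8, 3)))     # tiny 8x8 RGB "image"

# Shared representation: both modalities now live in the same 64-dim space,
# so a downstream model can attend over them jointly (here, simple stacking).
fused = np.stack([text_vec, image_vec])
print(fused.shape)  # (2, 64)
```

The key point the sketch demonstrates is that heterogeneous inputs (token IDs, pixel arrays) end up as same-sized vectors, which is what lets a single model reason over text and images together.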

Multimodal capability has become a baseline expectation—single-modality (text-only) models are increasingly rare.

Example

GPT-4 with vision (GPT-4V) in ChatGPT is multimodal: you can send text, images, or both, and ask questions about them.