Multimodal
An AI model that can process and understand multiple types of input, such as text, images, video, and audio.
A multimodal model integrates several input modalities into a single system. Early models such as GPT-3 were text-only; modern models are multimodal: ChatGPT can read text and images, Claude can analyze PDFs, images, and text, and Gemini understands text, images, and video. Some emerging models handle audio as well.
Multimodal models are more versatile. You can upload a screenshot and ask "what does this button do?", paste a photo of a handwritten note and ask for a transcription, or supply an image and ask "describe what you see." Each modality requires its own encoder to convert the input into vectors the model can process; a shared representation layer then combines them.
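The encoder-plus-shared-layer idea above can be sketched in a few lines. This is a toy illustration only: the "encoders" below compute trivial statistics as stand-ins for real transformer and vision encoders, and the tiny vector size, function names, and averaging-based fusion are all assumptions for the sketch, not how any production model works.

```python
# Toy sketch of multimodal fusion: one encoder per modality maps raw input
# to a fixed-size vector, and a shared step combines the vectors into one
# joint representation. All numbers and names here are illustrative.

DIM = 4  # toy embedding size; real models use hundreds or thousands of dims

def encode_text(text: str) -> list[float]:
    """Toy text encoder: simple character statistics stand in for a transformer."""
    n = max(len(text), 1)
    return [len(text) / 100, text.count(" ") / n, (sum(map(ord, text)) % 97) / 97, 1.0]

def encode_image(pixels: list[list[int]]) -> list[float]:
    """Toy image encoder: brightness statistics stand in for a vision encoder."""
    flat = [p for row in pixels for p in row]
    n = max(len(flat), 1)
    return [sum(flat) / (255 * n), min(flat) / 255, max(flat) / 255, 1.0]

def fuse(vectors: list[list[float]]) -> list[float]:
    """Shared representation: combine per-modality vectors elementwise."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

text_vec = encode_text("what does this button do?")
image_vec = encode_image([[0, 128], [255, 64]])  # a 2x2 grayscale "screenshot"
joint = fuse([text_vec, image_vec])
assert len(joint) == DIM  # both modalities now live in one shared vector space
```

The key point the sketch preserves: each modality has its own encoder, but everything ends up as vectors of the same shape, so one downstream model can reason over all of them together.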
Multimodal capability has become a baseline expectation—single-modality (text-only) models are increasingly rare.
Example
GPT-4 with vision (in ChatGPT) is multimodal: you can send text, images, or both, and ask questions about them.
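In practice, "sending text and images together" usually means packing both into one request message. The sketch below shows one common shape for such a payload, a text part plus a base64-encoded image part, modeled loosely on the content-parts format used by chat-style vision APIs. The exact field names are assumptions; check your provider's API reference before relying on them.

```python
# Hedged sketch: building a mixed text + image message for a multimodal API.
# Field names ("type", "text", "image_url", ...) are assumed for illustration
# and may differ across providers; no network call is made here.
import base64
import json

def build_multimodal_message(question: str, image_bytes: bytes,
                             mime: str = "image/png") -> dict:
    """Pack a text question and raw image bytes into one user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

fake_png = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a real screenshot
msg = build_multimodal_message("what does this button do?", fake_png)
print(json.dumps(msg)[:60])  # the payload is ordinary JSON
```

Base64 encoding matters here because images are binary data and the request body is JSON, which can only carry text.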
Related terms
Embedding
A numerical representation of text (or other data) that captures meaning, enabling semantic search and comparison.
Transformer
The neural network architecture that powers modern large language models.
Large Language Model (LLM)
A neural network trained on massive amounts of text to predict and generate human language.