Multimodal

adjective

Foundational concepts

A multimodal AI model is a generative AI model that can read and generate text and can also process or generate other types of media, such as images, audio, and video.

You might use a multimodal model to:

Analyze the contents of an uploaded photo for geolocation or other open-source intelligence (OSINT) applications.
Generate a diagram or data visualization from a prompt.
Extract text from a photo, such as a screenshot, a receipt, or a handwritten note.
Generate a video from a text prompt.
Generate an image from a text prompt.
Transcribe and summarize audio recordings.
Converse with the model using a voice chat interface.
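
For readers who want to see what "more than one type of media" can look like in practice, here is a minimal sketch in Python of a prompt that combines text with an image. The Part class and the helper functions are hypothetical names made up for this illustration; they are not any vendor's API, and real services define their own request formats.

```python
import base64
from dataclasses import dataclass


@dataclass
class Part:
    """One piece of a prompt (hypothetical structure, for illustration only)."""
    modality: str   # "text", "image", "audio", or "video"
    mime_type: str  # e.g. "text/plain" or "image/png"
    data: str       # plain text, or base64-encoded bytes for binary media


def text_part(text: str) -> Part:
    """Wrap a plain-text instruction as a prompt part."""
    return Part("text", "text/plain", text)


def media_part(modality: str, mime_type: str, raw: bytes) -> Part:
    """Wrap binary media, base64-encoded so it can travel as text."""
    return Part(modality, mime_type, base64.b64encode(raw).decode("ascii"))


def modalities(parts: list[Part]) -> set[str]:
    """Return the distinct media types present in a prompt."""
    return {part.modality for part in parts}


def is_multimodal(parts: list[Part]) -> bool:
    """A prompt is multimodal if it mixes more than one media type."""
    return len(modalities(parts)) > 1


# A receipt-reading request: one text instruction plus one image.
# The image bytes are a placeholder, not a real PNG file.
prompt = [
    text_part("Extract the total from this receipt."),
    media_part("image", "image/png", b"\x89PNG placeholder bytes"),
]
```

In this sketch, is_multimodal(prompt) is true because the prompt mixes text and an image, while a prompt made only of text parts would not count.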

Entry by Jon Keegan · Last updated: March 4, 2026

