Multimodal

adjective
Foundational concepts

A multimodal AI model is a type of generative AI that can not only read and generate text, but also process and generate other types of media, such as images, audio, and video.

You might use a multimodal model to:

  • Analyze the contents of an uploaded photo for geolocation or other open-source intelligence (OSINT) applications.
  • Generate a diagram or data visualization from a prompt.
  • Extract text from a photo, such as a screenshot, a receipt, or a handwritten note.
  • Generate a video from a text prompt.
  • Generate an image from a text prompt.
  • Transcribe and summarize audio recordings.
  • Converse with the model using a voice chat interface.
Entry by Jon Keegan · Last updated: March 4, 2026