Multimodal
adjective
A multimodal AI model is a type of generative AI that can not only read and generate text, but also process and generate other types of media such as images, audio, and video.
You might use a multimodal model to:
- Analyze the contents of an uploaded photo for geolocation or other OSINT applications.
- Generate a diagram or data visualization from a prompt.
- Extract text from a photo, such as a screenshot, receipt, or handwritten note.
- Generate a video from a text prompt.
- Generate an image from a text prompt.
- Transcribe and summarize audio recordings.
- Converse with the model using a voice chat interface.