Multimodal (Vision and Audio)

Multimodal inference requires a model and a matching multimodal projector (mmproj) that together support vision and/or audio input.

Load model and projector

// Load the base model first, then its multimodal projector.
await engine.loadModel('/path/to/model.gguf');
await engine.loadMultimodalProjector('/path/to/mmproj.gguf');
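
If the projector does not match the loaded model, the load call or the first multimodal request will typically fail. A minimal sketch of guarding the load step, assuming both calls throw on failure:

// Minimal sketch: assumes loadModel/loadMultimodalProjector throw on failure.
try {
  await engine.loadModel('/path/to/model.gguf');
  await engine.loadMultimodalProjector('/path/to/mmproj.gguf');
} catch (e) {
  // Surface a missing or mismatched model/projector pair early.
  print('Failed to initialise multimodal inference: $e');
  rethrow;
}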

Build multimodal message

final message = LlamaChatMessage.withContent(
  role: LlamaChatRole.user,
  content: const [
    LlamaImageContent(path: '/path/to/image.jpg'),
    LlamaTextContent('Describe this image in one sentence.'),
  ],
);

// Stream the response; each chunk carries a text delta.
await for (final chunk in engine.create([message])) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    print(text);
  }
}
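
Audio input follows the same message shape on model/projector pairs that support it. The sketch below assumes a hypothetical LlamaAudioContent part analogous to LlamaImageContent; the actual content type may differ.

// Sketch only: LlamaAudioContent is assumed by analogy with LlamaImageContent
// and may not match the real content type name.
final audioMessage = LlamaChatMessage.withContent(
  role: LlamaChatRole.user,
  content: const [
    LlamaAudioContent(path: '/path/to/clip.wav'),
    LlamaTextContent('Transcribe this audio clip.'),
  ],
);

Stream the result with engine.create([audioMessage]) exactly as in the image example above.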

Capability checks

// Query whether the loaded model/projector pair supports each modality.
final supportsVision = await engine.supportsVision;
final supportsAudio = await engine.supportsAudio;
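
These flags can gate multimodal message construction at runtime; a short sketch:

// Bail out (or fall back to text-only prompting) when vision is unavailable.
if (!await engine.supportsVision) {
  throw StateError('Loaded model/projector pair does not support vision input.');
}
// Safe to include LlamaImageContent from here on.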

Web notes

  • On web, the engine runs through the bridge runtime rather than native code paths.
  • Multimodal projector loading on web is URL-based (see the sketch below).
  • Local file path media inputs are native-first; web flows supply media as browser file bytes or URLs instead.
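
The following sketch points the same load calls at hosted assets instead of local file paths. It assumes loadModel and loadMultimodalProjector accept URLs under the web bridge runtime; the actual web entry point may differ.

// Sketch only: assumes the web bridge runtime accepts URLs where native accepts file paths.
await engine.loadModel('https://example.com/models/model.gguf');
await engine.loadMultimodalProjector('https://example.com/models/mmproj.gguf');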