Multimodal (Vision and Audio)

Multimodal inference requires a model and a matching multimodal projector (mmproj) that together support vision and/or audio input.

Load model and projector

// Load the base model first, then its multimodal projector.
await engine.loadModel('/path/to/model.gguf');
await engine.loadMultimodalProjector('/path/to/mmproj.gguf');
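
If the projector does not match the loaded model, the load call or the first multimodal request will typically fail. A minimal sketch of guarding the load step, assuming both calls throw on failure:

// Minimal sketch: assumes loadModel/loadMultimodalProjector throw on failure.
try {
  await engine.loadModel('/path/to/model.gguf');
  await engine.loadMultimodalProjector('/path/to/mmproj.gguf');
} catch (e) {
  // Surface a missing or mismatched model/projector pair early.
  print('Failed to initialise multimodal inference: $e');
  rethrow;
}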

Build multimodal message

final message = LlamaChatMessage.withContent(
  role: LlamaChatRole.user,
  content: const [
    LlamaImageContent(path: '/path/to/image.jpg'),
    LlamaTextContent('Describe this image in one sentence.'),
  ],
);

// Stream the response; each chunk carries a text delta.
await for (final chunk in engine.create([message])) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    print(text);
  }
}
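
Audio input follows the same message shape on model/projector pairs that support it. The sketch below assumes a hypothetical LlamaAudioContent part analogous to LlamaImageContent; the actual content type may differ.

// Sketch only: LlamaAudioContent is assumed by analogy with LlamaImageContent
// and may not match the real content type name.
final audioMessage = LlamaChatMessage.withContent(
  role: LlamaChatRole.user,
  content: const [
    LlamaAudioContent(path: '/path/to/clip.wav'),
    LlamaTextContent('Transcribe this audio clip.'),
  ],
);

Stream the result with engine.create([audioMessage]) exactly as in the image example above.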

Capability checks

// Query whether the loaded model/projector pair supports each modality.
final supportsVision = await engine.supportsVision;
final supportsAudio = await engine.supportsAudio;
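
These flags can gate multimodal message construction at runtime; a short sketch:

// Bail out (or fall back to text-only prompting) when vision is unavailable.
if (!await engine.supportsVision) {
  throw StateError('Loaded model/projector pair does not support vision input.');
}
// Safe to include LlamaImageContent from here on.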

Web notes

  • On web, the engine runs through the bridge runtime rather than native code paths.
  • Multimodal projector loading on web is URL-based (see the sketch below).
  • Local file path media inputs are native-first; web flows supply media as browser file bytes or URLs instead.
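
The following sketch points the same load calls at hosted assets instead of local file paths. It assumes loadModel and loadMultimodalProjector accept URLs under the web bridge runtime; the actual web entry point may differ.

// Sketch only: assumes the web bridge runtime accepts URLs where native accepts file paths.
await engine.loadModel('https://example.com/models/model.gguf');
await engine.loadMultimodalProjector('https://example.com/models/mmproj.gguf');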