Generation and Streaming

llamadart exposes two generation styles:

  • engine.generate(prompt) for raw prompt strings.
  • engine.create(messages) for chat-template aware completions.

Low-level generation API

await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  print(token);
}
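
In a CLI context you will usually want the stream rendered as one continuous line. The minimal variation below writes each token without a trailing newline; it assumes dart:io's stdout is available (i.e. a non-web target).

import 'dart:io';

// Stream tokens to the console without newlines between them,
// so the output reads as one continuous completion.
await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  stdout.write(token);
}
stdout.writeln();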

Chat completion API

final messages = [
  LlamaChatMessage.fromText(
    role: LlamaChatRole.user,
    text: 'Explain top-p in plain language.',
  ),
];

await for (final chunk in engine.create(
  messages,
  params: const GenerationParams(maxTokens: 128, topP: 0.95),
)) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    print(text);
  }
}

create(...) flow at a glance

  1. Build your List<LlamaChatMessage>.
  2. engine.create(...) runs template rendering/parity logic.
  3. Effective stop sequences and grammar are applied to generation params.
  4. Backend token bytes are decoded and emitted as streaming chunks.
  5. A final parse resolves tool calls and the stop reason (see the sketch below).
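
As a worked version of this flow, the sketch below streams a chat completion and accumulates the assistant text. The prompt and parameter values are illustrative, and no field names beyond choices.first.delta.content (shown earlier) are assumed for the final chunk.

// Accumulate streamed delta content into the full assistant reply.
final reply = StringBuffer();
await for (final chunk in engine.create(
  [
    LlamaChatMessage.fromText(
      role: LlamaChatRole.user,
      text: 'Name one trade-off of running models locally.',
    ),
  ],
  params: const GenerationParams(maxTokens: 128, topP: 0.95),
)) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    reply.write(text);
  }
}
// Per step 5, the final chunk also carries the resolved tool calls and
// stop reason once generation finishes.
print(reply.toString());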

Cancellation

engine.cancelGeneration();

Cancellation takes effect immediately; the exact mechanism is backend-specific.
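
The sketch below cancels an in-flight generation from a timer. The prompt, the two-second delay, and how the cancelled stream terminates are illustrative assumptions; consult the backend you use for exact semantics.

import 'dart:async';

// Request cancellation two seconds after generation starts.
final timer = Timer(const Duration(seconds: 2), engine.cancelGeneration);
await for (final token in engine.generate(
  'Write a long essay about tokenizers.',
  params: const GenerationParams(maxTokens: 1024, temp: 0.7),
)) {
  print(token);
}
timer.cancel(); // No-op if the timer already fired.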

Tokenization helpers

final tokens = await engine.tokenize('hello world');
final text = await engine.detokenize(tokens);
final count = await engine.getTokenCount('hello world');

These helpers are useful for context budgeting and prompt diagnostics.
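
For example, a simple context-budgeting check could look like the following sketch. The 4096-token window and the 512-token reserve are illustrative assumptions, not library defaults.

const contextWindow = 4096;   // assumed model context size
const responseReserve = 512;  // tokens kept free for the completion

final prompt = 'A long prompt assembled from chat history...';
final promptTokens = await engine.getTokenCount(prompt);
if (promptTokens > contextWindow - responseReserve) {
  // Prompt is too large: trim older messages or summarise before generating.
}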

When to use which API

  • Use generate(...) when you already have a final raw prompt and do not need chat-template tooling.
  • Use create(...) for OpenAI-style message arrays, template routing, and tool-calling workflows.