Generation and Streaming
llamadart exposes two generation styles:
- engine.generate(prompt) for raw prompt strings.
- engine.create(messages) for chat-template-aware completions.
Generation pipeline (visual): see the step-by-step flow under "create(...) flow at a glance" below.
Low-level generation API
await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  print(token);
}
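If you want the whole completion as a single string rather than printing tokens as they arrive, you can buffer the stream yourself. A minimal sketch using the same generate(...) call (the prompt is illustrative):

// Accumulate the streamed tokens into one response string.
final buffer = StringBuffer();
await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  buffer.write(token); // append each streamed token
}
final completion = buffer.toString();
print(completion);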
Chat completion API
final messages = [
  LlamaChatMessage.fromText(
    role: LlamaChatRole.user,
    text: 'Explain top-p in plain language.',
  ),
];

await for (final chunk in engine.create(
  messages,
  params: const GenerationParams(maxTokens: 128, topP: 0.95),
)) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    print(text);
  }
}
create(...) flow at a glance
- Build your List<LlamaChatMessage>.
- engine.create(...) runs template rendering/parity logic.
- Effective stop sequences and grammar are applied to the generation params.
- Backend token bytes are decoded and emitted as streaming chunks.
- A final parse resolves tool calls and the stop reason (see the sketch after this list).
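The sketch below accumulates the streamed deltas and reads the resolved stop reason once the stream ends. The finishReason field name on each choice is an assumption based on the OpenAI-style chunk shape used above; check the chunk type in your version of llamadart.

// Sketch only: finishReason is an assumed field name, not confirmed here.
final parts = <String>[];
String? stopReason;
await for (final chunk in engine.create(
  messages,
  params: const GenerationParams(maxTokens: 128),
)) {
  final choice = chunk.choices.first;
  final delta = choice.delta.content;
  if (delta != null) {
    parts.add(delta);
  }
  stopReason ??= choice.finishReason; // typically set only on the final chunk (assumed)
}
print('response: ${parts.join()}');
print('stop reason: $stopReason');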
Cancellation
engine.cancelGeneration();
Cancellation takes effect immediately; the exact mechanism used to stop in-flight work is backend-specific.
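For example, you might abort a long-running stream after a deadline. A minimal sketch using dart:async together with the documented cancelGeneration call (the five-second timeout and prompt are illustrative):

import 'dart:async';

// Cancel the active generation if it is still running after five seconds.
final deadline = Timer(const Duration(seconds: 5), engine.cancelGeneration);

await for (final token in engine.generate(
  'Write a detailed essay on sampling strategies.',
  params: const GenerationParams(maxTokens: 1024),
)) {
  print(token);
}

deadline.cancel(); // no longer needed once the stream has finished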
Tokenization helpers
final tokens = await engine.tokenize('hello world');
final text = await engine.detokenize(tokens);
final count = await engine.getTokenCount('hello world');
These helpers are useful for context budgeting and prompt diagnostics.
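For example, you can check a prompt against an application-level token budget before generating. In this sketch, contextBudget and buildPrompt() are hypothetical application code, not part of the llamadart API:

// Sketch: keep the prompt inside an assumed context budget before generating.
const contextBudget = 4096;   // hypothetical limit chosen by the app
const responseBudget = 256;   // tokens reserved for the reply
final prompt = buildPrompt(); // hypothetical helper that assembles your prompt
final promptTokens = await engine.getTokenCount(prompt);

if (promptTokens + responseBudget > contextBudget) {
  print('Prompt uses $promptTokens tokens; trim history before generating.');
} else {
  await for (final token in engine.generate(
    prompt,
    params: const GenerationParams(maxTokens: responseBudget),
  )) {
    print(token);
  }
}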
When to use which API
- Use generate(...) when you already have a final raw prompt and do not need chat-template tooling.
- Use create(...) for OpenAI-style message arrays, template routing, and tool-calling workflows.