Generation and Streaming

llamadart exposes two generation styles:

  • engine.generate(prompt) for raw prompt strings.
  • engine.create(messages) for chat-template aware completions.

Low-level generation API

await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  print(token);
}
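
In a CLI context you will usually want the stream rendered as one continuous line. The minimal variation below writes each token without a trailing newline; it assumes dart:io's stdout is available (i.e. a non-web target).

import 'dart:io';

// Stream tokens to the console without newlines between them,
// so the output reads as one continuous completion.
await for (final token in engine.generate(
  'List two advantages of local LLM inference.',
  params: const GenerationParams(maxTokens: 64, temp: 0.4),
)) {
  stdout.write(token);
}
stdout.writeln();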

Chat completion API

final messages = [
  LlamaChatMessage.fromText(
    role: LlamaChatRole.user,
    text: 'Explain top-p in plain language.',
  ),
];

await for (final chunk in engine.create(
  messages,
  params: const GenerationParams(maxTokens: 128, topP: 0.95),
)) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    print(text);
  }
}

create(...) flow at a glance

  1. Build your List<LlamaChatMessage>.
  2. engine.create(...) runs template rendering/parity logic.
  3. Effective stop sequences and grammar are applied to generation params.
  4. Backend token bytes are decoded and emitted as streaming chunks.
  5. A final parse resolves tool calls and the stop reason (see the sketch below).
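
As a worked version of this flow, the sketch below streams a chat completion and accumulates the assistant text. The prompt and parameter values are illustrative, and no field names beyond choices.first.delta.content (shown earlier) are assumed for the final chunk.

// Accumulate streamed delta content into the full assistant reply.
final reply = StringBuffer();
await for (final chunk in engine.create(
  [
    LlamaChatMessage.fromText(
      role: LlamaChatRole.user,
      text: 'Name one trade-off of running models locally.',
    ),
  ],
  params: const GenerationParams(maxTokens: 128, topP: 0.95),
)) {
  final text = chunk.choices.first.delta.content;
  if (text != null) {
    reply.write(text);
  }
}
// Per step 5, the final chunk also carries the resolved tool calls and
// stop reason once generation finishes.
print(reply.toString());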

Cancellation

engine.cancelGeneration();

Cancellation takes effect immediately; the exact mechanism is backend-specific.
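
The sketch below cancels an in-flight generation from a timer. The prompt, the two-second delay, and how the cancelled stream terminates are illustrative assumptions; consult the backend you use for exact semantics.

import 'dart:async';

// Request cancellation two seconds after generation starts.
final timer = Timer(const Duration(seconds: 2), engine.cancelGeneration);
await for (final token in engine.generate(
  'Write a long essay about tokenizers.',
  params: const GenerationParams(maxTokens: 1024, temp: 0.7),
)) {
  print(token);
}
timer.cancel(); // No-op if the timer already fired.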

Tokenization helpers

final tokens = await engine.tokenize('hello world');
final text = await engine.detokenize(tokens);
final count = await engine.getTokenCount('hello world');

These helpers are useful for context budgeting and prompt diagnostics.
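
For example, a simple context-budgeting check could look like the following sketch. The 4096-token window and the 512-token reserve are illustrative assumptions, not library defaults.

const contextWindow = 4096;   // assumed model context size
const responseReserve = 512;  // tokens kept free for the completion

final prompt = 'A long prompt assembled from chat history...';
final promptTokens = await engine.getTokenCount(prompt);
if (promptTokens > contextWindow - responseReserve) {
  // Prompt is too large: trim older messages or summarise before generating.
}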

When to use which API

  • Use generate(...) when you already have a final raw prompt and do not need chat-template tooling.
  • Use create(...) for OpenAI-style message arrays, template routing, and tool-calling workflows.