Runtime Parameters

Two parameter objects control most runtime behavior:

ModelParams, set once at model load time.
GenerationParams, passed with each generation call.

ModelParams essentials

await engine.loadModel(
  '/path/to/model.gguf',
  modelParams: const ModelParams(
    contextSize: 4096,
    gpuLayers: ModelParams.maxGpuLayers,
    preferredBackend: GpuBackend.vulkan,
    numberOfThreads: 0,
    numberOfThreadsBatch: 0,
  ),
);

Important fields:

contextSize: total context window, in tokens.
gpuLayers: number of model layers offloaded to the GPU.
preferredBackend: GPU backend preference (auto, vulkan, metal, etc.).
chatTemplate: optional override for the chat template.
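
Because contextSize is the total window, it helps to check how much room is left for output before calling generate. The sketch below is plain Dart and does not call the library: generationBudget is a hypothetical helper, and it assumes the prompt and the generated tokens share one context window, as is typical for GGUF runtimes.

```dart
import 'dart:math' as math;

/// Hypothetical helper (not part of the library): returns how many tokens
/// can still be generated once the prompt occupies part of the context.
/// Assumes prompt and output share a single window of `contextSize` tokens.
int generationBudget({
  required int contextSize,
  required int promptTokens,
  required int maxTokens,
}) {
  final remaining = contextSize - promptTokens;
  if (remaining <= 0) return 0; // The prompt alone fills the window.
  return math.min(remaining, maxTokens);
}

void main() {
  // A long prompt leaves less room than the requested cap.
  print(generationBudget(
      contextSize: 4096, promptTokens: 3800, maxTokens: 512)); // 296
  // A short prompt leaves the cap as the binding limit.
  print(generationBudget(
      contextSize: 4096, promptTokens: 1000, maxTokens: 512)); // 512
}
```

If the budget comes out smaller than the output you need, either raise contextSize at load time or shorten the prompt.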

For runtime LoRA control (setLora, removeLora, clearLoras), see LoRA Adapters.

GenerationParams essentials

const params = GenerationParams(
  maxTokens: 512,
  temp: 0.7,
  topK: 40,
  topP: 0.9,
  minP: 0.0,
  penalty: 1.1,
  stopSequences: ['</s>'],
);

Important fields:

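maxTokens: maximum number of tokens to generate per call.
temp: sampling temperature; higher values produce more random output.
topK, topP, minP: filters that narrow the set of candidate tokens before sampling.
penalty: repetition penalty.
seed: fixes the random seed so a run can be replayed deterministically.
grammar: GBNF grammar that constrains decoding.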
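
To make the three filtering controls concrete, here is a toy illustration in plain Dart. It is a simplified sketch of the general top-k, top-p, and min-p techniques over a small probability table, not the library's sampler: filterCandidates is a hypothetical function, and real samplers work on logits and may apply the stages in a different order.

```dart
/// Toy filter over a token -> probability table (hypothetical helper).
/// Applies top-k, then top-p, then min-p, and renormalizes the survivors.
Map<String, double> filterCandidates(
  Map<String, double> probs, {
  int topK = 40,
  double topP = 0.9,
  double minP = 0.0,
}) {
  if (probs.isEmpty) return {};

  // Sort candidates from most to least likely.
  final sorted = probs.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));

  // top-k: keep only the k most likely tokens (k <= 0 disables the filter).
  var kept = topK > 0 ? sorted.take(topK).toList() : sorted;

  // top-p: keep the shortest prefix whose cumulative probability reaches topP.
  final nucleus = <MapEntry<String, double>>[];
  var cumulative = 0.0;
  for (final entry in kept) {
    nucleus.add(entry);
    cumulative += entry.value;
    if (cumulative >= topP) break;
  }
  kept = nucleus;

  // min-p: drop tokens whose probability is below minP times the top token's.
  final threshold = minP * kept.first.value;
  kept = kept.where((e) => e.value >= threshold).toList();

  // Renormalize so the surviving probabilities sum to 1.
  final total = kept.fold<double>(0.0, (sum, e) => sum + e.value);
  return {for (final e in kept) e.key: e.value / total};
}

void main() {
  const probs = {'the': 0.5, 'a': 0.3, 'an': 0.15, 'zebra': 0.05};
  // topP: 0.9 cuts the unlikely tail after the first three tokens.
  print(filterCandidates(probs, topK: 40, topP: 0.9).keys.toList());
  // prints [the, a, an]
}
```

Lowering topK or topP, or raising minP, shrinks the candidate set; temp then controls how evenly the sampler chooses among whatever survives.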

Practical tuning defaults

Deterministic extraction: a low temp (0.1-0.3) plus explicit stopSequences.
General chat: temp around 0.6-0.9, topP around 0.9-0.95.
Tool calling: keep temp low and consistent across calls, and set maxTokens high enough that the call payload is not truncated.
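
These starting points could be captured as named presets. The fragment below follows the same style as the snippets above and assumes the library is already imported. It also assumes that fields left out of a GenerationParams constructor fall back to their defaults, which this page does not confirm. The preset names and the exact values are illustrative picks from the ranges above, not recommendations from the library.

```dart
// Illustrative presets; names and values are assumptions, tune per model.

// Extraction: low temperature and an explicit stop sequence.
const extractionParams = GenerationParams(
  maxTokens: 256,
  temp: 0.2,
  stopSequences: ['</s>'],
);

// General chat: mid-range temperature with nucleus filtering.
const chatParams = GenerationParams(
  maxTokens: 512,
  temp: 0.7,
  topP: 0.9,
);

// Tool calling: low temperature and a larger cap so the payload fits.
const toolCallParams = GenerationParams(
  maxTokens: 1024,
  temp: 0.3,
);
```

Keeping presets as constants makes it easier to compare outputs across runs, since each call site uses the same settings.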