Runtime Parameters
Runtime behavior is primarily controlled by:
- ModelParams at model load time.
- GenerationParams per generation call.
ModelParams essentials
await engine.loadModel(
  '/path/to/model.gguf',
  modelParams: const ModelParams(
    contextSize: 4096,
    gpuLayers: ModelParams.maxGpuLayers,
    preferredBackend: GpuBackend.vulkan,
    numberOfThreads: 0,
    numberOfThreadsBatch: 0,
  ),
);
Important fields:
- contextSize: total context window.
- gpuLayers: number of layers offloaded to the GPU.
- preferredBackend: backend preference (auto, vulkan, metal, etc.).
- chatTemplate: optional chat template override.
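As a variation on the snippet above, a CPU-only load disables GPU offload by setting gpuLayers to 0. This is a minimal sketch: the thread counts are illustrative values, and the exact meaning of 0 versus an explicit count is an assumption, so check the API reference for the precise semantics.

// Sketch: CPU-only load with explicit thread counts (illustrative values).
await engine.loadModel(
  '/path/to/model.gguf',
  modelParams: const ModelParams(
    contextSize: 2048,
    gpuLayers: 0,            // keep every layer on the CPU
    numberOfThreads: 8,      // worker threads for generation (assumed semantics)
    numberOfThreadsBatch: 8, // worker threads for prompt processing (assumed semantics)
  ),
);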
For runtime LoRA control (setLora, removeLora, clearLoras), see
LoRA Adapters.
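As a rough sketch only, runtime adapter control might look like the calls below; the adapter path, the scale parameter, and the signatures themselves are assumptions made for illustration, so defer to the LoRA Adapters page for the actual API.

// Rough sketch; signatures are assumed, see LoRA Adapters for the real API.
await engine.setLora('/path/to/adapter.gguf', scale: 1.0); // attach an adapter (assumed signature)
await engine.removeLora('/path/to/adapter.gguf');          // detach that adapter (assumed signature)
await engine.clearLoras();                                 // drop all active adapters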
GenerationParams essentials
const params = GenerationParams(
  maxTokens: 512,
  temp: 0.7,
  topK: 40,
  topP: 0.9,
  minP: 0.0,
  penalty: 1.1,
  stopSequences: ['</s>'],
);
Important fields:
- maxTokens: generation length cap.
- temp: sampling temperature; higher values increase randomness.
- topK, topP, minP: token filtering controls.
- penalty: repeat penalty.
- seed: deterministic replay when set.
- grammar: constrained decoding with GBNF.
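For example, seed and grammar combine naturally for a reproducible, constrained run. The sketch below is illustrative: the GBNF string is a minimal yes/no grammar written for this example, not something shipped with the library.

// Sketch: deterministic, grammar-constrained generation (illustrative values).
const yesNoGrammar = 'root ::= "yes" | "no"'; // minimal GBNF grammar
const constrainedParams = GenerationParams(
  maxTokens: 8,
  temp: 0.0,             // minimal randomness for reproducible output
  seed: 42,              // fixed seed enables deterministic replay
  grammar: yesNoGrammar, // constrain decoding to the grammar
);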
Practical tuning defaults
- Deterministic extraction: lower temp (0.1-0.3) plus explicit stop sequences.
- General chat: temp around 0.6-0.9, topP around 0.9-0.95.
- Tool calling: stable temp and sufficient maxTokens for the call payload.
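Expressed as code, those defaults might look like the presets below. The specific numbers are picked from the ranges above for illustration; tune them per model and workload.

// Sketches: one preset per scenario, using values from the ranges above.
const extractionParams = GenerationParams(
  temp: 0.2,               // low temperature for deterministic extraction
  maxTokens: 256,
  stopSequences: ['</s>'], // explicit stop so output ends cleanly
);

const chatParams = GenerationParams(
  temp: 0.7,               // general chat: 0.6-0.9
  topP: 0.9,               // 0.9-0.95
  maxTokens: 512,
);

const toolCallParams = GenerationParams(
  temp: 0.3,               // stable temperature for tool calling
  maxTokens: 1024,         // room for the full call payload
);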