Performance Tuning

Performance depends on model size, quantization, backend availability, and context/generation settings; tune these against your target hardware rather than relying on generic defaults.

Model load tuning (ModelParams)

const modelParams = ModelParams(
  contextSize: 4096,                    // context window in tokens
  gpuLayers: ModelParams.maxGpuLayers,  // offload as many layers as possible
  preferredBackend: GpuBackend.vulkan,
  numberOfThreads: 0,                   // 0 usually defers to auto-detection
  numberOfThreadsBatch: 0,              // same, for batch processing
);

Guidelines:

  • Start with the default gpuLayers and lower it only if you see instability or out-of-memory errors; a conservative fallback is sketched after this list.
  • Keep contextSize only as large as your use case needs; KV-cache memory grows with context length.
  • Choose a preferredBackend that matches your target device and runtime.
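
Putting these together, the sketch below picks conservative load settings. The tunedModelParams helper and its gpuUnstable flag are hypothetical, introduced here only for illustration; the field names follow the ModelParams example above.

import 'dart:io';

// Hypothetical helper: conservative load settings for the current machine.
ModelParams tunedModelParams({required bool gpuUnstable}) {
  // Logical core count; a sane upper bound for explicit thread settings.
  final cores = Platform.numberOfProcessors;
  return ModelParams(
    contextSize: 4096,
    // Drop back to CPU-only inference if GPU offload has proven unstable.
    gpuLayers: gpuUnstable ? 0 : ModelParams.maxGpuLayers,
    preferredBackend: GpuBackend.vulkan,
    numberOfThreads: cores,
    numberOfThreadsBatch: cores,
  );
}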

Generation tuning (GenerationParams)

const generationParams = GenerationParams(
  maxTokens: 256, // hard cap on generated tokens
  temp: 0.7,      // sampling temperature; lower is more deterministic
  topK: 40,       // keep only the 40 most likely tokens
  topP: 0.9,      // nucleus sampling cutoff
  minP: 0.0,      // minimum-probability filter (0.0 disables it)
  penalty: 1.1,   // repetition penalty; values above 1.0 discourage repeats
);

Guidelines:

  • Lower maxTokens on latency-sensitive paths; generation length dominates end-to-end time.
  • Lower temp for deterministic or extraction-style tasks.
  • Adjust topP and topK gradually, one at a time, so you can attribute any change in output quality; two example presets follow this list.
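
To make these guidelines concrete, here are two illustrative presets. The names extractionParams and creativeParams are hypothetical; the fields follow the GenerationParams example above.

// Near-deterministic settings for extraction-style tasks.
const extractionParams = GenerationParams(
  maxTokens: 128, // short cap keeps latency low
  temp: 0.1,      // close to greedy decoding
  topK: 40,
  topP: 0.9,
  minP: 0.0,
  penalty: 1.1,
);

// Wider sampling for open-ended generation.
const creativeParams = GenerationParams(
  maxTokens: 512,
  temp: 0.9,
  topK: 40,
  topP: 0.95,
  minP: 0.0,
  penalty: 1.1,
);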

Practical diagnostics

  • Measure token throughput with representative prompts; a small measurement sketch closes this section.
  • Validate memory behavior with your real context sizes rather than toy prompts.
  • Check the runtime backend and VRAM info where available:
final backendName = await engine.getBackendName();
final vram = await engine.getVramInfo();
print('$backendName total=${vram.total} free=${vram.free}');
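
A minimal throughput measurement, assuming the engine exposes generation as a token stream. engine.generate(prompt) is an assumed streaming call; substitute your binding's actual API.

// Counts streamed tokens and divides by wall-clock time.
// NOTE: engine.generate is an assumption, not a confirmed API.
Future<double> measureTokensPerSecond(dynamic engine, String prompt) async {
  final stopwatch = Stopwatch()..start();
  var tokens = 0;
  await for (final _ in engine.generate(prompt)) {
    tokens++;
  }
  stopwatch.stop();
  // Tokens per second over the whole run, including prompt evaluation.
  return tokens * 1e6 / stopwatch.elapsedMicroseconds;
}

Run it with the same prompts you expect in production; throughput on a trivial prompt can be misleading.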