Comprehensive documentation for v6.3 (Core Engine) and v7.1 (Hardware Telemetry)
The benchmarking tool uses a flexible parameter architecture to control the workload dynamically. It evaluates execution flow based on mutually exclusive termination conditions:
(-c): The round strictly terminates when the generation reaches the specified token count.
(-t): The script switches to HTTP streaming mode, forcefully closing the connection when the absolute time limit (in seconds) is exceeded.
(-cx): When running the VRAM degradation test, the script preemptively calculates if the requested rounds multiplied by the filler text will exceed the server's final_ctx. If it does, the tool safely limits the -r parameter to prevent the llama.cpp backend from crashing out of memory.

LlamaBench does not require manual configuration of IPs or paths. It uses a prioritized heuristic discovery protocol:
127.0.0.1:[PORT]/v1/models. If the endpoint responds, Native Execution is assumed.
docker ps to parse container mappings. It identifies the exact container translating the requested external port.
~/server.log directly from the filesystem or uses docker logs [container_name] to extract native internal metrics safely.

The performance analysis is broken down into five critical metrics to evaluate different bottlenecks in the LLM pipeline:
-cx, this represents the mathematical speed of reading only the newly appended text chunk.
-cx. Represents the global processing speed returned directly by the server, indicating how fast the GPU iterates over the previously frozen KV Cache buffer.
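The -cx safety check described earlier (limiting -r so that the accumulated filler text never exceeds the server's final_ctx) can be pictured with a small sketch. This is only an illustration under stated assumptions, not LlamaBench's actual code: the function name clamp_rounds, the reserve_tokens parameter, and the idea of measuring the filler in tokens are all hypothetical.

```python
def clamp_rounds(requested_rounds: int, filler_tokens: int, final_ctx: int,
                 reserve_tokens: int = 0) -> int:
    """Return the largest round count whose cumulative filler fits in final_ctx.

    Assumption: each round appends roughly `filler_tokens` tokens, and
    `reserve_tokens` (hypothetical) leaves headroom for the generated reply.
    """
    if filler_tokens <= 0:
        raise ValueError("filler_tokens must be positive")
    usable = max(final_ctx - reserve_tokens, 0)
    max_rounds = usable // filler_tokens  # whole rounds that still fit
    return min(requested_rounds, max_rounds)


if __name__ == "__main__":
    # 20 requested rounds of ~1000 tokens against an 8192-token context
    print(clamp_rounds(20, 1000, 8192))
```

With these example numbers the request is reduced to 8 rounds, since a ninth round would push the context past 8192 tokens.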
LlamaBench operates in two distinct branches, sharing the same core memory extraction logic but differing in live hardware telemetry.
By parsing the highly verbose llama.cpp startup logs, the script maps out the exact memory allocation footprint across devices:
q8_0).

Version 7.1 spawns an asynchronous background thread that continuously polls hardware sensors using a multi-vendor fallback logic:
rocm-smi/nvidia-smi for GPUs. Records idle (watts_start) and absolute peak values.

To support long-term archival and data analysis, LlamaBench generates two heavily detailed JSON files per run, stored in the llamabench_logs directory.
A clean, lightweight summary containing hardware specs, parsed VRAM allocations, final averaged KPIs (PP, TG, TTFT), and human-readable round_summaries strings. Perfect for quick parsing by visualization tools.
A heavy archival file containing the raw, unedited dump of the llama.cpp output. Crucially, this file now also houses the test_data array, capturing the exact prompt sent and the full textual response received for every single round.
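The split between the two files can be sketched as follows. This is a minimal illustration, not the tool's real writer: the function name write_run_logs, the file-name suffixes, and the raw_server_log key are assumptions made for the example; only the llamabench_logs directory and the round_summaries and test_data fields come from the description above.

```python
import json
from pathlib import Path


def write_run_logs(log_dir, run_id, summary, raw_server_log, test_data):
    """Write a lightweight summary file and a heavy archive file for one run.

    File names and the "raw_server_log" key are hypothetical.
    """
    out = Path(log_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Lightweight file: hardware specs, KPIs, round_summaries, and so on.
    summary_path = out / f"{run_id}_summary.json"
    summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")

    # Heavy file: unedited server output plus per-round prompt/response pairs.
    archive = {"raw_server_log": raw_server_log, "test_data": test_data}
    archive_path = out / f"{run_id}_archive.json"
    archive_path.write_text(json.dumps(archive, indent=2), encoding="utf-8")

    return summary_path, archive_path
```

Keeping the raw dump out of the summary file means a visualization tool can load every run's summary without reading the much larger archives.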
During testing, you may encounter the [CACHED PROMPT] tag accompanying significantly faster PP times. The tag comes from the script's internal heuristic, which detects that the LLM engine skipped recomputing attention for the prompt because an identical prompt history was already held in VRAM.
Models based on the standard Transformer architecture (like Llama 3 or Qwen) store the history of the conversation in the form of a massive KV Cache matrix. If the beginning of the prompt matches a previous query, the server skips recomputing the matching prefix and only processes the new tokens, drastically reducing TTFT.
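The prefix reuse described above can be illustrated with a toy calculation on token ID lists. This is a conceptual sketch only, not how llama.cpp or LlamaBench implement it; both function names are hypothetical.

```python
def common_prefix_len(cached, new):
    """Count how many leading tokens of `new` match the cached sequence."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_evaluate(cached, new):
    """Tokens that still need prompt processing after reusing the prefix."""
    return len(new) - common_prefix_len(cached, new)


if __name__ == "__main__":
    cached = [1, 2, 3, 4]          # tokens already in the KV cache
    follow_up = [1, 2, 3, 4, 5, 6]  # same history plus two new tokens
    print(tokens_to_evaluate(cached, follow_up))
```

In the example only the two appended tokens need processing. A prompt that differs at the very first token gets no benefit from the cache.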
Instead of ballooning the KV Cache, Mamba-style models "compress" the context into a fixed, small-sized Recurrent State (RS). This allows them to operate up to 5x faster with long contexts without devouring VRAM. However, a specific trait of this architecture is the difficulty of resuming operation from a saved state, which often forces a rapid recalculation of the prompt from scratch.
nemotron_h_moe architecture) is not a software error, but stems from the specific handling of recurrent memory (RS).
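To make the contrast between a growing KV Cache and a fixed-size recurrent state concrete, the sketch below estimates KV Cache size using the usual formula: 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The function name is hypothetical, and the example dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit elements) are those of a Llama 3 8B-class model, used purely for illustration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_element: float = 2.0) -> int:
    """Approximate KV cache size; grows linearly with the context length."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element)


if __name__ == "__main__":
    for ctx in (8192, 16384, 32768):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"ctx={ctx}: {gib:.1f} GiB")
```

The estimate doubles each time the context doubles, whereas a recurrent state keeps the same size regardless of how much context has been processed.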
Below are live simulations of the terminal output. The first demonstrates a standard API load test. The second demonstrates the intense -cx (Context Fill) test designed to push VRAM usage to the limit.