Modern AI systems require precise tools capable of monitoring critical performance parameters in real time.
LlamaBench.py acts as a diagnostic station and dynamometer for your llama.cpp engine.
Most of the data reported by LlamaBench.py is parsed from the server logs generated by the llama.cpp engine.
Changes, updates, or format modifications made by the llama.cpp developers may cause parsing errors or missing data.
This script is optimized and tested specifically against Server Build: b8262-23fbfcb1a.
Mandatory Flags: Remember that for full compatibility, llama.cpp must always be run with the flags: --metrics, --log-file ~/server.log, --log-colors off, and --flash-attn on.
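As a minimal sketch of the mandatory flags above, the launch command can be assembled in Python as shown here. The `llama-server` binary name on your PATH, the `build_server_command` helper, and the model path are assumptions for illustration; only the four flags come from the text.

```python
import os
import shlex


def build_server_command(model_path: str, log_file: str = "~/server.log") -> list[str]:
    """Return an argv list for llama-server with the mandatory flags set.

    Hypothetical helper: the binary name and model path are placeholders.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--metrics",                                   # expose server metrics
        "--log-file", os.path.expanduser(log_file),    # write logs to a file
        "--log-colors", "off",                         # plain text, no color codes
        "--flash-attn", "on",
    ]


cmd = build_server_command("/path/to/model.gguf")  # placeholder path
print(shlex.join(cmd))
```

Passing the list to `subprocess.Popen(cmd)` would start the server. Expanding `~` in Python avoids relying on a shell to do it.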
License & Stability: All creations are distributed under the Apache 2.0 license. Please note that the programs and files may still contain bugs, as this project is developed "Just For Fun".
The following details are extracted directly from the server logs to provide exact allocation mapping across devices:
Next-generation models like Nemotron-3-Nano-30B-A3B utilize hybrid approaches, blending State Space Models (SSM) with Mixture of Experts (MoE) techniques. LlamaBench.py accurately captures these unique memory footprints.
LlamaBench.py extracts extremely precise memory distribution data from the logs. It distinguishes between the model's actual weights, the working context cache, and temporary calculation buffers.
The script calculates external response times and cross-references them with the server's internal declared metrics.
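The cross-referencing idea can be illustrated with a small comparison helper. This is only a sketch: the function names, the 5% tolerance, and the result layout are assumptions, not the script's actual logic.

```python
def relative_deviation(external: float, internal: float) -> float:
    """Deviation of the externally measured value from the server's value, in percent."""
    if internal == 0:
        raise ValueError("internal metric must be non-zero")
    return (external - internal) * 100.0 / internal


def cross_check(external_tps: float, internal_tps: float,
                tolerance_pct: float = 5.0) -> dict:
    """Compare a client-side t/s measurement with the server-declared one.

    tolerance_pct is a hypothetical threshold chosen for illustration.
    """
    deviation = relative_deviation(external_tps, internal_tps)
    return {
        "external_tps": external_tps,
        "internal_tps": internal_tps,
        "deviation_pct": deviation,
        "within_tolerance": abs(deviation) <= tolerance_pct,
    }
```

A large negative deviation would suggest overhead outside the engine, such as network or client-side processing, since the external measurement includes it and the internal one does not.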
Quantization is an AI model compression technique. It drastically reduces a model's size so that it consumes less memory, at the cost of a minor drop in precision. Here is a brief overview of the three main formats you will encounter:
Prompt processing (PP): measures how fast the server "reads" the user's query before generating a response. Expressed in tokens per second (t/s).
Token generation (TG): the speed at which the model "writes" the answer on the screen. The most crucial parameter for end users.
Time to first token: system latency. How many milliseconds pass from clicking "send" to the appearance of the first character.
Generation time: the total time elapsed from the appearance of the first token until the generation process completes.
Time limit: set with the -t N or --time N flag. By default it is unlimited, but setting e.g. -t 2 forces the round to stop after 2 seconds of generation.
Tokens generated: the exact number of text chunks (tokens) successfully generated by the model during a single test round.
Token limit: set with the -c N or --context N flag. The default limit is 128 tokens, ensuring tests don't run indefinitely unless specified.
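The sketch below shows how the two limits and the timing KPIs above could fit together in one test round. Only the flag names and defaults are taken from the text. The `run_round` function, the injectable `clock`, and the choice to compute the generation rate as tokens divided by time since the first token are assumptions for illustration.

```python
import argparse
import time
from typing import Iterable, Optional


def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults follow the text; help strings are illustrative.
    parser = argparse.ArgumentParser(description="benchmark round limits (sketch)")
    parser.add_argument("-t", "--time", type=float, default=None,
                        help="stop after N seconds of generation (default: unlimited)")
    parser.add_argument("-c", "--context", type=int, default=128,
                        help="maximum number of tokens per round (default: 128)")
    return parser


def run_round(tokens: Iterable[str], time_limit: Optional[float],
              token_limit: int, clock=time.perf_counter) -> dict:
    """Consume a token stream and record latency, duration, and token count."""
    start = clock()          # moment the request is "sent"
    first = None             # arrival time of the first token
    last = start
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
        if count >= token_limit:
            break            # token limit reached
        if time_limit is not None and last - first >= time_limit:
            break            # time limit reached, measured from the first token
    if first is None:
        return {"ttft_ms": None, "gen_time_s": 0.0, "tokens": 0, "tg_tps": 0.0}
    gen_time = last - first
    # Assumption: rate = tokens / time since first token; other conventions exist.
    tg = count / gen_time if gen_time > 0 else 0.0
    return {"ttft_ms": (first - start) * 1000.0, "gen_time_s": gen_time,
            "tokens": count, "tg_tps": tg}
```

With `args = build_parser().parse_args(["-t", "2"])`, a round would be started as `run_round(stream, args.time, args.context)`, where `stream` is any iterator that yields tokens as they arrive. Taking the clock as a parameter keeps the function testable without real delays.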
A dedicated web application placed in the same directory to visualize and manage benchmark reports.
Editing and file management functions require a password, verified server-side via SHA hash. Read-only access is public.
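A server-side hash check of this kind might look like the sketch below. The text only says "SHA hash", so the use of SHA-256, the absence of a salt, and both function names are assumptions. The web application itself may be written in a different language.

```python
import hashlib
import hmac


def hash_password(password: str) -> str:
    """Return the hex SHA-256 digest of a password (assumed variant, unsalted)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def verify_password(candidate: str, stored_hash: str) -> bool:
    """Compare the candidate's digest with the stored one in constant time."""
    return hmac.compare_digest(hash_password(candidate), stored_hash)
```

Only the digest is stored on the server. `hmac.compare_digest` is used instead of `==` so the comparison time does not reveal how many leading characters matched.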
Dynamically displays GPUs, Model names, TG, PP, and Watts. Features dual search fields for primary filtering and narrowing down results without page reloads.
Automatically hides raw serverlog_*.json files from the main view, pairing them in the background with their corresponding benchmark reports.
Clicking a report opens a modal window that fetches full server execution logs asynchronously, preventing the main webpage from freezing.
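The background pairing of serverlog_*.json files with their reports, described above, could work roughly as follows. The naming rule used here (the log for report `X.json` is `serverlog_X.json`) is purely hypothetical, since the text does not state how the files are matched. The `pair_reports` function is likewise illustrative.

```python
from typing import Dict, Iterable, Optional

LOG_PREFIX = "serverlog_"  # prefix taken from the text


def pair_reports(filenames: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each visible report to its raw server log, or None if absent.

    Assumed convention: log name = LOG_PREFIX + report name.
    """
    names = set(filenames)
    pairs: Dict[str, Optional[str]] = {}
    for name in sorted(names):
        # Raw logs and non-JSON files never appear in the main view.
        if name.startswith(LOG_PREFIX) or not name.endswith(".json"):
            continue
        log_name = LOG_PREFIX + name
        pairs[name] = log_name if log_name in names else None
    return pairs
```

The keys of the returned mapping form the main view. The values tell the report modal which log file to fetch on demand.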
| GPUs | Model | TG | Watts |
|---|---|---|---|
| [RPC0] [CUDA0] | Bielik-11B-v3.0 | 36.67 | 180 W |
| [ROCm0] | Nemotron-3-Nano | 85.98 | 210 W |
Multi-GPU Computing Cluster Automation for Windows and Linux systems.
Bash & Python Automation
Tools designed for Linux environments. Scripts independently manage engines (CUDA, ROCm, Vulkan) and fully autonomously calculate available VRAM using native GPU functions. They automate background process management (daemons) and safely release memory when switching models.
PowerShell Automation
An advanced PowerShell script that eliminates ghost processes. It dynamically queries the server via HTTP to verify full weight loading and RPC cluster readiness. It instantly releases hardware resources with a single keystroke, protecting the system from memory overflow.
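The readiness check described above boils down to polling the server until it reports that the model is loaded. The actual tool is a PowerShell script and the text does not say which endpoint it queries. The Python sketch below assumes the llama.cpp server's `/health` endpoint, and the function names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: probe_health("http://127.0.0.1:8080"))`, where the address is a placeholder. This sketch does not cover RPC cluster readiness, because the text does not describe how that is verified.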
Direct access to the script files and documentation stored on the server.
For a complete deep-dive into the architecture, precision stability tests, and dual-JSON logic, please refer to the included detailed documentation.
Special thanks to David from Country Boy Computers. His videos helped me take my first steps with llama.cpp and the AMD MI50 accelerator.
This project is a direct evolution of his original llamabench2.py, developed as a token of gratitude for the time he dedicated to testing and his relentless fight against drivers, configurations, and the glorious chaos of rapidly evolving AI technology.