Diagnostic Tool for Large Language Models

Running large language models well requires precise tools that monitor critical performance parameters in real time. LlamaBench.py acts as a diagnostic station and dynamometer for your llama.cpp engine.

⚠️

Important Disclaimer

The majority of the data parsed by LlamaBench.py depends entirely on the server logs generated by the llama.cpp engine. Changes, updates, or format modifications made by the llama.cpp developers may cause parsing errors or missing data. This script is optimized and tested specifically against Server Build: b8262-23fbfcb1a.

Mandatory Flags: Remember that for full compatibility, llama.cpp must always be run with the flags: --metrics, --log-file ~/server.log, --log-colors off, and --flash-attn on.
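A pre-flight check for the flags listed above could look like the following minimal sketch. The helper name `missing_flags` and the sample command line are illustrative assumptions, not part of LlamaBench.py; the function only inspects a command string and never starts the server.

```python
import shlex

# Flags the text above lists as mandatory.
# A value of None means the flag takes no argument.
REQUIRED = {
    "--metrics": None,
    "--log-file": "~/server.log",
    "--log-colors": "off",
    "--flash-attn": "on",
}

def missing_flags(command: str) -> list[str]:
    """Return the required flags that are absent or carry a different value."""
    args = shlex.split(command)
    problems = []
    for flag, expected in REQUIRED.items():
        if flag not in args:
            problems.append(flag)
            continue
        if expected is not None:
            idx = args.index(flag)
            value = args[idx + 1] if idx + 1 < len(args) else None
            if value != expected:
                problems.append(flag)
    return problems

# Illustrative command line (hypothetical model path); --flash-attn is left out on purpose.
cmd = "llama-server -m model.gguf --metrics --log-file ~/server.log --log-colors off"
print(missing_flags(cmd))  # ['--flash-attn']
```

Checking the command before launching a benchmark round is cheaper than discovering missing data in the report afterwards.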

License & Stability: All creations are distributed under the Apache 2.0 license. Please note that the programs and files may still contain bugs, as this project is developed "Just For Fun".

Understanding Screen Results & Memory

The sample run below shows the hardware mapping reported by the script; the sections that follow explain the core KPIs.

user@server:~/llama$ python3 llamabench.py -r 2 -t 2
[*] Docker container 'llama-master-rtx' detected mapping external port 8080.
================================================================================
--- SYSTEM & HARDWARE --- | --- MEMORY ALLOCATION ---
Source:       Docker
Architecture: nemotron_h_moe
Port:         8080
MMAP Status:  ON
VRAM Model:   23197.47 MiB (53/53 layers) [ROCm0]
KV Cache:     5481.00 MiB [f16]
Compute Buf:  4596.59 MiB
SSM/RS State: 47.62 MiB
RAM Layers:   0.00 MiB
--------------------------------------------------------------------------------
Round 01: PP = 719.91 t/s | TG = 85.98 t/s | TTFT = 653.13 ms | Gen Time = 1488.67 ms (1.48 s) | Tokens = 128
Round 02: PP = 715.42 t/s | TG = 86.12 t/s | TTFT = 649.20 ms | Gen Time = 1485.10 ms (1.48 s) | Tokens = 128
================================================================================

Raw Memory Data Analysis

The following details are extracted directly from the server logs to provide exact allocation mapping across devices:

"VRAM Model":   "23197.47 MiB (53/53 layers) [ROCm0: 23197.5]",
"KV Cache":     " 5481.00 MiB [f16] [ROCm0: 5481.0]",
"Compute Buf":  " 4596.59 MiB [ROCm0: 2759.1 | ROCm_Host: 1837.5]",
"SSM/RS State": " 47.62 MiB [ROCm0: 47.6]",
"RAM BPE/Meta": " 231.00 MiB",
"RAM Layers":   " 0.00 MiB (0/53 layers)"
🧩

Advanced Architectures (SSM & MoE)

Next-generation models like Nemotron-3-Nano-30B-A3B utilize hybrid approaches, blending State Space Models (SSM) with Mixture of Experts (MoE) techniques. LlamaBench.py accurately captures these unique memory footprints.

State Space Models (SSM/RS): Instead of a traditional KV cache, which grows linearly with context length, models like Mamba keep a Recurrent State (RS) whose size stays fixed no matter how long the context is, so it requires significantly less memory. The benchmark detects this allocation and logs it as SSM/RS State.
Mixture of Experts (MoE): These models contain multiple specialized sub-networks ("experts"). For any given token, only a fraction of the experts is activated, which lets a 30B-parameter model generate at the speed of a much smaller one. All experts must still be loaded, so the memory footprint remains that of the full model.
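To see why a recurrent state is so much cheaper than a KV cache, the sketch below compares the two with back-of-the-envelope arithmetic. All dimensions (layer count, head count, state size) are hypothetical and do not describe Nemotron or any other specific model; the KV formula is the usual "K and V for every layer and every cached token" estimate.

```python
def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V tensors for every attention layer and every cached token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 2**20

def recurrent_state_mib(n_layers, state_elems_per_layer, bytes_per_elem=4):
    """One fixed-size state per recurrent layer; n_ctx does not appear at all."""
    return n_layers * state_elems_per_layer * bytes_per_elem / 2**20

# Hypothetical dimensions, chosen only to make the trend visible.
for n_ctx in (4096, 8192, 16384):
    kv = kv_cache_mib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
    rs = recurrent_state_mib(n_layers=32, state_elems_per_layer=262144)
    print(f"n_ctx={n_ctx:6d}  KV={kv:8.1f} MiB  RS={rs:6.1f} MiB")
```

Doubling the context doubles the KV figure while the recurrent-state figure does not move, which is the pattern behind the small SSM/RS State line in the sample output.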
🧠

Hardware & Memory Allocation

LlamaBench.py extracts a detailed memory breakdown from the logs. It distinguishes between the model's weights, the working context cache (KV cache), and temporary compute buffers.

VRAM Offloading: The tool reports how many of the model's layers fit into fast GPU memory (for example, 53/53 layers). Layers that spill over into regular system RAM run far slower and bottleneck the entire system.
RPC Discovery: The script detects RPC (Remote Procedure Call) devices, so it can map distributed workloads in which multiple physical machines (workers) process a single model over the network.
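The per-device breakdown strings the tool reports (for example, a compute buffer split between ROCm0 and ROCm_Host) can be turned back into numbers with a few lines of Python. This is a minimal sketch; `parse_allocation` is a hypothetical helper and the regular expressions assume the exact string layout shown in this document.

```python
import re

def parse_allocation(value: str) -> tuple[float, dict[str, float]]:
    """Split a string such as '4596.59 MiB [ROCm0: 2759.1 | ROCm_Host: 1837.5]'
    into the total in MiB and a per-device mapping."""
    total = float(re.search(r"([\d.]+)\s*MiB", value).group(1))
    devices = {
        name: float(mib)
        for name, mib in re.findall(r"(\w+):\s*([\d.]+)", value)
    }
    return total, devices

total, devices = parse_allocation(" 4596.59 MiB [ROCm0: 2759.1 | ROCm_Host: 1837.5]")
print(total, devices)

# The per-device figures should add up to the total, within rounding.
assert abs(sum(devices.values()) - total) < 0.1
```

A sanity check like the final assertion is a cheap way to notice when a log format change has silently broken the parsing.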
⚖️

Smart Comparison

The script calculates external response times and cross-references them with the server's internal declared metrics.

Impact: If the server claims 50 t/s but the benchmark measures 30 t/s, the gap points to a network bottleneck or an overloaded HTTP stack on the host machine.
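The comparison can be sketched as follows. The function name, the returned fields, and the 10% tolerance are assumptions made for illustration; the script's actual thresholds are not documented here.

```python
def compare_throughput(server_tps: float, tokens: int, wall_seconds: float,
                       tolerance: float = 0.10) -> dict:
    """Compare the server-reported rate with the rate measured by the client."""
    client_tps = tokens / wall_seconds
    # Fraction of the server-side rate that is lost before it reaches the client.
    gap = (server_tps - client_tps) / server_tps
    return {
        "client_tps": client_tps,
        "gap": gap,
        "overhead_suspected": gap > tolerance,
    }

# The example from the text: the server reports 50 t/s, the client sees 30 t/s.
result = compare_throughput(server_tps=50.0, tokens=150, wall_seconds=5.0)
print(result)
```

Expressing the gap as a fraction of the server-side figure keeps the check meaningful for both slow and fast models.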
🗜️

Quantization Formats (e.g., [q4_0])

Quantization is a model compression technique. It drastically reduces a model's size so that it consumes less memory, at the cost of some precision. Here is a brief overview of the three main formats you will encounter:

f16 (Uncompressed): No compression. Each model weight is stored in sixteen bits. This is the reference format with the highest quality, but it requires the most VRAM.
q8 (Light Compression): Trims weights to eight bits, cutting the overall model size roughly in half. The precision drop is negligible and rarely noticeable in normal conversation, which makes it a good compromise between quality and memory demand.
q4_0 (Strong Compression): Reduces weights to just four bits, making the model file almost four times smaller than the f16 original. It provides the greatest memory savings of the three but comes with a measurable drop in quality on highly complex problems.
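The size ratios above can be estimated with simple arithmetic. The sketch below assumes the usual block layout of the q8_0 and q4_0 formats (32 weights plus one 16-bit scale per block, hence 8.5 and 4.5 bits per weight) and ignores metadata and tensors kept at higher precision, so real files will differ somewhat. The 30B parameter count is only an example.

```python
# Approximate bits per weight. q8_0 and q4_0 store weights in blocks of 32
# together with one 16-bit scale per block, hence the extra 0.5 bit.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def model_size_gib(n_params: float, fmt: str) -> float:
    """Estimate the size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:5s} {model_size_gib(30e9, fmt):6.1f} GiB")
```

The per-block scale is why the savings are slightly less than an exact factor of two or four.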
📖

Prompt Processing (PP)

Measures how fast the server "reads" the user's query before generating a response. Expressed in tokens per second (t/s).

Impact: A high PP score (e.g., 719.91 t/s) means long questions and attached documents are read quickly: at that rate, a 1,000-token prompt takes about 1.4 seconds.
✍️

Token Generation (TG)

The speed at which the model "writes" the answer on the screen, expressed in tokens per second (t/s). This is the parameter end users notice most.

Impact: An average adult reads roughly 3 to 5 words per second. A TG score above 30 t/s comfortably outpaces that, so the reader never has to wait for the text.
âąī¸

Time To First Token (TTFT)

System latency. How many milliseconds pass from clicking "send" to the appearance of the first character.

Impact: Critical for interactive systems. A result under 200 ms is practically imperceptible to humans. Because the prompt has to be processed before the first token can appear, TTFT grows with prompt length.
âŗ

Gen Time (Generation Time)

The total time elapsed from the appearance of the first token until generation completes.

Control Flag: Can be restricted using the -t N or --time N flag. By default it is unlimited, but setting, for example, -t 2 forces the round to stop after 2 seconds of generation unless the token limit is reached first (as in the sample run above, where 128 tokens were generated in 1.48 s).
📝

Tokens (Generated Count)

The exact number of text chunks (tokens) successfully generated by the model during a single test round.

Control Flag: Regulated by the -c N or --context N flag. The default limit is 128 tokens, so a round ends after 128 tokens unless you specify otherwise.
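The KPIs above are tied together by simple arithmetic. The sketch below derives TTFT, Gen Time, and TG from three timestamps; the function name and its inputs are hypothetical, and the formula TG = Tokens / Gen Time is an assumption that happens to be consistent with Round 01 of the sample run.

```python
def round_kpis(sent_at: float, first_token_at: float, done_at: float,
               generated_tokens: int) -> dict:
    """Derive per-round KPIs from timestamps given in seconds."""
    ttft_ms = (first_token_at - sent_at) * 1000
    gen_ms = (done_at - first_token_at) * 1000
    tg_tps = generated_tokens / (gen_ms / 1000)
    return {"TTFT_ms": ttft_ms, "GenTime_ms": gen_ms, "TG_tps": tg_tps}

# Timings taken from Round 01: TTFT = 653.13 ms, Gen Time = 1488.67 ms, Tokens = 128.
r = round_kpis(0.0, 0.65313, 0.65313 + 1.48867, 128)
print({k: round(v, 2) for k, v in r.items()})
```

With the Round 01 timings this gives a TG of about 85.98 t/s, matching the value printed by the tool.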

LlamaBench Logs UI

A dedicated web application placed in the same directory to visualize and manage benchmark reports.

🔐 Secure Authorization

Editing and file management functions require a password, which is verified server-side against a SHA hash. Read-only access is public.

📊 Interactive Results Table

Dynamically displays GPUs, Model names, TG, PP, and Watts. Features dual search fields for primary filtering and narrowing down results without page reloads.

🧠 Smart Log Pairing

Automatically hides raw serverlog_*.json files from the main view, pairing them in the background with their corresponding benchmark reports.

⚡ Asynchronous Server Logs

Clicking a report opens a modal window that fetches full server execution logs asynchronously, preventing the main webpage from freezing.

Llamabench6.2 Statistics v2

GPUs             Model             TG      Watts
[RPC0] [CUDA0]   Bielik-11B-v3.0   36.67   180 W
[ROCm0]          Nemotron-3-Nano   85.98   210 W

Llama.cpp Runner

Multi-GPU Computing Cluster Automation for Windows and Linux systems.

🐧

Linux Edition

Bash & Python Automation

Tools designed for Linux environments. Scripts independently manage engines (CUDA, ROCm, Vulkan) and fully autonomously calculate available VRAM using native GPU functions. They automate background process management (daemons) and safely release memory when switching models.

🪟

Windows Edition

PowerShell Automation

An advanced PowerShell script that eliminates ghost processes. It dynamically queries the server via HTTP to verify full weight loading and RPC cluster readiness. It instantly releases hardware resources with a single keystroke, protecting the system from memory overflow.

Downloads Repository

Direct access to the script files and documentation stored on the server.

More About the Program

For a complete deep-dive into the architecture, precision stability tests, and dual-JSON logic, please refer to the included detailed documentation.

🙏

Special Thanks

Special thanks to David from Country Boy Computers. His videos helped me take my first steps with llama.cpp and the AMD MI50 accelerator.

This project is a direct evolution of his original llamabench2.py, developed as a token of gratitude for the time he dedicated to testing and his relentless fight against drivers, configurations, and the glorious chaos of rapidly evolving AI technology.

▶️ Visit Country Boy Computers