LlamaBench System Architecture

Comprehensive documentation for v6.3 (Core Engine) and v7.1 (Hardware Telemetry)

1. Command Line Interface (Help)

Usage: llamabench.py [OPTIONS]
Core Options:
  • -r, --rounds N Number of test rounds (default: 5). For example, -r 7 runs seven rounds.
  • -t, --time N Time limit per round in seconds (default: unlimited).
  • -c, --context N Context token limit per round (default: 128).
  • -p, --port N Target API port (default: 8080).
  • -h, --help Show this help message and exit.
Advanced Testing Flags (NEW):
  • -stw, --show-test-words
    Display the first 100 words of the stability test response. If not set, the stability test output is hidden to keep logs clean.
  • -cx, --context-fill N
    Enable incremental context overflow testing. Each round adds exactly N words of filler text to the prompt, simulating a progressively growing chat history to test cache performance.
  • -nt, --nothink
    Instruct the model to skip internal reasoning (adds reasoning_budget=0).
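
The option list above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual parser: the function name build_parser, the help strings, and the use of None for "unlimited" or "disabled" are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="llamabench.py")

    core = parser.add_argument_group("Core Options")
    core.add_argument("-r", "--rounds", type=int, default=5, metavar="N",
                      help="Number of test rounds (default: 5)")
    # None stands in for "unlimited" here (an assumption).
    core.add_argument("-t", "--time", type=int, default=None, metavar="N",
                      help="Time limit per round in seconds")
    core.add_argument("-c", "--context", type=int, default=128, metavar="N",
                      help="Context token limit per round (default: 128)")
    core.add_argument("-p", "--port", type=int, default=8080, metavar="N",
                      help="Target API port (default: 8080)")

    adv = parser.add_argument_group("Advanced Testing Flags")
    adv.add_argument("-stw", "--show-test-words", action="store_true",
                     help="Show the first 100 words of the stability test")
    # None means context-fill testing is disabled (an assumption).
    adv.add_argument("-cx", "--context-fill", type=int, default=None,
                     metavar="N", help="Add N filler words per round")
    adv.add_argument("-nt", "--nothink", action="store_true",
                     help="Skip internal reasoning (reasoning_budget=0)")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse adds -h/--help automatically, so it does not need its own add_argument call.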

2. Parameter Architecture and Configuration

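The benchmarking tool uses a flexible parameter architecture to control the workload dynamically. Each round ends on one of two mutually exclusive termination conditions (a token boundary or a time boundary), and a separate guard protects context-fill runs from overflowing the server context: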

  • Token Boundary (-c): The round strictly terminates when the generation reaches the specified token count.
  • Time Boundary (-t): The script switches to HTTP streaming mode, forcefully closing the connection when the absolute time limit (in seconds) is exceeded.
  • Context Overflow Protection (-cx): Before running the context-fill (VRAM degradation) test, the script checks whether the requested number of rounds multiplied by the filler size per round would exceed the server's final_ctx. If it would, the tool lowers the -r value so that the llama.cpp backend does not run out of memory and crash.
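
The overflow guard can be sketched as a small pure function. This is an illustration under stated assumptions, not the script's own logic: the name cap_rounds, the tokens_per_word ratio, and the reserve_tokens headroom are all hypothetical, since the document does not say how words are converted to tokens.

```python
def cap_rounds(rounds: int, fill_words: int, final_ctx: int,
               tokens_per_word: float = 1.0, reserve_tokens: int = 128) -> int:
    """Return the largest round count whose cumulative filler fits final_ctx.

    tokens_per_word and reserve_tokens are assumed parameters: the first
    estimates how many tokens one filler word costs, the second keeps room
    for the generated reply.
    """
    if fill_words <= 0:
        return rounds  # context fill disabled, nothing to cap
    tokens_per_round = fill_words * tokens_per_word
    budget = final_ctx - reserve_tokens
    max_rounds = int(budget // tokens_per_round)
    return min(rounds, max(0, max_rounds))


if __name__ == "__main__":
    # Five rounds of 25000 filler words against a 512512-token context.
    print(cap_rounds(5, 25000, 512512))
    # The same filler against a much smaller context gets capped.
    print(cap_rounds(5, 25000, 100000))
```

A return value of 0 signals that even a single round of filler would not fit, which a caller would presumably treat as an error.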

3. Innovative Docker-Host Environment Detection

LlamaBench does not require manual configuration of IPs or paths. It uses a prioritized heuristic discovery protocol:

  1. Local API Check: The script probes 127.0.0.1:[PORT]/v1/models. If the endpoint responds, Native Execution is assumed.
  2. Docker Socket Parsing: If the local probe fails, the script runs docker ps and parses the port mappings to identify the container that publishes the requested host port.
  3. Log Extraction: Depending on the discovered source, it either reads ~/server.log directly from the filesystem or uses docker logs [container_name] to extract native internal metrics safely.
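
Step 2 of the discovery protocol can be illustrated with a parser over docker ps output. This is a sketch under assumptions: it expects the output of docker ps --format "{{.Names}}\t{{.Ports}}", and the function names and sample container names are hypothetical. Port ranges (for example 8000-8010->8000-8010/tcp) are not handled.

```python
import re
import subprocess
from typing import Optional

# Matches the published host port in entries such as
# "0.0.0.0:8080->8080/tcp" or ":::8080->8080/tcp".
PORT_MAPPING = re.compile(r":(\d+)->\d+/(?:tcp|udp)")


def docker_ps_output() -> str:
    """Run docker ps and return one 'name<TAB>ports' line per container."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Ports}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_container_for_port(ps_output: str, port: int) -> Optional[str]:
    """Return the name of the container publishing the given host port."""
    for line in ps_output.splitlines():
        name, _, ports = line.partition("\t")
        published = {int(p) for p in PORT_MAPPING.findall(ports)}
        if port in published:
            return name.strip()
    return None


if __name__ == "__main__":
    # Illustrative output; the container names are made up.
    sample = (
        "llama-server\t0.0.0.0:8080->8080/tcp, :::8080->8080/tcp\n"
        "open-webui\t0.0.0.0:3000->8080/tcp\n"
    )
    print(find_container_for_port(sample, 8080))
```

Keeping the parser separate from the subprocess call makes it possible to test the matching logic without a running Docker daemon.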

4. Key Performance Indicators (KPI)

The performance analysis is broken down into five critical metrics to evaluate different bottlenecks in the LLM pipeline:

  • PP (Prompt Processing): The speed (in tokens/sec) at which the model ingests the initial user query. With -cx, this is the computed speed of reading only the newly appended text chunk.
  • PPC (Prompt Processing Cached): Available only with -cx. The overall processing speed reported directly by the server, indicating how fast the GPU iterates over the previously stored KV cache.
  • PPCT (Prompt Context Target): The current span of tokens/words acting as the filler block, showing the growing VRAM footprint (e.g., 25000 to 50000).
  • TG (Token Generation): The sustained output speed (in tokens/sec) during the autoregressive generation phase.
  • TTFT (Time To First Token): The latency (in milliseconds) before the server returns the first piece of data.
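
The throughput metrics reduce to simple arithmetic. The sketch below assumes that TG is the generated token count divided by the generation time, which is consistent with Round 01 of the standard run in section 8 (128 tokens in 2511.24 ms, reported as 50.97 t/s). The helper names are hypothetical.

```python
from statistics import mean


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/sec."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return tokens / (elapsed_ms / 1000.0)


def average_ms(samples: list) -> float:
    """Arithmetic mean of per-round latencies, rounded like the report."""
    return round(mean(samples), 2)


if __name__ == "__main__":
    # Round 01 of the standard run: 128 tokens generated in 2511.24 ms.
    print(round(tokens_per_second(128, 2511.24), 2))
    # Per-round TTFT values from the same run.
    print(average_ms([790.70, 787.36, 794.65, 789.41, 790.71]))
```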

5. Hardware and VRAM Monitoring (v6.3 & v7.1)

LlamaBench operates in two distinct branches, sharing the same core memory extraction logic but differing in live hardware telemetry.

v6.3 / v7.1 Core Memory Extraction

By parsing the highly verbose llama.cpp startup logs, the script maps out the exact memory allocation footprint across devices:

  • VRAM Model: Static weights loaded onto the GPU(s).
  • KV Cache: Attention context size (with quantization type, e.g., q8_0).
  • Compute Buf: Temporary working memory allocated dynamically during matrix multiplication.
  • SSM/RS State: Mamba/RNN recurrent state memory blocks.
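
A per-device memory map like the one above can be built with a regular expression over the startup log. The line format assumed here ("<device> <kind> buffer size = <n> MiB") and the log prefixes in the sample are assumptions; llama.cpp log wording varies between builds, so the pattern would need checking against the actual server output. The sample values are taken from the run shown in section 8.

```python
import re
from collections import defaultdict

# Assumed line shape: "<prefix>: <device> <kind> buffer size = <n> MiB".
BUFFER_LINE = re.compile(
    r"(?P<device>\S+)\s+(?P<kind>model|KV|compute|RS)\s+buffer size"
    r"\s*=\s*(?P<mib>\d+(?:\.\d+)?)\s*MiB"
)


def parse_buffers(log_text: str) -> dict:
    """Return {kind: {device: MiB}} summed over all matching log lines."""
    totals = defaultdict(dict)
    for match in BUFFER_LINE.finditer(log_text):
        per_device = totals[match["kind"]]
        device = match["device"]
        per_device[device] = per_device.get(device, 0.0) + float(match["mib"])
    return {kind: dict(devices) for kind, devices in totals.items()}


if __name__ == "__main__":
    # Illustrative lines only; the prefixes are not copied from a real log.
    sample = """
    load_tensors:   ROCm0 model buffer size = 18470.90 MiB
    load_tensors:   ROCm1 model buffer size =  8314.00 MiB
    llama_kv_cache: ROCm0 KV buffer size =  3722.50 MiB
    llama_kv_cache: ROCm1 KV buffer size =  1595.30 MiB
    """
    for kind, devices in parse_buffers(sample).items():
        print(kind, devices, round(sum(devices.values()), 2))
```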

v7.1 Live Hardware Telemetry (Combo-Hybrid)

Version 7.1 spawns a background thread that continuously polls hardware sensors, using multi-vendor fallback logic:

  • Power Draw: Tracks total system wattage using Intel RAPL for the CPU and rocm-smi/nvidia-smi for GPUs. Records idle (watts_start) and absolute peak values.
  • Thermals: Monitors Core, Junction, and HBM memory temperatures to detect throttling.
  • Clocks & Load: Extracts peak SCLK/MCLK frequencies and maximum GPU utilization percentages during the benchmark.
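
The polling pattern can be sketched with a daemon thread that records the first and the highest reading. This is a generic illustration, not the v7.1 implementation: PeakMonitor, first_available, and the injected reader callables are hypothetical, and real readers would wrap RAPL files or rocm-smi/nvidia-smi calls.

```python
import threading
import time
from typing import Callable, Iterable, Optional

Reader = Callable[[], Optional[float]]


def first_available(readers: Iterable[Reader]) -> Optional[float]:
    """Try each sensor reader in order and return the first usable value."""
    for reader in readers:
        try:
            value = reader()
        except (OSError, ValueError):
            continue  # sensor missing or unreadable, fall through to next
        if value is not None:
            return value
    return None


class PeakMonitor:
    """Polls a reader in the background, keeping start and peak values."""

    def __init__(self, read_watts: Reader, interval_s: float = 0.5):
        self._read = read_watts
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.watts_start: Optional[float] = None
        self.watts_peak: Optional[float] = None

    def _run(self) -> None:
        while not self._stop.is_set():
            value = self._read()
            if value is not None:
                if self.watts_start is None:
                    self.watts_start = value
                if self.watts_peak is None or value > self.watts_peak:
                    self.watts_peak = value
            self._stop.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


if __name__ == "__main__":
    fake = iter([50.0, 120.0, 90.0])  # stand-in for real sensor readings
    monitor = PeakMonitor(lambda: next(fake, None), interval_s=0.01)
    monitor.start()
    time.sleep(0.2)
    monitor.stop()
    print(monitor.watts_start, monitor.watts_peak)
```

Using threading.Event.wait for the delay lets stop() interrupt the sleep immediately instead of waiting out a full polling interval.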

6. Reporting System

To support long-term archival and data analysis, LlamaBench generates two JSON files per run, stored in the llamabench_logs directory.

result_*.json

A clean, lightweight summary containing hardware specs, parsed VRAM allocations, final averaged KPIs (PP, TG, TTFT), and human-readable round_summaries strings. It is intended for quick parsing by visualization tools.

serverlog_*.json

A heavy archival file containing the raw, unedited dump of the llama.cpp output. Crucially, this file now also houses the test_data array, capturing the exact prompt sent and the full textual response received for every single round.
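
Writing the two files could look roughly like this. Only the result_ and serverlog_ prefixes, the llamabench_logs directory, and the test_data and round_summaries names come from the description above; the function name, the tag argument, and every other key are hypothetical.

```python
import json
from pathlib import Path


def write_reports(log_dir: str, tag: str, summary: dict,
                  server_log: str, test_data: list) -> tuple:
    """Write result_<tag>.json and serverlog_<tag>.json into log_dir."""
    out = Path(log_dir)
    out.mkdir(parents=True, exist_ok=True)

    result_path = out / f"result_{tag}.json"
    serverlog_path = out / f"serverlog_{tag}.json"

    # Lightweight summary file.
    result_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")

    # Archival file: raw server output plus per-round prompts and responses.
    # The "server_log" key is an assumed name.
    archive = {"server_log": server_log, "test_data": test_data}
    serverlog_path.write_text(json.dumps(archive, indent=2), encoding="utf-8")
    return result_path, serverlog_path


if __name__ == "__main__":
    paths = write_reports(
        "llamabench_logs", "example",
        {"round_summaries": ["Round 01: PP = 628.56 t/s | TG = 50.97 t/s"]},
        "raw server output goes here",
        [{"round": 1, "prompt": "...", "response": "..."}],
    )
    print(paths)
```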

7. Architecture of New Models and the [CACHED PROMPT] Phenomenon

During testing, you may encounter the [CACHED PROMPT] tag accompanying significantly faster PP times. The tag comes from the script's internal heuristic, which flags rounds where the LLM engine appears to have skipped recomputing attention for the prompt because an identical prompt history was already held in the KV cache in VRAM.

Transformer Architecture

Models based on the standard architecture (like Llama 3 or Qwen) store the conversation history as a large KV cache. If the beginning of the prompt matches a previous query, the server skips recomputing the shared prefix, drastically reducing TTFT.
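
The idea of reusing a matching prompt prefix can be illustrated with token lists. This is a deliberately simplified sketch: real servers compare token IDs held in their own cache structures, and the function names and sample IDs here are hypothetical.

```python
def shared_prefix_len(cached_tokens: list, new_tokens: list) -> int:
    """Count how many leading tokens two prompts have in common."""
    count = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        count += 1
    return count


def tokens_to_process(cached_tokens: list, new_tokens: list) -> int:
    """Tokens that still need prompt processing after prefix reuse."""
    return len(new_tokens) - shared_prefix_len(cached_tokens, new_tokens)


if __name__ == "__main__":
    previous = [101, 7, 42, 9]       # prompt from the last round
    current = [101, 7, 42, 55, 60]   # same start, new ending
    print(shared_prefix_len(previous, current))
    print(tokens_to_process(previous, current))
```

The longer the shared prefix, the fewer tokens go through prompt processing, which is why a cache hit shows up as a lower TTFT.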

Mamba Architecture (RNN/SSM)

Instead of growing a KV cache, these models "compress" the context into a small, fixed-size Recurrent State (RS). This allows Mamba models to operate up to 5x faster with long contexts without consuming large amounts of VRAM. However, resuming from a saved state is difficult with this architecture, which often forces the prompt to be recalculated from scratch.


🔍 Diagnostic Summary: The appearance of the [CACHED PROMPT] tag together with a significant drop in TTFT indicates that prompt caching in VRAM is working correctly with the backend. The absence of this tag in hybrid models (e.g., the nemotron_h_moe architecture) is not a software error; it stems from the way recurrent memory (RS) is handled.

8. Benchmark Executions (v7.1 Telemetry)

Below are two sample terminal sessions. The first shows a standard API load test. The second shows the -cx (Context Fill) test, which is designed to push VRAM usage to the limit.

Standard Execution (5 Rounds, Default Context)

doman@LianLi:/home/samba/llama_bench$ ./llamabench7.1.py
Power monitor initialized, waiting for hardware readings (1s)...
[WARNING] Sudo privileges missing for accurate CPU tracking. Using CPU_WAT_FALLBACK.
[*] Local API detected on 127.0.0.1:8080. Assuming native host execution.
=========================================================================================================
--- SYSTEM & HARDWARE ---         | --- MEMORY ALLOCATION ---
---------------------------------------------------------------------------------------------------------
Source: Native Host               | VRAM Model: 26784.89 MiB (41/41 layers) [ROCm0: 18470.9 | ROCm1: 8314.0]
System: Linux 6.17.0-20-generic   | KV Cache: 5317.81 MiB [q8_0] [ROCm0: 3722.5 | ROCm1: 1595.3]
Architecture: qwen35moe           | Compute Buf: 5867.65 MiB [ROCm0: 2345.4 | ROCm1: 1511.1 | ROCm_Host: 2011.1]
Detected Port:8080                | SSM/RS State: 125.63 MiB [ROCm0: 87.9 | ROCm1: 37.7]
MoE Experts: 8 active / 256 total | RAM BPE/Meta: 397.85 MiB
MMAP Status: ON                   | RAM Layers: 0.00 MiB (0/41 layers)
---------------------------------------------------------------------------------------------------------
CPU Model: 12th Gen Intel(R) Core(TM) i5-12400F
GPU Drivers: ROCM: 7.2.1.70201, VULKAN: 1.3.275
GPUs: [ROCm0] AMD Radeon Graphics (32732 MiB) [ROCm1] AMD Radeon VII (16340 MiB)
=========================================================================================================
============================================================
--- System Breakdown ---
Server Build: b8651-d3416a4aa
Context Limit: 512,512 tokens
------------------------------------------------------------
Benchmarking: Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K.gguf (5 rounds)
------------------------------------------------------------
Stability: OK.
(Response hidden, use -stw to view)
------------------------------------------------------------
Round 01: PP = 628.56 t/s | TG = 50.97 t/s | TTFT = 790.70 ms | Gen Time = 2511.24 ms (2.51 s) | Tokens = 128
Round 02: PP = 628.68 t/s | TG = 50.89 t/s | TTFT = 787.36 ms | Gen Time = 2515.08 ms (2.52 s) | Tokens = 128
Round 03: PP = 622.91 t/s | TG = 50.69 t/s | TTFT = 794.65 ms | Gen Time = 2525.08 ms (2.53 s) | Tokens = 128
Round 04: PP = 628.32 t/s | TG = 51.08 t/s | TTFT = 789.41 ms | Gen Time = 2505.75 ms (2.51 s) | Tokens = 128
Round 05: PP = 627.28 t/s | TG = 51.09 t/s | TTFT = 790.71 ms | Gen Time = 2505.23 ms (2.51 s) | Tokens = 128
============================================================
FINAL AVERAGES - Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K.gguf
------------------------------------------------------------
Configured Token Limit (TG): 128
Average Tokens Generated: 128.0 tokens
Average Latency (TTFT): 790.57 ms
Average Gen Time (TG): 2512.48 ms (2.51 s)
📈 Token Generation summary (tok/s)
benchmark (average): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 50.9
server data:         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 30.4 (-40.3%)
📈 Prompt Processing summary (tok/s)
benchmark (average): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 627.2
server data:         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 505.9 (-19.3%)
============================================================
Report generated successfully in '/home/samba/llama_bench/llamabench_logs':
- Main Results: result_Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K_ROCm0-ROCm1_ctx512512_linux.json
- Server Logs: serverlog_Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K_ROCm0-ROCm1_ctx512512_linux.json
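
The percentages printed next to "server data" in the standard run are consistent with a relative difference measured against the benchmark average. The sketch below reproduces them from the rounded figures in the report; the formula is inferred from the output, not taken from the script, and the function name is hypothetical.

```python
def relative_diff_percent(benchmark: float, server: float) -> float:
    """Percentage by which the server figure differs from the benchmark."""
    if benchmark == 0:
        raise ValueError("benchmark value must be non-zero")
    return (server - benchmark) / benchmark * 100.0


if __name__ == "__main__":
    # Token Generation summary: benchmark 50.9 tok/s, server data 30.4 tok/s.
    print(f"{relative_diff_percent(50.9, 30.4):+.1f}%")
    # Prompt Processing summary: benchmark 627.2 tok/s, server data 505.9.
    print(f"{relative_diff_percent(627.2, 505.9):+.1f}%")
```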

Context Overflow Degradation Test (-cx)

doman@LianLi:/home/samba/llama_bench$ ./llamabench7.1.py -cx 25000 -nt
Power monitor initialized, waiting for hardware readings (1s)...
[WARNING] Sudo privileges missing for accurate CPU tracking. Using CPU_WAT_FALLBACK.
[*] Local API detected on 127.0.0.1:8080. Assuming native host execution.
=========================================================================================================
--- SYSTEM & HARDWARE ---         | --- MEMORY ALLOCATION ---
---------------------------------------------------------------------------------------------------------
Source: Native Host               | VRAM Model: 26784.89 MiB (41/41 layers) [ROCm0: 18470.9 | ROCm1: 8314.0]
System: Linux 6.17.0-20-generic   | KV Cache: 5317.81 MiB [q8_0] [ROCm0: 3722.5 | ROCm1: 1595.3]
Architecture: qwen35moe           | Compute Buf: 5867.65 MiB [ROCm0: 2345.4 | ROCm1: 1511.1 | ROCm_Host: 2011.1]
Detected Port:8080                | SSM/RS State: 125.63 MiB [ROCm0: 87.9 | ROCm1: 37.7]
MoE Experts: 8 active / 256 total | RAM BPE/Meta: 397.85 MiB
MMAP Status: ON                   | RAM Layers: 0.00 MiB (0/41 layers)
---------------------------------------------------------------------------------------------------------
CPU Model: 12th Gen Intel(R) Core(TM) i5-12400F
GPU Drivers: ROCM: 7.2.1.70201, VULKAN: 1.3.275
GPUs: [ROCm0] AMD Radeon Graphics (32732 MiB) [ROCm1] AMD Radeon VII (16340 MiB)
=========================================================================================================
============================================================
--- System Breakdown ---
Server Build: b8651-d3416a4aa
Context Limit: 512,512 tokens
------------------------------------------------------------
Benchmarking: Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K.gguf (5 rounds)
------------------------------------------------------------
Stability: OK.
(Response hidden, use -stw to view)
------------------------------------------------------------
Round 01 PP : 849.28 t/s | PPCT = 0 to 25000 | PPC = 850.16 t/s | TG = 42.29 t/s | TTFT = 29460.21 ms | Gen Time = 3026.67 ms | Tokens = 128
Round 02 PP : 573.64 t/s | PPCT = 25000 to 50000 | PPC = 585.20 t/s | TG = 35.94 t/s | TTFT = 43615.89 ms | Gen Time = 3561.16 ms | Tokens = 128
Round 03 PP : 440.32 t/s | PPCT = 50000 to 75000 | PPC = 449.19 t/s | TG = 31.58 t/s | TTFT = 56822.08 ms | Gen Time = 4053.65 ms | Tokens = 128
Round 04 PP : 356.73 t/s | PPCT = 75000 to 100000 | PPC = 363.92 t/s | TG = 28.00 t/s | TTFT = 70137.19 ms | Gen Time = 4572.01 ms | Tokens = 128
Round 05 PP : 300.44 t/s | PPCT = 100000 to 125000 | PPC = 306.50 t/s | TG = 25.05 t/s | TTFT = 83276.91 ms | Gen Time = 5110.55 ms | Tokens = 128
============================================================
FINAL AVERAGES - Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K.gguf
------------------------------------------------------------
Configured Token Limit (TG): 128
Average Tokens Generated: 128.0 tokens
Average Latency (TTFT): 56662.46 ms
Average Gen Time (TG): 4064.81 ms (4.06 s)
📈 Token Generation summary (tok/s)
benchmark (average): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 32.6
server data:         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 30.5 (-6.5%)
📈 Prompt Processing Cached (PPC) summary (tok/s)
benchmark (average): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 511.0
server data:         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 513.0 (+0.4%)
⚠️ PERFORMANCE WARNING: VRAM Capping Detected (GTT Leak) (+1278.5 MB RAM spiked)
============================================================
Report generated successfully in '/home/samba/llama_bench/llamabench_logs':
- Main Results: result_Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K_ROCm0-ROCm1_ctx512512_linux.json
- Server Logs: serverlog_Qwen3.5-35B-A3B-heretic-v2.i1-Q6_K_ROCm0-ROCm1_ctx512512_linux.json
doman@LianLi:/home/samba/llama_bench$ _