Llama.cpp Runner | Windows Guide

Professional automation and computing cluster management

This is a project for the lazy, created in my free time with Gemini 3.1Pro, "Just For Fun".

SERVER BUILD: b8468-3306dbaef High-Performance Inference Unit

Download Center

• windows_llamacpp.rar (394 MB): full MSVC binaries, CUDA 12 libraries, and RPC server.
• llama-run.ps1 (9.1 KB)
• start-llama.bat (183 B)
• llamabench6.2.py (27.3 KB)

Archive / Other engine configurations

• llama-run2.ps1 (28.03.2026 13:25, 9.48 KB)
• llama-run3.ps1 (28.03.2026 14:05, 5.65 KB)
• llama-run4.ps1 (28.03.2026 15:46, 7.36 KB)
• llama-run5.ps1 (28.03.2026 15:49, 9.51 KB)

⚠️ Important note: AMD VRAM

Unlike Linux, Windows offers no built-in console command that reports free VRAM on AMD cards.

• NVIDIA (CUDA): Free VRAM is detected automatically.
• AMD: Enter the card's capacity (in MB) manually in the $AMD_VRAM_MB variable.
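
The split between automatic NVIDIA detection and the manual AMD value can be sketched as follows. This is a minimal illustration, not the code in llama-run.ps1. It assumes that `nvidia-smi` is on the PATH, and the function names `ConvertFrom-NvidiaSmiFreeMemory` and `Get-LocalFreeVramMB` are hypothetical.

```powershell
# Hypothetical helper: sums per-GPU free memory (MB) from nvidia-smi output lines.
function ConvertFrom-NvidiaSmiFreeMemory {
    param([string[]]$Lines)
    $total = 0
    foreach ($line in $Lines) {
        $value = 0
        # Skip blank or non-numeric lines (for example error text).
        if ([int]::TryParse("$line".Trim(), [ref]$value)) {
            $total += $value
        }
    }
    return $total
}

# Hypothetical helper: NVIDIA is queried automatically, otherwise the manual
# AMD value (the $AMD_VRAM_MB setting) is used as the fallback.
function Get-LocalFreeVramMB {
    param([int]$AmdVramMB = 0)
    $smi = Get-Command nvidia-smi -ErrorAction SilentlyContinue
    if ($smi) {
        # Arguments are quoted so PowerShell passes them through unchanged.
        $lines = & $smi.Path '--query-gpu=memory.free' '--format=csv,noheader,nounits' 2>$null
        $free = ConvertFrom-NvidiaSmiFreeMemory -Lines $lines
        if ($free -gt 0) { return $free }
    }
    return $AmdVramMB
}
```

With `$AMD_VRAM_MB = 16384` and no NVIDIA card present, `Get-LocalFreeVramMB -AmdVramMB $AMD_VRAM_MB` would simply return the manual value.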

RPC Management (MI50)

By default, the script adds the memory of the remote RPC server at 192.168.0.222 to the local VRAM when it calculates the total.

# To disable:
$RPC_TARGETS = ""
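
The effect of an empty $RPC_TARGETS on the total can be illustrated with a small sketch. The function name `Get-ClusterVramMB` is hypothetical, and treating $RPC_VRAM_MB as one combined figure for all remote nodes is an assumption; the real script may count it differently.

```powershell
# Hypothetical helper: total VRAM (MB) the runner would budget against.
function Get-ClusterVramMB {
    param(
        [int]$LocalVramMB,
        [string]$RpcTargets = "",
        [int]$RpcVramMB = 0
    )
    # An empty target list means "local only": remote memory is not counted.
    if ([string]::IsNullOrWhiteSpace($RpcTargets)) {
        return $LocalVramMB
    }
    # Assumption: $RpcVramMB already covers every listed RPC node.
    return $LocalVramMB + $RpcVramMB
}
```

With a 4096 MB local card and the default remote values, the total would be 36848 MB. With `$RPC_TARGETS = ""` it stays at 4096 MB.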

📡 Status Monitoring Logic

The script verifies that the server is actually ready. Instead of closing the window right after starting the process, it polls the HTTP server and reacts to the status code:

• HTTP 503: The model is still loading weights into GPU memory. The script keeps waiting.
• HTTP 200: The model is ready. The benchmark lock is released.
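
The 503/200 logic above can be sketched as a polling loop. This is an illustration under assumptions, not the script's actual code: the `/health` path of llama-server is assumed as the polled endpoint, and the names `Get-ServerStatusCode` and `Wait-ServerReady` are hypothetical. The probe is passed in as a script block so the loop can be tried without a running server.

```powershell
# Hypothetical helper: returns the HTTP status code, or 0 if nothing answers yet.
function Get-ServerStatusCode {
    param([string]$Url)
    try {
        $response = Invoke-WebRequest -Uri $Url -UseBasicParsing -TimeoutSec 5
        return [int]$response.StatusCode
    } catch {
        # Non-2xx answers (such as 503) surface as exceptions; read the code if present.
        $resp = $_.Exception.Response
        if ($resp) { return [int]$resp.StatusCode }
        return 0
    }
}

# Hypothetical helper: polls until the probe reports 200 or the timeout expires.
function Wait-ServerReady {
    param(
        [scriptblock]$Probe,            # must return an HTTP status code
        [int]$MaxWaitSeconds = 600,     # mirrors $MAX_WAIT
        [int]$IntervalSeconds = 1
    )
    $deadline = (Get-Date).AddSeconds($MaxWaitSeconds)
    while ((Get-Date) -lt $deadline) {
        $code = & $Probe
        if ($code -eq 200) { return $true }   # model is ready
        # 503 (still loading) or 0 (not listening yet): keep waiting.
        Start-Sleep -Seconds $IntervalSeconds
    }
    return $false
}

# Example call against the default port (assumed endpoint):
# Wait-ServerReady -Probe { Get-ServerStatusCode -Url "http://127.0.0.1:8081/health" }
```

Returning `$false` on timeout lets the caller decide whether to stop the server process or keep it running for inspection.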

💡 System Tip: The PowerShell console on Windows can appear frozen, for example when "Quick Edit" mode pauses output after a mouse click inside the window. If the progress bar seems stuck, press SPACE or ENTER to make the refresh loop resume immediately.

Variable Configuration (.ps1 File)

Variable	Default Value / Description
$MODEL_PATH	`"C:\llama.cpp\x"` - Folder containing GGUF models.
$LLAMA_SERVER_PATH	`"C:\llama.cpp\build\bin\Release\llama-server.exe"` - Path to the compiled server.
$LOG_FILE	`"$env:USERPROFILE\server.log"` - Server log file in the user's home directory.
$AMD_VRAM_MB	`0` - Manual configuration for AMD GPU.
$RPC_TARGETS	`"192.168.0.222:50052"` - RPC cluster node addresses.
$RPC_VRAM_MB	`32752` - Remote accelerator memory (e.g., MI50).
$PORT	`"8081"` - Listening port. Changed from 8080 to avoid collisions.
$CONTEXT	`"-c 4000"` - Context window limit.
$CACHE_TYPE_K/V	`"q4_0"` - Quantization for K and V cache.
$OVERHEAD_MB	`1536` - VRAM safety margin.
$MAX_WAIT	`600` - Loading timeout (10 minutes for heavy network models).
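
As a rough illustration of how these variables could map onto a llama-server command line, the sketch below builds an argument list from the defaults in the table. The function name `Get-LlamaServerArgs` is hypothetical, and the exact set of flags the script passes is an assumption. The flags themselves (`-m`, `--port`, `-c`, `--cache-type-k`, `--cache-type-v`, `--no-mmap`, `--rpc`) are regular llama-server options.

```powershell
# Hypothetical helper: turns the configuration values into llama-server arguments.
function Get-LlamaServerArgs {
    param(
        [string]$ModelFile,
        [string]$Port = "8081",
        [string]$Context = "-c 4000",
        [string]$CacheTypeK = "q4_0",
        [string]$CacheTypeV = "q4_0",
        [string]$RpcTargets = ""
    )
    $llamaArgs = @("-m", $ModelFile, "--port", $Port)
    # $CONTEXT holds both the flag and its value, so split it into two tokens.
    $llamaArgs += $Context.Trim() -split '\s+'
    $llamaArgs += @("--cache-type-k", $CacheTypeK, "--cache-type-v", $CacheTypeV, "--no-mmap")
    # The RPC flag is added only when at least one target is configured.
    if (-not [string]::IsNullOrWhiteSpace($RpcTargets)) {
        $llamaArgs += @("--rpc", $RpcTargets)
    }
    return $llamaArgs
}
```

The resulting array can be splatted onto the executable, for example `& $LLAMA_SERVER_PATH @serverArgs`, which keeps a model path containing spaces intact as a single argument.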

Physical Path Summary (Tree)

📂 C:\llama.cpp\                             # Main application environment directory
 ┣ 📂 build\
 ┃  ┗ 📂 bin\
 ┃     ┗ 📂 Release\
 ┃        ┗ 🚀 llama-server.exe              # Compiled server (Config: $LLAMA_SERVER_PATH)
 ┣ 📂 x\                                     # GGUF models folder (Config: $MODEL_PATH)
 ┃  ┣ 📦 gpt-oss-20b-Q4_K_M.gguf             # Example model
 ┃  ┗ 📦 Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf # Example model
 ┣ 📜 llama-run.ps1                          # Main script that launches the hybrid cluster
 ┗ 📜 start-llama.bat                        # Launcher shortcut (can be moved to the Desktop)

📂 C:\Users\YourName\                       # Home directory for $env:USERPROFILE variable
 ┗ 📝 server.log                             # Live logs generated by the server (Config: $LOG_FILE)

Safe Shutdown and VRAM Cleaning

The Runner is designed to prevent ghost processes. When you are done, pressing ENTER in the main script window calls Stop-Process on the server, which immediately releases VRAM on all cards in the cluster and closes the RPC connections. The script also starts the server with llama.cpp's --no-mmap flag, which disables memory-mapping of the model file; the intent is to protect the system from memory overflow.
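
The shutdown step can be sketched as below. This is a minimal illustration assuming the server was started with `Start-Process -PassThru`; the function name `Stop-ServerProcess` and the `$serverArgs` variable are hypothetical and not taken from the script.

```powershell
# Hypothetical helper: terminates the server process if it is still running.
function Stop-ServerProcess {
    param([System.Diagnostics.Process]$Process)
    if ($null -eq $Process) { return }
    if (-not $Process.HasExited) {
        # Force termination; the OS and GPU driver then reclaim the process memory.
        Stop-Process -Id $Process.Id -Force -ErrorAction SilentlyContinue
        # Give the process up to five seconds to disappear.
        [void]$Process.WaitForExit(5000)
    }
}

# Possible use in the main script (variables assumed to be defined earlier):
# $server = Start-Process -FilePath $LLAMA_SERVER_PATH -ArgumentList $serverArgs -PassThru
# try {
#     Read-Host ">>> Press [ENTER] to STOP the server and free VRAM"
# } finally {
#     Stop-ServerProcess -Process $server
# }
```

Putting the call in a `finally` block means the cleanup should also run when the prompt is interrupted with CTRL+C, which is one way to avoid leaving a ghost process behind.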

Windows PowerShell

Launching Llama.cpp machine...
[SCAN] Scanning directory C:\llama.cpp\x...
[INFO] Total combined VRAM: 4096 MB
[INFO] Available VRAM: 3962 MB (Context overhead: 1536 MB)
[OK] Models fitting in currently available VRAM:
------------------------------------------------------
   2) Llama-3.2-3B-Instruct-Q4_K_M.gguf                  [  1.9G]
------------------------------------------------------
[WARN] Remaining models (exceed total VRAM):
------------------------------------------------------
   1) gpt-oss-20b-Q4_K_M.gguf                            [ 10.8G] (Out of system memory)
   3) Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf                [ 22.9G] (Out of system memory)
------------------------------------------------------

>>> Select model number (1-3) or press CTRL+C to cancel: 2

[START] Starting server with model: Llama-3.2-3B-Instruct-Q4_K_M.gguf
[WAIT] Loading weights and allocating KV Cache (Max wait: 600 s)...
> Checking context size...

======================================================
 🟢 SERVER IS ONLINE (Port 8081)
======================================================
>>> Press [ENTER] to STOP the server and free VRAM:
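
The split into fitting and non-fitting models in the session above could be produced by a check like the one below. The rule used here (model file size plus $OVERHEAD_MB must not exceed the available VRAM) is an assumption that happens to match the sample output; the script's real rule may differ. The function names `Test-ModelFits` and `Get-ModelFitReport` are hypothetical.

```powershell
# Hypothetical helper: does a model of this size fit next to the safety margin?
function Test-ModelFits {
    param(
        [long]$ModelSizeBytes,
        [int]$AvailableVramMB,
        [int]$OverheadMB = 1536
    )
    # Round the file size up to whole megabytes before comparing.
    $modelMB = [math]::Ceiling($ModelSizeBytes / 1MB)
    return ($modelMB + $OverheadMB) -le $AvailableVramMB
}

# Hypothetical helper: lists every GGUF file in a folder with its size and verdict.
function Get-ModelFitReport {
    param(
        [string]$ModelPath,
        [int]$AvailableVramMB,
        [int]$OverheadMB = 1536
    )
    Get-ChildItem -Path $ModelPath -Filter *.gguf -File | Sort-Object Name | ForEach-Object {
        [pscustomobject]@{
            Name   = $_.Name
            SizeGB = [math]::Round($_.Length / 1GB, 1)
            Fits   = Test-ModelFits -ModelSizeBytes $_.Length -AvailableVramMB $AvailableVramMB -OverheadMB $OverheadMB
        }
    }
}
```

A call such as `Get-ModelFitReport -ModelPath "C:\llama.cpp\x" -AvailableVramMB 3962` would return one object per model, which can then be grouped by the `Fits` property to print the two lists.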