Llama.cpp Runner | Windows Guide

Professional automation and computing cluster management

This is a project for the lazy, created in my free time with Gemini 3.1 Pro, "just for fun".

SERVER BUILD: b8468-3306dbaef High-Performance Inference Unit

Download Center

windows_llamacpp.rar 394 MB

Full MSVC binaries, CUDA 12 libraries, and RPC server.

llama-run.ps1 9.1 KB
start-llama.bat 183 B
llamabench6.2.py 27.3 KB

Archive / Other engine configurations

llama-run2.ps1 28.03.2026 13:25 • 9.48 KB
llama-run3.ps1 28.03.2026 14:05 • 5.65 KB
llama-run4.ps1 28.03.2026 15:46 • 7.36 KB
llama-run5.ps1 28.03.2026 15:49 • 9.51 KB

⚠️ Important note: AMD VRAM

Unlike Linux, Windows has no native console command that reports free VRAM on AMD cards.

NVIDIA (CUDA): Works fully automatically.
AMD: Requires manual entry of capacity in the $AMD_VRAM_MB variable.
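
The split between automatic NVIDIA detection and the manual AMD value can be illustrated with a small Python sketch. It is not the code from llama-run.ps1: the function names are hypothetical, and it assumes that free memory is read through nvidia-smi, with the manually entered value (the role of $AMD_VRAM_MB) used whenever that query fails.

```python
import subprocess


def parse_free_mb(smi_output: str) -> int:
    """Sum per-GPU free memory from nvidia-smi CSV output (one MiB value per line)."""
    return sum(int(line) for line in smi_output.splitlines() if line.strip())


def local_free_vram_mb(amd_vram_mb: int = 0) -> int:
    """Return free NVIDIA VRAM if nvidia-smi answers, else the manual AMD value."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        return parse_free_mb(out)
    except (OSError, subprocess.SubprocessError, ValueError):
        # nvidia-smi missing, failing, or printing non-numeric output:
        # fall back to the manually configured capacity (assumed behavior).
        return amd_vram_mb
```

On a machine without an NVIDIA driver, local_free_vram_mb(16384) simply returns 16384, which mirrors the manual entry described above.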

RPC Management (MI50)

By default, the script adds the memory of the remote RPC server at 192.168.0.222 to the local VRAM.

# To disable:
$RPC_TARGETS = ""
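
As a rough illustration of that sum, here is a Python sketch with hypothetical function names. It assumes that $RPC_TARGETS is a comma-separated list of host:port entries and that every listed node contributes the same $RPC_VRAM_MB; the page only describes the single-server case, so the multi-node rule is a guess.

```python
def parse_rpc_targets(rpc_targets: str) -> list[str]:
    """Split a comma-separated "host:port" list; an empty string disables RPC."""
    return [t.strip() for t in rpc_targets.split(",") if t.strip()]


def total_vram_mb(local_vram_mb: int, rpc_targets: str, rpc_vram_mb: int) -> int:
    """Local VRAM plus rpc_vram_mb for each configured remote node (assumed rule)."""
    return local_vram_mb + rpc_vram_mb * len(parse_rpc_targets(rpc_targets))
```

With the defaults from this page, total_vram_mb(4096, "192.168.0.222:50052", 32752) gives 36848 MB, and an empty target string leaves only the local 4096 MB.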

📡 Status Monitoring Logic

The script checks that the server is actually ready. Instead of closing the window immediately after starting the process, it polls the HTTP server and reacts to the status code:

HTTP 503: The model is still loading weights into GPU memory. The script keeps waiting.
HTTP 200: The model is ready. The benchmark lock is released.
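
A client such as a benchmark script could apply the same 503/200 logic. The following Python sketch is only an illustration: the function names are hypothetical, and the page does not say which URL the script queries, so the /health endpoint of llama-server is an assumption here. The probe, sleep, and clock parameters exist only so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status code for url, or None if nothing is listening yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code      # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return None          # connection refused or timed out


def wait_until_ready(url, max_wait=600.0, interval=1.0,
                     probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll url until it answers 200; return False once max_wait seconds pass."""
    deadline = clock() + max_wait
    while clock() < deadline:
        if probe(url) == 200:
            return True
        # 503 (loading) or None (not listening yet): keep waiting.
        sleep(interval)
    return False
```

A call such as wait_until_ready("http://127.0.0.1:8081/health", max_wait=600) would then block until the model is loaded or the 10-minute limit from $MAX_WAIT expires.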

💡 System Tip: Sometimes the PowerShell console in Windows freezes (e.g., due to "Quick Edit" mode after a mouse click). If the progress bar seems to be stuck, press SPACE or ENTER, which will immediately force the refresh loop to resume.

Variable Configuration (.ps1 File)

Variable             Default value / description
$MODEL_PATH          "C:\llama.cpp\x" - Folder containing GGUF models.
$LLAMA_SERVER_PATH   "C:\llama.cpp\build\bin\Release\llama-server.exe" - Compiled server binary.
$LOG_FILE            "$env:USERPROFILE\server.log" - Live server log in the user's home directory.
$AMD_VRAM_MB         0 - Manual VRAM capacity for an AMD GPU.
$RPC_TARGETS         "192.168.0.222:50052" - RPC cluster node addresses.
$RPC_VRAM_MB         32752 - Remote accelerator memory (e.g., MI50).
$PORT                "8081" - Listening port. Changed from 8080 to avoid collisions.
$CONTEXT             "-c 4000" - Context window limit.
$CACHE_TYPE_K/V      "q4_0" - Quantization for the K and V cache.
$OVERHEAD_MB         1536 - VRAM safety margin.
$MAX_WAIT            600 - Loading timeout in seconds (10 minutes for heavy network models).
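
To show how these variables could map onto a llama-server command line, here is a Python sketch. The function name and the exact argument order are hypothetical and not taken from llama-run.ps1; the flags themselves (-m, --port, -c, --cache-type-k, --cache-type-v, --rpc, --no-mmap) are regular llama-server options.

```python
def build_server_args(
    model_file: str,
    server_path: str = r"C:\llama.cpp\build\bin\Release\llama-server.exe",
    port: str = "8081",
    context: str = "-c 4000",
    cache_type_k: str = "q4_0",
    cache_type_v: str = "q4_0",
    rpc_targets: str = "192.168.0.222:50052",
) -> list[str]:
    """Assemble an argument list from the configuration values (assumed mapping)."""
    args = [server_path, "-m", model_file, "--port", str(port)]
    args += context.split()                      # "-c 4000" -> ["-c", "4000"]
    args += ["--cache-type-k", cache_type_k, "--cache-type-v", cache_type_v]
    args.append("--no-mmap")
    if rpc_targets:                              # empty string disables RPC
        args += ["--rpc", rpc_targets]
    return args
```

Passing rpc_targets="" drops the --rpc option entirely, which matches the "To disable" setting shown earlier.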

Physical Path Summary (Tree)

📂 C:\llama.cpp\                             # Main application environment directory
  📂 build\
    📂 bin\
       📂 Release\
          🚀 llama-server.exe              # Compiled server (Config: $LLAMA_SERVER_PATH)
  📂 x\                                     # GGUF models folder (Config: $MODEL_PATH)
    📦 gpt-oss-20b-Q4_K_M.gguf             # Example model
    📦 Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf # Example model
  📜 llama-run.ps1                          # Logic script loading the hybrid cluster
  📜 start-llama.bat                        # Shortcut to launch GUI (Can be moved to Desktop)

📂 C:\Users\YourName\                       # Home directory for $env:USERPROFILE variable
  📝 server.log                             # Live logs generated by the server (Config: $LOG_FILE)

Safe Shutdown and VRAM Cleaning

The Runner is designed to prevent ghost processes. When you are done, press ENTER in the main script window: the script calls Stop-Process on the server, which immediately releases VRAM on all cards in the cluster and closes the RPC connections. The script also starts the server with llama.cpp's --no-mmap flag, which disables memory-mapping of the model file, to protect the system from memory overflow.
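
The same stop-on-ENTER pattern can be sketched in Python. This is an illustration with hypothetical function names, not a port of the PowerShell logic: it starts a child process, waits for a key press, and always terminates the child afterwards, even if the wait is interrupted with CTRL+C.

```python
import subprocess


def stop_server(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """Terminate the child and wait for it, escalating to kill() if needed."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


def run_until_enter(args, wait_for_enter=input) -> int:
    """Start the server, block until ENTER, then make sure the process is gone."""
    proc = subprocess.Popen(args)
    try:
        wait_for_enter(">>> Press [ENTER] to STOP the server and free VRAM: ")
    finally:
        code = stop_server(proc)   # runs on ENTER and on CTRL+C alike
    return code
```

Doing the cleanup in a finally block is what keeps a ghost process from surviving an interrupted session.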

Windows PowerShell
Launching Llama.cpp machine...
[SCAN] Scanning directory C:\llama.cpp\x...
[INFO] Total combined VRAM: 4096 MB
[INFO] Available VRAM: 3962 MB (Context overhead: 1536 MB)
[OK] Models fitting in currently available VRAM:
------------------------------------------------------
   2) Llama-3.2-3B-Instruct-Q4_K_M.gguf                  [  1.9G]
------------------------------------------------------
[WARN] Remaining models (exceed total VRAM):
------------------------------------------------------
   1) gpt-oss-20b-Q4_K_M.gguf                            [ 10.8G] (Out of system memory)
   3) Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf                [ 22.9G] (Out of system memory)
------------------------------------------------------

>>> Select model number (1-3) or press CTRL+C to cancel: 2

[START] Starting server with model: Llama-3.2-3B-Instruct-Q4_K_M.gguf
[WAIT] Loading weights and allocating KV Cache (Max wait: 600 s)...
> Checking context size...

======================================================
 🟢 SERVER IS ONLINE (Port 8081)
======================================================
>>> Press [ENTER] to STOP the server and free VRAM:
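
The grouping of models in the session above can be approximated with a short Python sketch. The function names are hypothetical, and the rule is an assumption that happens to reproduce the listing: a model counts as fitting when its file size is no larger than the available VRAM minus the overhead margin. The real script may use a different formula.

```python
from pathlib import Path


def scan_models(folder: str) -> list[tuple[str, int]]:
    """Return (file name, size in MB) for every .gguf file, sorted by name."""
    files = sorted(Path(folder).glob("*.gguf"), key=lambda p: p.name.lower())
    return [(p.name, p.stat().st_size // (1024 * 1024)) for p in files]


def classify_models(models, available_mb: int, overhead_mb: int = 1536):
    """Split (name, size_mb) pairs into those that fit and those that do not."""
    budget = available_mb - overhead_mb          # assumed rule, see lead-in
    fitting = [m for m in models if m[1] <= budget]
    too_large = [m for m in models if m[1] > budget]
    return fitting, too_large
```

With 3962 MB available and a 1536 MB margin, the budget is 2426 MB, so a model of about 1.9 GB fits while the 10.8 GB and 22.9 GB files do not.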