Professional automation and computing cluster management
This is a project for the lazy, created in my free time with Gemini 3.1 Pro, "just for fun".
Unlike Linux, Windows has no native console command that reports free VRAM for AMD cards.
• NVIDIA (CUDA): Works fully automatically.
• AMD: Requires manual entry of capacity in the $AMD_VRAM_MB variable.
By default, the script sums the local VRAM with the memory of the remote server 192.168.0.222.
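The summing step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `Get-NvidiaFreeVramMB` and `Get-CombinedVramMB` are hypothetical, and using `nvidia-smi` for the automatic NVIDIA detection is an assumption.

```powershell
# Hypothetical helper: free VRAM of all local NVIDIA cards in MB, or 0 when
# nvidia-smi is not on the PATH (assumption: detection relies on nvidia-smi).
function Get-NvidiaFreeVramMB {
    $smi = Get-Command nvidia-smi -CommandType Application -ErrorAction SilentlyContinue |
        Select-Object -First 1
    if (-not $smi) { return 0 }
    # nvidia-smi prints one number per GPU with these options
    $lines = & $smi.Path --query-gpu=memory.free --format=csv,noheader,nounits
    $sum = 0
    foreach ($line in $lines) {
        $value = 0
        if ([int]::TryParse("$line".Trim(), [ref]$value)) { $sum += $value }
    }
    return $sum
}

# Hypothetical helper: local VRAM (NVIDIA detected, AMD entered by hand)
# plus the memory declared for the remote RPC node.
function Get-CombinedVramMB {
    param(
        [int]$NvidiaMB = 0,
        [int]$AmdMB    = 0,   # corresponds to $AMD_VRAM_MB
        [int]$RpcMB    = 0    # corresponds to $RPC_VRAM_MB
    )
    return $NvidiaMB + $AmdMB + $RpcMB
}

# Example with fixed numbers: a 4096 MB local card plus the 32752 MB remote node
$total = Get-CombinedVramMB -NvidiaMB 4096 -AmdMB 0 -RpcMB 32752
Write-Host "[INFO] Total combined VRAM: $total MB"   # 36848 MB
```

On a machine with an AMD card, `-AmdMB` would take the value entered in `$AMD_VRAM_MB`, since nothing is detected automatically there.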
The program verifies readiness rigorously. Instead of closing the window immediately after starting the process, the script polls the HTTP server until it responds or the $MAX_WAIT timeout elapses.
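A polling loop of this kind might look like the sketch below. It assumes the check targets llama-server's `/health` endpoint; the function name `Wait-ServerReady`, the 2-second interval, and the `-Probe` parameter are hypothetical and only there to illustrate the idea.

```powershell
# Hypothetical helper: polls until the probe reports success or the timeout
# elapses. Returns $true when the server became ready, $false on timeout.
function Wait-ServerReady {
    param(
        [int]$Port = 8081,              # matches $PORT
        [int]$MaxWaitSeconds = 600,     # matches $MAX_WAIT
        [int]$IntervalSeconds = 2,      # assumed polling interval
        [scriptblock]$Probe             # optional override, handy for testing
    )
    if (-not $Probe) {
        # Assumption: llama-server answers GET /health with HTTP 200 once the
        # model is loaded. Invoke-WebRequest throws on connection errors and
        # non-success status codes; the catch turns that into "not ready yet".
        $Probe = {
            param($p)
            try {
                $r = Invoke-WebRequest -Uri "http://127.0.0.1:$p/health" -UseBasicParsing -TimeoutSec 5
                return ($r.StatusCode -eq 200)
            } catch {
                return $false
            }
        }
    }
    $deadline = (Get-Date).AddSeconds($MaxWaitSeconds)
    do {
        if (& $Probe $Port) { return $true }
        Start-Sleep -Seconds $IntervalSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Usage sketch:
# if (Wait-ServerReady -Port 8081 -MaxWaitSeconds 600) { Write-Host "SERVER IS ONLINE" }
```

Passing the probe as a script block keeps the timeout logic testable without a running server.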
💡 System Tip: Sometimes the PowerShell console in Windows freezes (e.g., due to "Quick Edit" mode after a mouse click). If the progress bar seems to be stuck, press SPACE or ENTER, which will immediately force the refresh loop to resume.
| Variable | Default Value | Description |
|---|---|---|
| $MODEL_PATH | "C:\llama.cpp\x" | Folder containing GGUF models. |
| $LLAMA_SERVER_PATH | "C:\llama.cpp\build\bin\Release\llama-server.exe" | Path to the compiled server executable. |
| $LOG_FILE | $env:USERPROFILE\server.log | Live log file written by the server. |
| $AMD_VRAM_MB | 0 | Manual VRAM capacity (MB) for AMD GPUs. |
| $RPC_TARGETS | "192.168.0.222:50052" | RPC cluster node addresses. |
| $RPC_VRAM_MB | 32752 | Remote accelerator memory in MB (e.g., MI50). |
| $PORT | "8081" | Listening port. Changed from 8080 to avoid collisions. |
| $CONTEXT | "-c 4000" | Context window limit. |
| $CACHE_TYPE_K/V | "q4_0" | Quantization for K and V cache. |
| $OVERHEAD_MB | 1536 | VRAM safety margin in MB. |
| $MAX_WAIT | 600 | Loading timeout in seconds (10 minutes for heavy network models). |
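To show how these settings could end up on the llama-server command line, here is a hedged sketch. The helper `New-LlamaServerArgs` is hypothetical, `$CACHE_TYPE_K` and `$CACHE_TYPE_V` are an assumed expansion of the `$CACHE_TYPE_K/V` entry, and the mapping of variables to flags (`-m`, `--port`, `--cache-type-k`, `--cache-type-v`, `--rpc`, `--no-mmap`) is an assumption about the script, although the flags themselves are standard llama-server options.

```powershell
# Defaults copied from the configuration table
$PORT         = "8081"
$CONTEXT      = "-c 4000"
$CACHE_TYPE_K = "q4_0"                  # assumed variable name
$CACHE_TYPE_V = "q4_0"                  # assumed variable name
$RPC_TARGETS  = "192.168.0.222:50052"

# Hypothetical helper: builds the argument array for llama-server.exe
function New-LlamaServerArgs {
    param(
        [string]$ModelFile,
        [string]$Port,
        [string]$Context,
        [string]$CacheTypeK,
        [string]$CacheTypeV,
        [string]$RpcTargets
    )
    $list = @('-m', $ModelFile, '--port', $Port)
    # $CONTEXT already holds the flag and its value ("-c 4000"), so split it
    $list += @($Context -split '\s+' | Where-Object { $_ })
    $list += @('--cache-type-k', $CacheTypeK, '--cache-type-v', $CacheTypeV)
    # Only add the RPC flag when at least one remote node is configured
    if ($RpcTargets) { $list += @('--rpc', $RpcTargets) }
    $list += '--no-mmap'
    return $list
}

$serverArgs = New-LlamaServerArgs -ModelFile 'C:\llama.cpp\x\gpt-oss-20b-Q4_K_M.gguf' `
    -Port $PORT -Context $CONTEXT -CacheTypeK $CACHE_TYPE_K -CacheTypeV $CACHE_TYPE_V `
    -RpcTargets $RPC_TARGETS
Write-Host ($serverArgs -join ' ')
```

Leaving `$RPC_TARGETS` empty drops the `--rpc` flag in this sketch, which would correspond to a purely local run.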
📂 C:\llama.cpp\                                # Main application environment directory
 ┣ 📂 build\
 ┃ ┗ 📂 bin\
 ┃   ┗ 📂 Release\
 ┃     ┗ 🚀 llama-server.exe                    # Compiled server (Config: $LLAMA_SERVER_PATH)
 ┣ 📂 x\                                        # GGUF models folder (Config: $MODEL_PATH)
 ┃ ┣ 📦 gpt-oss-20b-Q4_K_M.gguf                 # Example model
 ┃ ┗ 📦 Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf     # Example model
 ┣ 📜 llama-run.ps1                             # Logic script loading the hybrid cluster
 ┗ 📜 start-llama.bat                           # Shortcut to launch GUI (Can be moved to Desktop)

📂 C:\Users\YourName\                           # Home directory for $env:USERPROFILE variable
 ┗ 📝 server.log                                # Live logs generated by the server (Config: $LOG_FILE)
The Runner is designed to prevent ghost processes. When you are done, pressing ENTER in the main script window calls Stop-Process on the server, which immediately releases VRAM on all cards in the cluster and closes the RPC connections. The script also launches the server with llama.cpp's built-in --no-mmap flag to protect the system from memory overflow.
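The stop-on-ENTER behavior could be structured along these lines. This is a sketch under assumptions, not the script's code: the helper `Stop-ServerProcess` and the 5-second wait are hypothetical, and `$serverArgs` stands for whatever argument list the script builds.

```powershell
# Hypothetical helper: stops the server process if it is still running and
# waits briefly for it to exit. Returns $true when the process is gone.
function Stop-ServerProcess {
    param(
        [System.Diagnostics.Process]$Process,
        [int]$WaitMilliseconds = 5000
    )
    if ($null -eq $Process) { return $true }
    if (-not $Process.HasExited) {
        Stop-Process -Id $Process.Id -Force -ErrorAction SilentlyContinue
        [void]$Process.WaitForExit($WaitMilliseconds)
    }
    return $Process.HasExited
}

# Usage sketch (commented out so the block has no side effects):
# $server = Start-Process -FilePath $LLAMA_SERVER_PATH -ArgumentList $serverArgs -PassThru
# try {
#     Read-Host ">>> Press [ENTER] to STOP the server and free VRAM"
# } finally {
#     # finally also runs when the script is interrupted with CTRL+C,
#     # so the server is not left behind as a ghost process
#     [void](Stop-ServerProcess -Process $server)
# }
```

Wrapping the prompt in `try`/`finally` is one way to make sure the cleanup runs on every exit path, not only after ENTER.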
Launching Llama.cpp machine...
[SCAN] Scanning directory C:\llama.cpp\x...
[INFO] Total combined VRAM: 4096 MB
[INFO] Available VRAM: 3962 MB (Context overhead: 1536 MB)
[OK] Models fitting in currently available VRAM:
------------------------------------------------------
2) Llama-3.2-3B-Instruct-Q4_K_M.gguf [ 1.9G]
------------------------------------------------------
[WARN] Remaining models (exceed total VRAM):
------------------------------------------------------
1) gpt-oss-20b-Q4_K_M.gguf [ 10.8G] (Out of system memory)
3) Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf [ 22.9G] (Out of system memory)
------------------------------------------------------
>>> Select model number (1-3) or press CTRL+C to cancel: 2
[START] Starting server with model: Llama-3.2-3B-Instruct-Q4_K_M.gguf
[WAIT] Loading weights and allocating KV Cache (Max wait: 600 s)...
> Checking context size...
======================================================
🟢 SERVER IS ONLINE (Port 8081)
======================================================
>>> Press [ENTER] to STOP the server and free VRAM:
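The split into fitting and non-fitting models seen in this session could be computed as in the sketch below. The exact rule the script applies is not documented here, so the rule used (file size plus `$OVERHEAD_MB` must not exceed the available VRAM) is an assumption, and `Get-ModelFit` is a hypothetical name. The file sizes are made-up values matching the sample output.

```powershell
# Hypothetical helper: numbers the models and tags each one with whether it
# fits. Assumed rule: file size + safety margin <= available VRAM.
function Get-ModelFit {
    param(
        [object[]]$Models,       # objects with Name and Length (bytes), e.g. from Get-ChildItem
        [int]$AvailableMB,
        [int]$OverheadMB = 1536  # matches $OVERHEAD_MB
    )
    $index = 0
    foreach ($m in $Models) {
        $index++
        $sizeMB = [math]::Ceiling($m.Length / 1MB)
        [pscustomobject]@{
            Number = $index
            Name   = $m.Name
            SizeGB = [math]::Round($m.Length / 1GB, 1)
            Fits   = (($sizeMB + $OverheadMB) -le $AvailableMB)
        }
    }
}

# In the script the list would come from something like:
#   Get-ChildItem -Path $MODEL_PATH -Filter *.gguf | Sort-Object Name
$models = @(
    [pscustomobject]@{ Name = 'gpt-oss-20b-Q4_K_M.gguf';             Length = 10.8GB },
    [pscustomobject]@{ Name = 'Llama-3.2-3B-Instruct-Q4_K_M.gguf';   Length = 1.9GB },
    [pscustomobject]@{ Name = 'Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf'; Length = 22.9GB }
)

$report = @(Get-ModelFit -Models $models -AvailableMB 3962)
$report | Where-Object { $_.Fits }       | ForEach-Object { "{0}) {1}" -f $_.Number, $_.Name }
$report | Where-Object { -not $_.Fits }  | ForEach-Object { "{0}) {1} (Out of system memory)" -f $_.Number, $_.Name }
```

Numbering the models before filtering keeps the selection numbers stable across both lists, as in the sample session where the fitting model is listed as 2).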