Professional Multi-GPU Computing Cluster Automation
Based on a dynamic Bash shell script with RPC support
Main Bash script. Manages memory, backend selection interface (CUDA/ROCm/Vulkan), and server flags.
On Linux systems, the script calculates available memory fully autonomously. It utilizes native tools for each environment:
• NVIDIA (CUDA): Direct readout from nvidia-smi.
• AMD (ROCm): Direct readout from rocm-smi.
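The NVIDIA readout can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `sum_mb` and `nvidia_free_mb` are hypothetical, and only the `nvidia-smi` query flags are standard. A ROCm variant would need its own parsing of `rocm-smi` output, which is not shown here.

```shell
# Sum one integer (MiB) per input line; blank lines are ignored.
sum_mb() {
    awk 'NF { total += $1 } END { print total + 0 }'
}

# Free VRAM summed across all NVIDIA GPUs, in MiB.
# Prints 0 when nvidia-smi is not installed.
nvidia_free_mb() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One line per GPU, numbers only (no header, no "MiB" suffix).
        nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | sum_mb
    else
        echo 0
    fi
}
```

Summing per-GPU values is what makes a multi-card machine appear as a single VRAM pool when deciding which models fit.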
The RPC server flag is injected automatically during startup. Set the RPC_TARGETS variable (see the table below) to add cluster nodes; leaving it empty disables the injection.
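One way such conditional injection might look is sketched below. The helper name `rpc_args` and the example addresses are assumptions made for illustration; `--rpc` is the llama.cpp flag that takes a comma-separated list of RPC servers.

```shell
# Hypothetical example value: "192.168.1.10:50052,192.168.1.11:50052"
RPC_TARGETS="${RPC_TARGETS:-}"

# Print the extra server arguments, or nothing when no targets are set.
rpc_args() {
    if [ -n "$RPC_TARGETS" ]; then
        printf '%s %s\n' "--rpc" "$RPC_TARGETS"
    fi
}

# Usage (relies on word splitting, so the target list must not contain spaces):
#   "$LLAMA_SERVER_PATH" $(rpc_args) ...
```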
The script daemonizes the server process with nohup, then polls it over HTTP until it reports ready, so the console is released only once the model is fully loaded into VRAM.
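A readiness loop of this kind could look roughly like the sketch below. It assumes the check is made against llama-server's `/health` endpoint, which answers with HTTP 200 once the model is loaded; the function names and the overridable `PROBE` hook are hypothetical and exist only so the loop can be exercised without a running server.

```shell
MAX_WAIT="${MAX_WAIT:-120}"       # seconds, as in the configuration table
PROBE="${PROBE:-probe_health}"    # hypothetical hook: command used as the probe

# Succeeds only when the server on port $1 answers /health with a 2xx status.
probe_health() {
    curl -sf -o /dev/null "http://127.0.0.1:$1/health"
}

# Poll once per second until the probe succeeds or MAX_WAIT seconds elapse.
wait_until_ready() {
    local port=$1 waited=0
    while [ "$waited" -lt "$MAX_WAIT" ]; do
        if "$PROBE" "$port"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Usage sketch:
#   nohup "$LLAMA_SERVER_PATH" ... > "$HOME/server8081.log" 2>&1 &
#   wait_until_ready 8081 || kill "$!"
```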
To follow a server's log live, run ./llama-run.py logs<port>.
Before starting, the script closes any previous llama-server instance on the selected port, releasing VRAM.
| Variable | Default Value / Description |
|---|---|
| MODEL_PATH | "/home/models/gguf" - Location of models on the server. |
| LLAMA_SERVER_PATH | "/home/doman/llama.cpp/bin/llama-server" - Binary build path. |
| CONTEXT | "-c 4000" - Context window size in tokens, passed to the server as the -c flag. |
| CACHE_TYPE_K / CACHE_TYPE_V | "q4_0" - Quantization type for the K and V cache (context memory). |
| OVERHEAD_MB | 1536 - Strict VRAM safety margin (buffer). |
| MAX_WAIT | 120 - Wait time in seconds before killing a hung process. |
| BACKEND | "" - Leave empty to be prompted for the backend in the console. |
| RPC_TARGETS | "" - Empty disables the --rpc flag injection. |
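To show how these variables might map onto llama-server flags, the sketch below prints a command line for inspection. The helper `build_server_cmd` is hypothetical, and the exact flag set the script passes is an assumption; `-m`, `--port`, `-c`, `--cache-type-k` and `--cache-type-v` are existing llama-server options.

```shell
# Defaults taken from the table; values already set in the environment win.
: "${MODEL_PATH:=/home/models/gguf}"
: "${LLAMA_SERVER_PATH:=/home/doman/llama.cpp/bin/llama-server}"
: "${CONTEXT:=-c 4000}"
: "${CACHE_TYPE_K:=q4_0}"
: "${CACHE_TYPE_V:=q4_0}"

# build_server_cmd MODEL_FILE PORT
# Prints the command line that would be run (for inspection, not execution).
build_server_cmd() {
    printf '%s -m %s --port %s %s --cache-type-k %s --cache-type-v %s\n' \
        "$LLAMA_SERVER_PATH" "$MODEL_PATH/$1" "$2" \
        "$CONTEXT" "$CACHE_TYPE_K" "$CACHE_TYPE_V"
}
```

Note that CONTEXT carries the flag together with its value, so it is meant to be expanded as two words when the command is actually executed.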
    📂 /home/doman/  # User home directory
    ┣ 📂 llama.cpp/build-master/bin/
    ┃ ┗ 🚀 llama-server  # Compiled C++ server binaries
    ┣ 📂 start_llama/
    ┃ ┗ 📜 llama-run.py  # Automation Bash script (with +x)
    ┣ 📝 server.log  # First instance live logs
    ┗ 📝 server8081.log  # Live logs for subsequent ports

    📂 /home/models/gguf/  # Weight file repository
    ┣ 📂 mmproj/  # Folder for vision libraries
    ┃ ┗ 👁️ mmproj-FINAL-Bench_Darwin.gguf
    ┣ 📦 Bielik-11B-v3.0-Instruct.Q4_K_M.gguf
    ┗ 📦 FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf
To release VRAM and shut the cluster down cleanly after finishing work, run ./llama-run.py stop (for a single instance) or ./llama-run.py stopall (to force-close all active processes). The script sends a termination signal with pkill, targeting only the respective server instances.
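Targeted termination with pkill can be sketched as a mapping from subcommand to a process-matching pattern. Everything here is an assumption for illustration: the helper name `stop_pattern`, the idea that each server was started with a literal `--port <n>` argument, and the choice of 8080 as the port meant by a bare `stop`.

```shell
# Map a subcommand to the extended regex handed to `pkill -f`.
stop_pattern() {
    case "$1" in
        stopall)    printf '%s\n' 'llama-server' ;;
        stop)       printf '%s\n' 'llama-server.*--port 8080( |$)' ;;  # assumed default port
        stop[0-9]*) printf 'llama-server.*--port %s( |$)\n' "${1#stop}" ;;
        *)          return 1 ;;
    esac
}

# Usage sketch:
#   pattern=$(stop_pattern "$1") && pkill -f "$pattern"
```

The trailing `( |$)` keeps a pattern for port 8081 from also matching a longer port number that merely starts with the same digits.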
• Runs multiple llama.cpp servers simultaneously.
• The first instance logs to server.log. Each subsequent instance generates its own log file with the port number, e.g., server8081.log.
• Detects mmproj libraries corresponding to the main model. The weight of the vision module is dynamically added to the overall VRAM requirement. The user decides whether to load it via an interactive prompt.
• […] (PARALLEL).
• stop<port> (e.g., stop8080) closes only the server on the specified port.
• logs<port> (e.g., logs8081) shows a live log preview for the selected server.
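The VRAM bookkeeping described above amounts to simple arithmetic, sketched below. The function name `fits_in_vram` is hypothetical, and treating OVERHEAD_MB as an amount added to the requirement (rather than subtracted from the free figure beforehand) is an assumption about how the script applies the margin.

```shell
OVERHEAD_MB="${OVERHEAD_MB:-1536}"   # safety margin from the configuration table

# fits_in_vram FREE_MB MODEL_MB [MMPROJ_MB]
# Succeeds when model + optional vision module + margin fit into free VRAM.
fits_in_vram() {
    local free_mb=$1 model_mb=$2 mmproj_mb=${3:-0}
    [ $(( model_mb + mmproj_mb + OVERHEAD_MB )) -le "$free_mb" ]
}

# Usage sketch with round figures (16 GiB model, roughly 0.9 GiB vision module):
#   fits_in_vram 27430 16384 922 && echo "fits"
```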
For the script to correctly detect and load the vision module, the library file must be located in a dedicated mmproj subdirectory inside your main model directory.
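A lookup along these lines might be written as below. The matching rule is an assumption inferred from the example file names (the part of the projector name after `mmproj-` is a prefix of the model file name); the function name `find_mmproj` is hypothetical.

```shell
# find_mmproj MODEL_DIR MODEL_FILE
# Prints the path of the first matching projector file, or returns 1.
find_mmproj() {
    local model_dir=$1 model_file=$2 candidate stem
    for candidate in "$model_dir"/mmproj/mmproj-*.gguf; do
        [ -f "$candidate" ] || continue          # glob matched nothing
        stem=$(basename "$candidate" .gguf)      # mmproj-FINAL-Bench_Darwin
        stem=${stem#mmproj-}                     # FINAL-Bench_Darwin
        case "$model_file" in
            "$stem"*) printf '%s\n' "$candidate"; return 0 ;;
        esac
    done
    return 1
}

# Usage sketch:
#   find_mmproj /home/models/gguf FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf
```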
    doman@LianLi:~/start_llama$ ./llama-run.py
    🔍 Checking ports (8080-8085)...
    🌐 Found server on port(s): 8080
    👉 Enter port to update server or start new [8081]: 8081
    🔧 Select environment (Backend):
    1) CUDA (NVIDIA)
    2) ROCm (AMD)
    3) Vulkan (Combined GPU / All available in system)
    👉 Your choice (1-3): 2
    🧙‍♂️ Scanning directory /home/models/gguf...
    📊 Total summed VRAM: 32752 MB
    📊 Available VRAM: 27430 MB (Context buffer: 1536 MB)
    ✅ Models fitting in currently free VRAM:
    ------------------------------------------------------
    1) Bielik-11B-v3.0-Instruct.Q4_K_M.gguf [ 6,3G]
    3) Bielik-4.5B-v3.0-Instruct.Q8_0.gguf [ 4,8G]
    11) NVIDIA-Nemotron3-Nano-4B-Q4_K_M.gguf [ 2,7G] (Currently running on 8080)
    15) FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf [ 16G] [vision ✔ 0.9GB]
    5) GLM-4.7-Flash-Q8_0.gguf [ 30G] (Requires freeing VRAM)
    ------------------------------------------------------
    👉 Choose model number (1-19) or press CTRL+C to cancel: 15
    👉 VL library detected (mmproj-FINAL-Bench_Darwin.gguf). Load vision module? [Y/n]: Y
    ✅ Vision module will be loaded.
    🧹 Closing previous llama-server instance on port 8081...
    🚀 Starting server with model: FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf on port 8081
    ⏳ Waiting for model load and VRAM buffer allocation...
    > Verifying context size...
    ✅ Server started with model FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf
    ✅ Server is running in background on port 8081. Context tokens set correctly.
    📄 Logs saved to: /home/doman/server8081.log
    doman@LianLi:~/start_llama$