Llama.cpp Runner | Linux Guide

Professional Multi-GPU Computing Cluster Automation

Based on a dynamic Bash shell script with RPC support

ENVIRONMENT: Linux (Bash) High-Performance Inference Unit

Download Center

llama-run.py 17 KB

Main Bash script. Manages memory, backend selection interface (CUDA/ROCm/Vulkan), and server flags.

Archive / Additional Modules

• llama-run_old.py (03.29, 9.5 KB)
• llama-run-pl.py (04.03, 17 KB)
• mi50.py (03.29, 8.2 KB)
• rtx.sh (03.29, 4.2 KB)
• hostrpc.sh (03.29, 2.9 KB)

Automatic VRAM Detection

On Linux, the script calculates available VRAM on its own, with no manual input. It reads the figures from each vendor's native tool:

• NVIDIA (CUDA): Direct readout from nvidia-smi.
• AMD (ROCm): Direct readout from rocm-smi.
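
A minimal sketch of how the NVIDIA readout could be turned into a usable figure, assuming the script sums free memory across all cards and then subtracts OVERHEAD_MB. The function names are hypothetical and are not taken from llama-run.py. The rocm-smi path would need analogous parsing and is not shown.

```shell
OVERHEAD_MB=1536   # safety margin, same default as in the variable table

# Sum one integer per line (MiB values) read from stdin.
sum_mb() {
  awk '{ s += $1 } END { print s + 0 }'
}

# Free VRAM across all NVIDIA GPUs, in MiB (one line per card from nvidia-smi).
nvidia_free_mb() {
  nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | sum_mb
}

# Subtract the safety margin from a free-memory figure, never going below zero.
usable_mb() {
  local free="$1" usable
  usable=$(( free - OVERHEAD_MB ))
  [ "$usable" -lt 0 ] && usable=0
  echo "$usable"
}

# Example: usable_mb "$(nvidia_free_mb)"
```

Keeping the summing and the margin arithmetic in separate functions means the same logic can be fed from either vendor tool.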

RPC Management

When RPC_TARGETS is non-empty, the script automatically adds the --rpc flag at startup. Set the variable below to a comma-separated list of host:port entries to add cluster nodes.

# e.g., "127.0.0.1:50052,192.168.1.10:50052"
RPC_TARGETS=""
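
One way the injection could work, sketched under the assumption that the list is passed unchanged to llama-server's --rpc option. rpc_flag is a hypothetical helper name, not a function from the script.

```shell
RPC_TARGETS=""   # e.g. "127.0.0.1:50052,192.168.1.10:50052"

# Print the extra server arguments, or nothing when no targets are set.
rpc_flag() {
  local targets="$1"
  if [ -n "$targets" ]; then
    printf '%s %s\n' "--rpc" "$targets"
  fi
}

# Example: rpc_flag "$RPC_TARGETS"   (prints nothing with the default empty value)
```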

📡 Status Monitoring Logic

The script daemonizes the server process with nohup, then polls it over HTTP until it is ready, so the console is released only after the model is fully loaded into VRAM:

No Response (Waiting)
Server is allocating weights and K/V buffer on the fly. A curl loop is active.

HTTP 200 on Port
Model is loaded. Live preview: ./llama-run.py logs<port>.

Automated Lifecycle
Selecting a new model on the same port automatically terminates the previous llama-server instance, releasing VRAM before startup.
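
The waiting phase can be sketched as a bounded polling loop. This is an illustration, not the script's actual code: it assumes the check targets llama-server's /health endpoint on localhost, and probe and wait_ready are hypothetical names.

```shell
MAX_WAIT=120   # seconds, same default as in the variable table

# Succeeds only when the server answers the health check with HTTP 200.
probe() {
  local port="$1" code
  code=$(curl -s -o /dev/null -m 2 -w '%{http_code}' \
    "http://127.0.0.1:${port}/health") || return 1
  [ "$code" = "200" ]
}

# Poll once per second; give up after max_wait seconds.
wait_ready() {
  local port="$1" max_wait="${2:-$MAX_WAIT}" waited=0
  until probe "$port"; do
    [ "$waited" -ge "$max_wait" ] && return 1
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Typical use after starting the server in the background:
#   nohup "$LLAMA_SERVER_PATH" ... > "$HOME/server.log" 2>&1 &
#   wait_ready 8080 || echo "server not ready after ${MAX_WAIT}s"
```

A non-zero return from wait_ready is the point where the caller would kill the hung process, as described for MAX_WAIT.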

Variable Configuration (Script File)

Variable	Default Value / Description
MODEL_PATH	`"/home/models/gguf"` - Location of models on the server.
LLAMA_SERVER_PATH	`"/home/doman/llama.cpp/bin/llama-server"` - Binary build path.
CONTEXT	`"-c 4000"` - Defines the hard limit for the token buffer.
CACHE_TYPE_K / CACHE_TYPE_V	`"q4_0"` - Quantization type for the K and V cache (context memory).
OVERHEAD_MB	`1536` - Strict VRAM safety margin (buffer).
MAX_WAIT	`120` - Wait time in seconds before killing a hung process.
BACKEND	`""` - Leave empty to be prompted for the backend in the console.
RPC_TARGETS	`""` - Empty disables the `--rpc` flag injection.
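
To show how these variables might map onto a server command line, here is a hedged sketch using the defaults from the table. build_cmd is a hypothetical helper; the real script may assemble its arguments differently and adds further flags (backend, RPC, vision) that are left out here.

```shell
MODEL_PATH="/home/models/gguf"
LLAMA_SERVER_PATH="/home/doman/llama.cpp/bin/llama-server"
CONTEXT="-c 4000"
CACHE_TYPE_K="q4_0"
CACHE_TYPE_V="q4_0"

# Print the command line for one model and port, one argument per line.
build_cmd() {
  local model="$1" port="$2"
  set -- "$LLAMA_SERVER_PATH" -m "$MODEL_PATH/$model" --port "$port"
  # CONTEXT holds a flag plus its value, so it is left unquoted on purpose
  # to split into two separate words.
  set -- "$@" $CONTEXT --cache-type-k "$CACHE_TYPE_K" --cache-type-v "$CACHE_TYPE_V"
  printf '%s\n' "$@"
}

# Example: build_cmd Bielik-11B-v3.0-Instruct.Q4_K_M.gguf 8080
```

Building the argument list with set -- keeps paths containing spaces intact, which a single concatenated string would not.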

Physical Path Summary (Tree)

📂 /home/doman/                             # User home directory
 ┣ 📂 llama.cpp/build-master/bin/
 ┃  ┗ 🚀 llama-server                  # Compiled C++ server binaries
 ┣ 📂 start_llama/
 ┃  ┗ 📜 llama-run.py                  # Automation Bash script (with +x)
 ┣ 📝 server.log                         # First instance live logs
 ┗ 📝 server8081.log                     # Live logs for subsequent ports

📂 /home/models/gguf/                        # Weight file repository
 ┣ 📂 mmproj/                            # Folder for vision libraries
 ┃  ┗ 👁️ mmproj-FINAL-Bench_Darwin.gguf
 ┣ 📦 Bielik-11B-v3.0-Instruct.Q4_K_M.gguf
 ┗ 📦 FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf

Safe Server Shutdown

To release VRAM and safely close the virtual cluster cards after finishing work, run ./llama-run.py stop (for a single instance) or ./llama-run.py stopall (to force-close all active instances). The script sends a termination signal with pkill, targeting only the matching llama-server processes.
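
A sketch of how per-port targeting with pkill could look, assuming each server was started with "--port <number>" on its command line. stop_pattern and stop_port are hypothetical names, and the script's real matching rule is not documented here.

```shell
# Build an extended regex that matches one server's command line by port.
# The trailing "( |$)" keeps port 8081 from also matching 80811.
stop_pattern() {
  local port="$1"
  printf '%s\n' "llama-server.*--port ${port}( |\$)"
}

# Send the default termination signal (SIGTERM) to the matching process.
stop_port() {
  pkill -f "$(stop_pattern "$1")"
}

# Example: stop_port 8081
```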

Update 1: Support for Multiple Instances, Vision Models, and Dynamic Ports

New Features and Changes

Multiple Instance Support: The script scans a defined port range (default 8080-8085) and allows running multiple independent llama.cpp servers simultaneously.
Extended Log System: The first instance creates a default server.log. Each subsequent instance generates its own log file with the port number, e.g., server8081.log.
Vision Module Detection (Multimodal): The script automatically searches for an mmproj library that corresponds to the main model. The size of the vision module is added to the overall VRAM requirement. The user decides whether to load it via an interactive prompt.
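
The port scan and log naming described above could be sketched as follows. The names PORT_MIN, PORT_MAX, port_in_use, first_free_port, and log_file_for_port are hypothetical, and the sketch assumes that the first port in the range is the one that maps to the default server.log.

```shell
PORT_MIN=8080
PORT_MAX=8085

# True when something accepts TCP connections on the port.
# Relies on Bash's /dev/tcp pseudo-device, so it needs Bash at run time.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print the first port in the range with no listener; fail if all are taken.
first_free_port() {
  local port="$PORT_MIN"
  while [ "$port" -le "$PORT_MAX" ]; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
    port=$(( port + 1 ))
  done
  return 1
}

# Map a port to its log file: server.log for the first port,
# server<port>.log for the others.
log_file_for_port() {
  local port="$1"
  if [ "$port" -eq "$PORT_MIN" ]; then
    echo "$HOME/server.log"
  else
    echo "$HOME/server${port}.log"
  fi
}
```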

New Control Commands

help - displays commands and variables (including the new PARALLEL).
stop - closes the instance (if only 1 is running) or lists active ones.
stopall - force-closes all instances and frees VRAM.
stop<port> - (e.g., stop8080) precisely closes the server on the specified port.
logs<port> - (e.g., logs8081) live log preview for the selected server.
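
Commands such as stop8080 and logs8081 carry the port inside the argument itself. A minimal sketch of how such arguments could be split, assuming a plain case dispatch; parse_command is a hypothetical name and the output format is chosen only for illustration.

```shell
# Print "<action> [port]" for a command-line argument.
parse_command() {
  local arg="$1"
  case "$arg" in
    help|stop|stopall) echo "$arg" ;;
    stop[0-9]*)        echo "stop ${arg#stop}" ;;   # stop8080 -> "stop 8080"
    logs[0-9]*)        echo "logs ${arg#logs}" ;;   # logs8081 -> "logs 8081"
    "")                echo "run" ;;                # no argument: interactive start
    *)                 echo "unknown" ;;
  esac
}
```

The exact words are listed before the digit patterns so that stop and stopall are never mistaken for a port command.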

Vision Model Requirements (Important)

For the script to correctly detect and load the vision module, the library file must be located in a dedicated mmproj subdirectory inside your main model directory.

/home/models/gguf/ ➔ Main models
/home/models/gguf/mmproj/ ➔ Vision libraries (e.g., mmproj-model.gguf)
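
How the script pairs a vision library with a model is not spelled out. The sketch below assumes a prefix rule: the text between "mmproj-" and ".gguf" must be the beginning of the model's filename, as with mmproj-FINAL-Bench_Darwin.gguf and FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf. find_mmproj and size_mb are hypothetical helpers, and size_mb relies on GNU stat.

```shell
# Print the path of the first mmproj file whose stem prefixes the model name.
find_mmproj() {
  local model_dir="$1" model_file="$2" f stem
  for f in "$model_dir"/mmproj/mmproj-*.gguf; do
    [ -e "$f" ] || continue            # glob matched nothing
    stem=$(basename "$f" .gguf)
    stem=${stem#mmproj-}
    case "$model_file" in
      "$stem"*) echo "$f"; return 0 ;;
    esac
  done
  return 1
}

# File size in MiB, rounded up, for adding to the VRAM requirement.
size_mb() {
  echo $(( ($(stat -c %s "$1") + 1048575) / 1048576 ))
}

# Example:
#   mm=$(find_mmproj /home/models/gguf FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf) \
#     && echo "vision module needs about $(size_mb "$mm") MB"
```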

doman@LianLi:~/start_llama$ ./llama-run.py
🔍 Checking ports (8080-8085)...
🌐 Found server on port(s): 8080

👉 Enter port to update server or start new [8081]: 8081
🔧 Select environment (Backend):
  1) CUDA (NVIDIA)
  2) ROCm (AMD)
  3) Vulkan (Combined GPU / All available in system)
👉 Your choice (1-3): 2

🧙‍♂️ Scanning directory /home/models/gguf...
📊 Total summed VRAM: 32752 MB
📊 Available VRAM: 27430 MB (Context buffer: 1536 MB)

✅ Models fitting in currently free VRAM:
------------------------------------------------------
   1) Bielik-11B-v3.0-Instruct.Q4_K_M.gguf               [ 6,3G] 
   3) Bielik-4.5B-v3.0-Instruct.Q8_0.gguf                 [ 4,8G] 
  11) NVIDIA-Nemotron3-Nano-4B-Q4_K_M.gguf                [ 2,7G] (Currently running on 8080)
  15) FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf                [  16G] [vision ✔ 0.9GB]
   5) GLM-4.7-Flash-Q8_0.gguf                             [  30G] (Requires freeing VRAM)
------------------------------------------------------

👉 Choose model number (1-19) or press CTRL+C to cancel: 15

👉 VL library detected (mmproj-FINAL-Bench_Darwin.gguf). Load vision module? [Y/n]: Y
✅ Vision module will be loaded.

🧹 Closing previous llama-server instance on port 8081...
🚀 Starting server with model: FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf on port 8081
⏳ Waiting for model load and VRAM buffer allocation...

> Verifying context size...
✅ Server started with model FINAL-Bench_Darwin-35B-A3B-Q8_0.gguf
✅ Server is running in background on port 8081. Context tokens set correctly.
📄 Logs saved to: /home/doman/server8081.log
doman@LianLi:~/start_llama$