Node Sizing
This guide helps you determine how many GPUs you need, how the VRAM budget is allocated, and how shards are distributed across nodes.
VRAM Requirements by Model
All figures assume Q4_K_M quantization (~4.5 bits per weight). VRAM shown is the minimum required after applying the 85% fill factor. Practical recommendation: add 20% headroom for KV cache growth under concurrent load.
| Model | Parameters | Weights (Q4_K_M) | Min nodes × 24 GB | Min nodes × 16 GB |
|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | 0.7 GB | 1 | 1 |
| Llama 3 8B | 8B | 4.9 GB | 1 | 1 |
| Mistral 7B | 7B | 4.4 GB | 1 | 1 |
| Llama 3 70B | 70B | 38.5 GB | 2 | 3 |
| Mixtral 8×7B (MoE) | 46.7B | 25.7 GB | 2 | 2 |
| Llama 3 405B | 405B | 223 GB | 11 | 17 |
How to Calculate
weights_gb = parameters × 0.55 (Q4_K_M ≈ 4.5 bits per weight)
usable_vram = total_vram_gb × 0.85 (VRAM_FILL_FACTOR, 15% reserved for KV + overhead)
min_nodes = ceil(weights_gb / usable_vram_per_node)
For Llama 3 70B across 2× RTX 4090 (24 GB each):
weights_gb = 70 × 0.55 = 38.5 GB
usable_vram = 2 × 24 × 0.85 = 40.8 GB
38.5 GB < 40.8 GB ✓ (fits with 2 nodes, 2.3 GB headroom)
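The calculation above can be sketched as a small helper. This is a convenience script, not part of any PanGalactic API; the 0.55 and 0.85 constants come directly from the formulas above:

```python
import math

Q4_K_M_GB_PER_B_PARAMS = 0.55  # Q4_K_M ≈ 4.5 bits per weight
VRAM_FILL_FACTOR = 0.85        # 15% reserved for KV cache + overhead

def min_nodes(params_billions: float, vram_per_node_gb: float) -> int:
    """Minimum node count for a Q4_K_M model at the 85% fill factor."""
    weights_gb = params_billions * Q4_K_M_GB_PER_B_PARAMS
    usable_per_node_gb = vram_per_node_gb * VRAM_FILL_FACTOR
    return math.ceil(weights_gb / usable_per_node_gb)

# Llama 3 70B: 38.5 GB of weights, 20.4 GB usable per 24 GB card
print(min_nodes(70, 24))  # 2
print(min_nodes(70, 16))  # 3
```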
VRAM_FILL_FACTOR Explained
The Hub reserves 15% of each node's VRAM for:
- KV cache pages — PagedAttention blocks for in-flight requests
- Framework overhead — PyTorch/llama-cpp runtime, CUDA context
- Headroom buffer — prevents OOM during burst load
For a 24 GB card: 24 × 0.85 = 20.4 GB usable. The remaining 3.6 GB is the budget for the KV cache and runtime overhead.
KV cache capacity at default settings (block_size=16, max_blocks=2048):
KV per block ≈ 16 tokens × num_kv_heads × head_dim × 2 (K+V) × 2 bytes (FP16) × layers_on_node
≈ 16 × 8 × 128 × 2 × 2 = 64 KB per block per layer (Llama 3 70B: 8 KV heads, head_dim 128)
For Llama 3 70B with 20 layers on a node: ~1.25 MB per block; 2048 blocks ≈ 2.5 GB, within the 3.6 GB reserve.
This accommodates ~2000 concurrent requests at an average of 16 tokens of KV history each — more than sufficient for most workloads.
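The block arithmetic, written out (illustrative only, assuming the Llama 3 70B shape of 8 KV heads and head_dim 128; this is not the Hub's actual accounting code):

```python
def kv_block_bytes(block_tokens: int = 16, num_kv_heads: int = 8,
                   head_dim: int = 128, layers_on_node: int = 20,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per PagedAttention block (K and V tensors, FP16)."""
    return (block_tokens * num_kv_heads * head_dim
            * 2                 # K and V
            * bytes_per_elem    # FP16
            * layers_on_node)

per_block = kv_block_bytes()        # 1,310,720 bytes ≈ 1.25 MiB
total = per_block * 2048            # max_blocks = 2048
print(per_block / 2**20)            # 1.25
print(total / 2**30)                # 2.5
```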
Heterogeneous Nodes
PanGalactic supports mixing different GPU models in the same cluster. The shard assignment algorithm allocates layers proportionally to available VRAM:
node_0 (RTX 4090, 24 GB): fraction = 24 / (24 + 16 + 16) = 43% → ~34 of 80 layers
node_1 (RTX 3060, 16 GB): fraction = 16 / 56 = 29% → ~23 layers
node_2 (RTX 3060, 16 GB): fraction = 16 / 56 = 29% → ~23 layers
Nodes with more VRAM — typically also the faster GPUs — receive more layers and therefore more compute work. This is the correct behaviour: pipeline throughput is set by the slowest stage, so giving a small, slow node an equal share would stall the entire pipeline.
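One way to reproduce this proportional split is largest-remainder rounding, so every layer is assigned exactly once. A sketch — the Hub's actual shard-assignment algorithm may break ties differently:

```python
def assign_layers(vram_gb: list[float], total_layers: int) -> list[int]:
    """Split total_layers across nodes in proportion to each node's VRAM."""
    total_vram = sum(vram_gb)
    ideal = [v / total_vram * total_layers for v in vram_gb]
    layers = [int(x) for x in ideal]          # floor of each ideal share
    leftover = total_layers - sum(layers)
    # Hand the remaining layers to the nodes with the largest fractional parts
    by_fraction = sorted(range(len(vram_gb)),
                         key=lambda i: ideal[i] - layers[i], reverse=True)
    for i in by_fraction[:leftover]:
        layers[i] += 1
    return layers

print(assign_layers([24, 16, 16], 80))  # [34, 23, 23]
```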
GPU Type Recommendations
VRAM (most important factor)
| VRAM | GPU options | Best for |
|---|---|---|
| 24 GB | RTX 4090, RTX 3090, RTX 3090 Ti | Primary recommendation |
| 16 GB | RTX 4080, RTX 3080 Ti | Secondary; need more nodes for large models |
| 12 GB | RTX 4070 Ti, RTX 3080 | Only viable for small models (≤13B) |
| 8 GB | RTX 4070, RTX 3070 | Development only; most models won't fit |
Memory Bandwidth (affects decode speed)
Decode throughput is memory-bandwidth-limited, not compute-limited. Higher memory bandwidth = faster tokens.
| GPU | Memory Bandwidth | Relative decode speed |
|---|---|---|
| RTX 4090 | 1,008 GB/s | 1.0× (baseline) |
| RTX 3090 | 936 GB/s | 0.93× |
| RTX 4080 | 717 GB/s | 0.71× |
| RTX 3080 | 760 GB/s | 0.75× |
For latency-sensitive applications, put high-bandwidth GPUs in the last pipeline stage (where decode sampling happens).
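The relative-speed column is simply memory bandwidth normalized to the RTX 4090 baseline; a first-order estimate for any other card (illustrative, since real decode speed also depends on kernel efficiency):

```python
def relative_decode_speed(mem_bandwidth_gbs: float,
                          baseline_gbs: float = 1008.0) -> float:
    """Decode throughput scales roughly linearly with memory bandwidth."""
    return round(mem_bandwidth_gbs / baseline_gbs, 2)

print(relative_decode_speed(936))  # 0.93 (RTX 3090)
print(relative_decode_speed(717))  # 0.71 (RTX 4080)
```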
Kubernetes GPU Node Labelling
For automatic DaemonSet scheduling, label GPU nodes before deploying:
# Label a node as a PanGalactic GPU worker
kubectl label node <node-name> pangalactic.io/gpu=true
# Optional: label with GPU model for affinity rules
kubectl label node <node-name> pangalactic.io/gpu-model=rtx4090
# Verify
kubectl get nodes -l pangalactic.io/gpu=true
The DaemonSet in deploy/kubernetes/node/daemonset.yaml uses nodeSelector: pangalactic.io/gpu: "true" to ensure only labelled nodes run the Node agent.
Network Requirements
| Link | Required | Recommended |
|---|---|---|
| Node ↔ Node (StarStream) | ≥ 1 Gbps | 10 Gbps |
| Node ↔ Hub (gRPC) | ≥ 100 Mbps | 1 Gbps |
| Gateway ↔ Hub (gRPC) | ≥ 100 Mbps | 1 Gbps |
At 1 Gbps, Nebula automatically switches all links to INT8 compression, halving the per-hop transfer size. Prefill hop latency rises to ~12.8 ms per hop, compared with ~6.4 ms, which is acceptable for most use cases.
At 100 Mbps, the system still works but prefill latency increases to ~64 ms per hop. Only suitable for very short prompts or offline batch inference.
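Per-hop transfer time scales linearly with payload size and inversely with link speed. A rough model — the token count, hidden dimension, and compression choice below are illustrative assumptions, not measured PanGalactic payloads:

```python
def hop_transfer_ms(tokens: int, hidden_dim: int, bytes_per_value: float,
                    link_gbps: float) -> float:
    """Time to ship one hop's activations across the link, ignoring framing overhead."""
    payload_bits = tokens * hidden_dim * bytes_per_value * 8
    return payload_bits / (link_gbps * 1e9) * 1e3

# 512-token prefill, hidden_dim 8192, INT8 activations (1 byte/value):
print(round(hop_transfer_ms(512, 8192, 1, 10), 1))  # ~3.4 ms at 10 Gbps
print(round(hop_transfer_ms(512, 8192, 1, 1), 1))   # ~33.6 ms at 1 Gbps
```

INT8 compression halves the payload relative to FP16, which is why Nebula enables it automatically on slower links.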
See Also
- Quickstart — stand up a dev cluster
- Production Deployment — Kubernetes setup
- Architecture: Pipeline Parallelism — bandwidth math