Node Sizing

This guide helps you determine how many GPUs you need, how the VRAM budget is allocated, and how shards are distributed across nodes.


VRAM Requirements by Model

All figures assume Q4_K_M quantization (~4.5 bits per weight). Node counts are the minimum needed once the 85% VRAM fill factor (explained below) is applied. Practical recommendation: add 20% headroom for KV cache growth under concurrent load.

Model               Parameters  Weights (Q4_K_M)  Min nodes (24 GB each)  Min nodes (16 GB each)
------------------  ----------  ----------------  ----------------------  ----------------------
TinyLlama 1.1B      1.1B        0.7 GB            1                       1
Llama 3 8B          8B          4.9 GB            1                       1
Mistral 7B          7B          4.4 GB            1                       1
Llama 3 70B         70B         42 GB             2                       3
Mixtral 8×7B (MoE)  46.7B       28 GB             2                       2
Llama 3 405B        405B        243 GB            11                      16

How to Calculate

weights_gb  = parameters × 0.55  (Q4_K_M ≈ 4.5 bits per weight)
usable_vram = total_vram_gb × 0.85  (VRAM_FILL_FACTOR, 15% reserved for KV + overhead)
min_nodes   = ceil(weights_gb / usable_vram_per_node)

For Llama 3 70B across 2× RTX 4090 (24 GB each):

weights_gb  = 70 × 0.55 = 38.5 GB
usable_vram = 2 × 24 × 0.85 = 40.8 GB
38.5 GB < 40.8 GB  ✓  (fits with 2 nodes, 2.3 GB headroom)
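
The same calculation is easy to script. A minimal sketch in Python of the formula above (the constant and function names are illustrative, not part of the PanGalactic codebase):

import math

Q4_K_M_GB_PER_B_PARAMS = 0.55   # ~4.5 bits per weight
VRAM_FILL_FACTOR = 0.85         # 15% reserved for KV cache + overhead

def min_nodes(params_billions: float, vram_per_node_gb: float) -> int:
    """Minimum number of identical nodes needed to hold the quantized weights."""
    weights_gb = params_billions * Q4_K_M_GB_PER_B_PARAMS
    usable_per_node_gb = vram_per_node_gb * VRAM_FILL_FACTOR
    return math.ceil(weights_gb / usable_per_node_gb)

print(min_nodes(70, 24))    # Llama 3 70B on 24 GB cards -> 2
print(min_nodes(70, 16))    # Llama 3 70B on 16 GB cards -> 3
print(min_nodes(405, 24))   # Llama 3 405B on 24 GB cards -> 11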

VRAM_FILL_FACTOR Explained

The Hub reserves 15% of each node's VRAM for:

  1. KV cache pages — PagedAttention blocks for in-flight requests
  2. Framework overhead — PyTorch/llama-cpp runtime, CUDA context
  3. Headroom buffer — prevents OOM during burst load

For a 24 GB card: 24 × 0.85 = 20.4 GB usable for weights. The remaining 3.6 GB covers the KV cache pool, framework overhead, and burst headroom.

KV cache capacity at default settings (block_size=16, max_blocks=2048):

KV per block ≈ 16 tokens × num_kv_heads × head_dim × 2 (K+V) × 2 bytes (FP16) × layers_on_node
             ≈ 16 × 8 × 128 × 2 × 2 × layers_on_node
             ≈ 64 KB × layers_on_node

For Llama 3 70B: ~64 KB per block per layer, so the 2048-block pool costs ~134 MB per layer, or ~2.7 GB for a node holding 20 of the 80 layers, which fits within the 3.6 GB reserve.

This accommodates roughly 2,000 concurrent requests with an average of 16 tokens of KV history each (one block per request) — more than sufficient for most workloads.
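
The same budget can be checked in Python. A minimal sketch, assuming each PagedAttention block holds KV for every layer resident on the node; num_kv_heads and head_dim are the Llama 3 70B values used above, and the helper name is illustrative:

def kv_pool_bytes(layers_on_node: int,
                  num_kv_heads: int = 8,
                  head_dim: int = 128,
                  block_size: int = 16,
                  max_blocks: int = 2048,
                  bytes_per_elem: int = 2) -> int:
    """Total bytes for the PagedAttention KV pool on one node (FP16 cache)."""
    per_block_per_layer = block_size * num_kv_heads * head_dim * 2 * bytes_per_elem  # K and V
    per_block = per_block_per_layer * layers_on_node
    return per_block * max_blocks

# Llama 3 70B with 20 of 80 layers on this node:
print(kv_pool_bytes(20) / 1e9)   # ~2.7 GB, within the 3.6 GB reserve of a 24 GB card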


Heterogeneous Nodes

PanGalactic supports mixing different GPU models in the same cluster. The shard assignment algorithm allocates layers proportionally to available VRAM:

node_0 (RTX 4090, 24 GB): fraction = 24 / (24 + 16 + 16) = 43%  → ~34 of 80 layers
node_1 (RTX 3060, 16 GB): fraction = 16 / 56 = 29%               → ~23 layers
node_2 (RTX 3060, 16 GB): fraction = 16 / 56 = 29%               → ~23 layers

GPUs with more VRAM (typically also the faster cards) receive more layers and therefore more compute work. This is the intended behaviour: pipeline throughput is set by the slowest stage, so concentrating work on the larger, faster cards keeps stage times balanced instead of stalling the pipeline at a slow node.
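
A sketch of how such a proportional split can be computed, assuming simple largest-remainder rounding (the Hub's actual assignment algorithm may differ in detail):

def assign_layers(vram_gb: list[float], total_layers: int) -> list[int]:
    """Split total_layers across nodes in proportion to each node's VRAM."""
    total_vram = sum(vram_gb)
    exact = [v / total_vram * total_layers for v in vram_gb]
    counts = [int(x) for x in exact]
    # Hand the leftover layers to the nodes with the largest fractional parts.
    leftover = total_layers - sum(counts)
    by_fraction = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts

print(assign_layers([24, 16, 16], 80))   # [34, 23, 23]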


GPU Type Recommendations

VRAM (most important factor)

VRAM   GPU options                                   Best for
-----  --------------------------------------------  -------------------------------------------
24 GB  RTX 4090, RTX 3090, RTX 3090 Ti               Primary recommendation
16 GB  RTX 4080                                      Secondary; need more nodes for large models
12 GB  RTX 4070 Ti, RTX 4070, RTX 3080 Ti, RTX 3080  Only viable for small models (≤13B)
8 GB   RTX 3070, RTX 4060                            Development only; most models won't fit

Memory Bandwidth (affects decode speed)

Decode throughput is memory-bandwidth-limited, not compute-limited: the faster a GPU can stream its share of the weights from VRAM, the faster it generates tokens.

GPU       Memory bandwidth  Relative decode speed
--------  ----------------  ---------------------
RTX 4090  1,008 GB/s        1.0× (baseline)
RTX 3090  936 GB/s          0.93×
RTX 4080  717 GB/s          0.71×
RTX 3080  760 GB/s          0.75×

For latency-sensitive applications, put high-bandwidth GPUs in the last pipeline stage (where decode sampling happens).
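
A back-of-the-envelope way to estimate decode speed is a roofline model: generating one token requires streaming a node's share of the weights out of VRAM once. The sketch below assumes that approximation holds exactly (real kernels add overhead); the ~21 GB per-node weight figure is Llama 3 70B Q4_K_M split across two nodes:

def decode_tokens_per_second(weight_bytes_on_node: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode rate for one pipeline stage."""
    seconds_per_token = weight_bytes_on_node / (bandwidth_gb_per_s * 1e9)
    return 1.0 / seconds_per_token

print(decode_tokens_per_second(21e9, 1008))   # RTX 4090: ~48 tokens/s per stage
print(decode_tokens_per_second(21e9, 717))    # RTX 4080: ~34 tokens/s per stage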


Kubernetes GPU Node Labelling

For automatic DaemonSet scheduling, label GPU nodes before deploying:

# Label a node as a PanGalactic GPU worker
kubectl label node <node-name> pangalactic.io/gpu=true

# Optional: label with GPU model for affinity rules
kubectl label node <node-name> pangalactic.io/gpu-model=rtx4090

# Verify
kubectl get nodes -l pangalactic.io/gpu=true

The DaemonSet in deploy/kubernetes/node/daemonset.yaml sets a nodeSelector of pangalactic.io/gpu: "true", so only labelled nodes run the Node agent.


Network Requirements

Link                      Required    Recommended
------------------------  ----------  -----------
Node ↔ Node (StarStream)  ≥ 1 Gbps    10 Gbps
Node ↔ Hub (gRPC)         ≥ 100 Mbps  1 Gbps
Gateway ↔ Hub (gRPC)      ≥ 100 Mbps  1 Gbps

At 1 Gbps, Nebula automatically switches all links to INT8 compression, halving the transfer size. Prefill hop latency drops to ~6.4 ms per hop instead of ~12.8 ms uncompressed, which is acceptable for most use cases.

At 100 Mbps the system still works, but prefill latency increases to ~64 ms per hop; this is only suitable for very short prompts or offline batch inference.
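
For rough planning with other prompt lengths, the per-hop payload is the activation tensor for the tokens in flight. The sketch below assumes a 100-token prefill chunk and a hidden size of 8192 (Llama 3 70B); both values are assumptions chosen to roughly reproduce the figures above, not settings read from Nebula:

def hop_transfer_ms(num_tokens: int, hidden_size: int, link_gbps: float, bytes_per_elem: int) -> float:
    """Time to push one prefill activation tensor across a single pipeline hop."""
    payload_bits = num_tokens * hidden_size * bytes_per_elem * 8
    return payload_bits / (link_gbps * 1e9) * 1e3

print(hop_transfer_ms(100, 8192, 1.0, 1))    # 1 Gbps, INT8:    ~6.6 ms
print(hop_transfer_ms(100, 8192, 1.0, 2))    # 1 Gbps, FP16:   ~13.1 ms
print(hop_transfer_ms(100, 8192, 0.1, 1))    # 100 Mbps, INT8: ~66 ms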


See Also