Node Sizing

This guide helps you determine how many GPUs you need, how the VRAM budget is allocated, and how shards are distributed across nodes.


VRAM Requirements by Model

All figures assume Q4_K_M quantization (~4.5 bits per weight). Node counts are the minimum needed once the 85% VRAM fill factor (explained below) is applied. Practical recommendation: add 20% headroom for KV cache growth under concurrent load.

Model               Parameters  Weights (Q4_K_M)  Min nodes (24 GB each)  Min nodes (16 GB each)
------------------  ----------  ----------------  ----------------------  ----------------------
TinyLlama 1.1B      1.1B        0.7 GB            1                       1
Llama 3 8B          8B          4.9 GB            1                       1
Mistral 7B          7B          4.4 GB            1                       1
Llama 3 70B         70B         42 GB             2                       3
Mixtral 8×7B (MoE)  46.7B       28 GB             2                       2
Llama 3 405B        405B        243 GB            11                      16

How to Calculate

weights_gb  = parameters × 0.55  (Q4_K_M ≈ 4.5 bits per weight)
usable_vram = total_vram_gb × 0.85  (VRAM_FILL_FACTOR, 15% reserved for KV + overhead)
min_nodes   = ceil(weights_gb / usable_vram_per_node)

For Llama 3 70B across 2× RTX 4090 (24 GB each):

weights_gb  = 70 × 0.55 = 38.5 GB
usable_vram = 2 × 24 × 0.85 = 40.8 GB
38.5 GB < 40.8 GB  ✓  (fits with 2 nodes, 2.3 GB headroom)
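
The same calculation is easy to script. A minimal sketch in Python of the formula above (the constant and function names are illustrative, not part of the PanGalactic codebase):

import math

Q4_K_M_GB_PER_B_PARAMS = 0.55   # ~4.5 bits per weight
VRAM_FILL_FACTOR = 0.85         # 15% reserved for KV cache + overhead

def min_nodes(params_billions: float, vram_per_node_gb: float) -> int:
    """Minimum number of identical nodes needed to hold the quantized weights."""
    weights_gb = params_billions * Q4_K_M_GB_PER_B_PARAMS
    usable_per_node_gb = vram_per_node_gb * VRAM_FILL_FACTOR
    return math.ceil(weights_gb / usable_per_node_gb)

print(min_nodes(70, 24))    # Llama 3 70B on 24 GB cards -> 2
print(min_nodes(70, 16))    # Llama 3 70B on 16 GB cards -> 3
print(min_nodes(405, 24))   # Llama 3 405B on 24 GB cards -> 11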

VRAM_FILL_FACTOR Explained

The Hub reserves 15% of each node's VRAM for:

  1. KV cache pages — PagedAttention blocks for in-flight requests
  2. Framework overhead — PyTorch/llama-cpp runtime, CUDA context
  3. Headroom buffer — prevents OOM during burst load

For a 24 GB card: 24 × 0.85 = 20.4 GB usable for weights. The remaining 3.6 GB covers the KV cache pool, framework overhead, and burst headroom.

KV cache capacity at default settings (block_size=16, max_blocks=2048):

KV per block ≈ 16 tokens × num_kv_heads × head_dim × 2 (K+V) × 2 bytes (FP16) × layers_on_node
             ≈ 16 × 8 × 128 × 2 × 2 × layers_on_node
             ≈ 64 KB × layers_on_node

For Llama 3 70B: ~64 KB per block per layer, so the 2048-block pool costs ~134 MB per layer, or ~2.7 GB for a node holding 20 of the 80 layers, which fits within the 3.6 GB reserve.

This accommodates roughly 2,000 concurrent requests with an average of 16 tokens of KV history each (one block per request) — more than sufficient for most workloads.
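
The same budget can be checked in Python. A minimal sketch, assuming each PagedAttention block holds KV for every layer resident on the node; num_kv_heads and head_dim are the Llama 3 70B values used above, and the helper name is illustrative:

def kv_pool_bytes(layers_on_node: int,
                  num_kv_heads: int = 8,
                  head_dim: int = 128,
                  block_size: int = 16,
                  max_blocks: int = 2048,
                  bytes_per_elem: int = 2) -> int:
    """Total bytes for the PagedAttention KV pool on one node (FP16 cache)."""
    per_block_per_layer = block_size * num_kv_heads * head_dim * 2 * bytes_per_elem  # K and V
    per_block = per_block_per_layer * layers_on_node
    return per_block * max_blocks

# Llama 3 70B with 20 of 80 layers on this node:
print(kv_pool_bytes(20) / 1e9)   # ~2.7 GB, within the 3.6 GB reserve of a 24 GB card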


Heterogeneous Nodes

PanGalactic supports mixing different GPU models in the same cluster. The shard assignment algorithm allocates layers proportionally to available VRAM:

node_0 (RTX 4090, 24 GB): fraction = 24 / (24 + 16 + 16) = 43%  → ~34 of 80 layers
node_1 (RTX 3060, 16 GB): fraction = 16 / 56 = 29%               → ~23 layers
node_2 (RTX 3060, 16 GB): fraction = 16 / 56 = 29%               → ~23 layers

GPUs with more VRAM (typically also the faster cards) receive more layers and therefore more compute work. This is the intended behaviour: pipeline throughput is set by the slowest stage, so concentrating work on the larger, faster cards keeps stage times balanced instead of stalling the pipeline at a slow node.
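
A sketch of how such a proportional split can be computed, assuming simple largest-remainder rounding (the Hub's actual assignment algorithm may differ in detail):

def assign_layers(vram_gb: list[float], total_layers: int) -> list[int]:
    """Split total_layers across nodes in proportion to each node's VRAM."""
    total_vram = sum(vram_gb)
    exact = [v / total_vram * total_layers for v in vram_gb]
    counts = [int(x) for x in exact]
    # Hand the leftover layers to the nodes with the largest fractional parts.
    leftover = total_layers - sum(counts)
    by_fraction = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts

print(assign_layers([24, 16, 16], 80))   # [34, 23, 23]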


GPU Type Recommendations

VRAM (most important factor)

VRAM   GPU options                                   Best for
-----  --------------------------------------------  -------------------------------------------
24 GB  RTX 4090, RTX 3090, RTX 3090 Ti               Primary recommendation
16 GB  RTX 4080                                      Secondary; need more nodes for large models
12 GB  RTX 4070 Ti, RTX 4070, RTX 3080 Ti, RTX 3080  Only viable for small models (≤13B)
8 GB   RTX 3070, RTX 4060                            Development only; most models won't fit

Memory Bandwidth (affects decode speed)

Decode throughput is memory-bandwidth-limited, not compute-limited: the faster a GPU can stream its share of the weights from VRAM, the faster it generates tokens.

GPU       Memory bandwidth  Relative decode speed
--------  ----------------  ---------------------
RTX 4090  1,008 GB/s        1.0× (baseline)
RTX 3090  936 GB/s          0.93×
RTX 4080  717 GB/s          0.71×
RTX 3080  760 GB/s          0.75×

For latency-sensitive applications, put high-bandwidth GPUs in the last pipeline stage (where decode sampling happens).
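
A back-of-the-envelope way to estimate decode speed is a roofline model: generating one token requires streaming a node's share of the weights out of VRAM once. The sketch below assumes that approximation holds exactly (real kernels add overhead); the ~21 GB per-node weight figure is Llama 3 70B Q4_K_M split across two nodes:

def decode_tokens_per_second(weight_bytes_on_node: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode rate for one pipeline stage."""
    seconds_per_token = weight_bytes_on_node / (bandwidth_gb_per_s * 1e9)
    return 1.0 / seconds_per_token

print(decode_tokens_per_second(21e9, 1008))   # RTX 4090: ~48 tokens/s per stage
print(decode_tokens_per_second(21e9, 717))    # RTX 4080: ~34 tokens/s per stage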


Kubernetes GPU Node Labelling

For automatic DaemonSet scheduling, label GPU nodes before deploying:

# Label a node as a PanGalactic GPU worker
kubectl label node <node-name> pangalactic.io/gpu=true

# Optional: label with GPU model for affinity rules
kubectl label node <node-name> pangalactic.io/gpu-model=rtx4090

# Verify
kubectl get nodes -l pangalactic.io/gpu=true

The DaemonSet in deploy/kubernetes/node/daemonset.yaml sets a nodeSelector of pangalactic.io/gpu: "true", so only labelled nodes run the Node agent.


Network Requirements

Link                      Required    Recommended
------------------------  ----------  -----------
Node ↔ Node (StarStream)  ≥ 1 Gbps    10 Gbps
Node ↔ Hub (gRPC)         ≥ 100 Mbps  1 Gbps
Gateway ↔ Hub (gRPC)      ≥ 100 Mbps  1 Gbps

At 1 Gbps, Nebula automatically switches all links to INT8 compression, halving the transfer size. Prefill hop latency drops to ~6.4 ms per hop instead of ~12.8 ms uncompressed, which is acceptable for most use cases.

At 100 Mbps the system still works, but prefill latency increases to ~64 ms per hop; this is only suitable for very short prompts or offline batch inference.
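
For rough planning with other prompt lengths, the per-hop payload is the activation tensor for the tokens in flight. The sketch below assumes a 100-token prefill chunk and a hidden size of 8192 (Llama 3 70B); both values are assumptions chosen to roughly reproduce the figures above, not settings read from Nebula:

def hop_transfer_ms(num_tokens: int, hidden_size: int, link_gbps: float, bytes_per_elem: int) -> float:
    """Time to push one prefill activation tensor across a single pipeline hop."""
    payload_bits = num_tokens * hidden_size * bytes_per_elem * 8
    return payload_bits / (link_gbps * 1e9) * 1e3

print(hop_transfer_ms(100, 8192, 1.0, 1))    # 1 Gbps, INT8:    ~6.6 ms
print(hop_transfer_ms(100, 8192, 1.0, 2))    # 1 Gbps, FP16:   ~13.1 ms
print(hop_transfer_ms(100, 8192, 0.1, 1))    # 100 Mbps, INT8: ~66 ms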


See Also