System Overview

PanGalactic turns a collection of consumer GPUs connected by ordinary Ethernet into a single logical inference engine capable of running models that exceed the VRAM of any individual card.


The Problem

A Llama 3 70B model at Q4_K_M quantization requires roughly 40 GB of VRAM for the weights alone. A single RTX 4090 has 24 GB. The model simply does not fit.

Cloud GPU instances solve this with specialized interconnects (NVLink, InfiniBand), but these cost 10–50× more per hour than consumer cards. PanGalactic's goal is to achieve the same result using off-the-shelf hardware and a standard 10 Gbps switch.
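The arithmetic behind the VRAM gap can be sketched directly. The ~4.85 bits-per-weight average used here is an approximation for Q4_K_M, not a figure from this document:

```python
# Back-of-envelope check that a Q4_K_M 70B model cannot fit on one 24 GB card.
# The ~4.85 bits-per-weight average for Q4_K_M is an approximation
# (the format mixes 4- and 6-bit blocks).
PARAMS = 70e9              # Llama 3 70B parameter count
BITS_PER_WEIGHT = 4.85     # Q4_K_M approximate average
GIB = 1024 ** 3

weights_gib = PARAMS * BITS_PER_WEIGHT / 8 / GIB
print(f"weights alone: ~{weights_gib:.1f} GiB vs 24 GiB on an RTX 4090")
```

KV cache and activation buffers come on top of this, so the real requirement is higher still.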


The Three-Process Model

Every PanGalactic deployment consists of exactly three kinds of process:

┌──────────────────────────────────────────────────────────────────┐
│                       GATEWAY (FastAPI)                          │
│          OpenAI-compatible REST + SSE streaming API              │
│          POST /v1/chat/completions   GET /v1/models   /health    │
└──────────────────────────┬───────────────────────────────────────┘
                           │  gRPC: GetRoutePlan
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                        HUB (Orchestrator)                        │
│    Shard assignment  ·  Quasar routing  ·  Supernova failover    │
│    Pulsar scheduling  ·  Model registry  ·  Cluster topology     │
└──────┬───────────────────────────────────────────┬───────────────┘
       │  gRPC: AssignShard / Heartbeat            │  ZMQ telemetry
       ▼                                           ▼
┌────────────┐  StarStream  ┌──────────────┐  StarStream  ┌──────────────┐
│   NODE 0   │ ───────────► │    NODE 1    │ ───────────► │    NODE 2    │
│ RTX 4090   │  FP16/INT8   │   RTX 3090   │  FP16/INT8   │   RTX 4090   │
│ Layers 0–19│  ~8 MB/hop   │ Layers 20–39 │  ~8 MB/hop   │ Layers 40–59 │
│ + Embed    │  ~6.4 ms     │              │              │ + lm_head    │
└────────────┘              └──────────────┘              └─────┬────────┘
                                                                │
                                                 TOKEN_READY → Gateway → SSE

Gateway

The user-facing component. Accepts OpenAI-compatible HTTP requests, calls the Hub to get a routing plan, dispatches the prompt to the first pipeline node via StarStream, and streams tokens back to the client as Server-Sent Events (SSE).

Runs as a stateless FastAPI application. Multiple Gateway replicas can sit behind a load balancer.
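The token-to-client leg can be illustrated with a sketch of SSE chunk framing. The helper name `sse_chunk` and the default model id are invented; the payload layout follows the public chat.completions streaming format:

```python
import json

# Hypothetical sketch: framing one generated token as an OpenAI-compatible
# Server-Sent Events chunk, as the Gateway streams tokens to the client.
# The function name and model id are invented for illustration.
def sse_chunk(token: str, request_id: str,
              model: str = "pangalactic-llama3-70b") -> str:
    payload = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0,
                     "delta": {"content": token},
                     "finish_reason": None}],
    }
    return f"data: {json.dumps(payload)}\n\n"   # SSE framing: data line + blank line
```

A stream conventionally ends with a final chunk whose `finish_reason` is set, followed by the literal `data: [DONE]` sentinel.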

Hub

The control plane. Responsible for:

  • Tracking which nodes are alive (via heartbeat)
  • Assigning model shards to nodes (greedy bin-pack algorithm)
  • Routing requests to the right pipeline (Quasar)
  • Detecting node failures and reassigning shards (Supernova)
  • Predicting load spikes and pre-warming nodes (Pulsar)

There is one Hub per cluster. In production it can be backed by etcd for high availability.
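The greedy bin-pack mentioned above can be sketched as follows. The function name, per-layer cost, and node budgets are invented numbers, not the real API:

```python
# Sketch of a greedy bin-pack for shard assignment: sort nodes by free VRAM,
# then hand out consecutive layer ranges until the whole model is placed.
# assign_shards and its parameters are hypothetical illustrations.
def assign_shards(layer_cost_gb, n_layers, free_vram_gb):
    plan, start = {}, 0
    for node, vram in sorted(free_vram_gb.items(), key=lambda kv: -kv[1]):
        take = min(int(vram // layer_cost_gb), n_layers - start)
        if take > 0:
            plan[node] = (start, start + take - 1)  # inclusive layer range
            start += take
        if start == n_layers:
            return plan
    raise RuntimeError("cluster lacks the free VRAM to place every layer")

# 60 layers at 0.5 GB each across three 10 GB budgets -> 20 layers per node,
# matching the three-stage pipeline in the diagram above.
print(assign_shards(0.5, 60, {"node0": 10.0, "node1": 10.0, "node2": 10.0}))
```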

Node

The compute worker. Each Node runs on one GPU and:

  • Loads one or more consecutive transformer layers from a GGUF file
  • Receives hidden-state tensors from the previous node via StarStream
  • Runs the local forward pass (attention + FFN for its assigned layers)
  • Sends the output tensor to the next node (or returns the final token to the Gateway)

Nodes register with the Hub on startup and receive shard assignments via gRPC.
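The per-node loop can be sketched abstractly. `Node`, `step`, and the add-one toy "layers" are invented stand-ins; only the control flow mirrors the bullets above (run the local layers, hand the result to the next stage):

```python
# Abstract sketch of the pipeline: each stage applies its shard's layers and
# forwards the result. Real nodes run GPU kernels and StarStream I/O here.
class Node:
    def __init__(self, layers, next_node=None):
        self.layers = layers        # this shard's consecutive layers
        self.next_node = next_node  # None means last pipeline stage

    def step(self, hidden):
        for layer in self.layers:   # attention + FFN per real layer
            hidden = layer(hidden)
        return hidden

# Three stages of 20 toy layers each, matching the 60-layer example split.
n2 = Node([lambda h: h + 1] * 20)
n1 = Node([lambda h: h + 1] * 20, next_node=n2)
n0 = Node([lambda h: h + 1] * 20, next_node=n1)

h, node = 0, n0
while node is not None:
    h = node.step(h)
    node = node.next_node
print(h)  # all 60 toy layers applied -> 60
```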


Two Network Planes

Plane       Protocol                           Purpose
Control     gRPC (TCP)                         Hub↔Node: heartbeats, shard assignments, status
Data        StarStream (custom binary, TCP)    Node↔Node: tensor transfers during inference

The planes are separated so a spike in tensor traffic (data plane) cannot starve control messages. Telemetry uses ZeroMQ PUB/SUB as a third, fully decoupled channel.
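Wire-time arithmetic motivates the separation: on a shared connection, a heartbeat queued behind even one prefill-sized tensor waits several milliseconds. The sizes below mirror the pipeline diagram:

```python
# Why a shared link is risky: exclusive wire time of one prefill tensor
# at 10 Gbps. Sizes follow the ~8 MB / ~6.4 ms per-hop figures above.
LINK_BPS = 10e9         # 10 Gbps Ethernet
tensor_bytes = 8e6      # ~8 MB FP16 hidden state per prefill hop
wire_ms = tensor_bytes * 8 / LINK_BPS * 1e3
print(f"~{wire_ms:.1f} ms of exclusive wire time per hop")  # -> ~6.4 ms
```

Separate TCP connections let the kernel interleave control traffic instead of serializing it behind bulk tensor frames.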


Token Generation Flow

Client POST /v1/chat/completions
  ↓
Gateway authenticates + classifies request (Quasar: prefill_heavy / streaming / standard)
  ↓
gRPC GetRoutePlan → Hub returns PipelinePlan (ordered stages, compression hints)
  ↓
PREFILL — compute-bound:
  Node0: tokenize + embed + layers 0–19 → hidden_states [1, seq_len, 8192]
  StarStream (8 MB FP16 or 4 MB INT8) →
  Node1: layers 20–39 → hidden_states →
  Node2: layers 40–59 + lm_head → sample first token
  ↓
DECODE — memory-bandwidth-bound (one token at a time):
  Node0: embed(1 token) → [1, 1, 8192]
  StarStream (64 KB — KV cache hit, only 1 new token) →
  Node1 → Node2: sample next token
  Token sent as TOKEN_READY frame → Gateway → SSE chunk to client
  Repeat until EOS or max_tokens
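The order-of-magnitude gap between prefill and decode transfers follows directly from the hidden size. This sketch counts raw FP16 activations only; the larger decode frame quoted in the flow above would include whatever header and metadata StarStream adds on top, which is not modeled here:

```python
# Per-hop payload sizes from the hidden dimension (8192) at FP16 (2 bytes).
# Raw activations only; protocol framing overhead is not modeled.
HIDDEN, BYTES_FP16 = 8192, 2
seq_len = 512                                  # example prompt length
prefill_bytes = seq_len * HIDDEN * BYTES_FP16  # whole prompt's hidden states
decode_bytes = 1 * HIDDEN * BYTES_FP16         # one new token per step
print(prefill_bytes // 2**20, "MiB prefill |", decode_bytes // 1024, "KiB decode")
```

This is why prefill is compute-bound while decode is dominated by per-token latency: the decode payload is hundreds of times smaller per hop.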

Design Philosophy

Pipeline parallelism only. Tensor parallelism (splitting a single layer across multiple GPUs) requires NVLink-class bandwidth (~600 GB/s). A 10 Gbps link is roughly 480× slower, so splitting a layer across the network would cost ~50 ms per attention synchronization, which is unusable. By keeping each layer on one GPU and splitting whole layers across GPUs, communication happens only at the boundaries between pipeline stages.
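Comparing the two interconnects in consistent units:

```python
# Aggregate NVLink bandwidth vs 10 GbE, both expressed in bits per second.
nvlink_bps = 600e9 * 8     # ~600 GB/s -> bits/s
ethernet_bps = 10e9        # 10 Gbps
print(f"~{nvlink_bps / ethernet_bps:.0f}x slower")  # -> ~480x slower
```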

Consumer hardware, production reliability. The entire fault-tolerance model assumes GPUs will fail, disconnect, and rejoin. Supernova failover, hot-swap shards, and the heartbeat health monitor exist precisely because consumer hardware is not enterprise-reliable.

Zero mandatory configuration. A node can join a LAN cluster by running a single command. Star Formation (mDNS + DHT) handles discovery automatically.

OpenAI-compatible by default. Existing client libraries (LangChain, llama-index, openai-python) work without modification.


Further Reading