Run 70B models on RTX cards. Together.
Pipeline-parallel LLM inference across your consumer GPU cluster over standard 10 Gbps Ethernet. OpenAI-compatible API. Self-organizing cluster.
Six novel systems not found in any existing distributed inference framework.
Per-link adaptive compression. Automatically negotiates INT8 vs. FP16 tensor transfer per link based on measured bandwidth, halving wire traffic on slower links with under 0.5% accuracy impact.
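As a rough sketch of the idea (the names and the bandwidth threshold below are illustrative, not PanGalactic's actual internals): fast links keep FP16, slow links drop to INT8, which halves bytes on the wire.

```python
# Hypothetical sketch of per-link compression negotiation.
# The 5 Gbps cutoff is an assumed tuning parameter, not the real default.
from dataclasses import dataclass


@dataclass
class LinkProfile:
    bandwidth_gbps: float  # measured, e.g. via periodic probes


def negotiate_dtype(link: LinkProfile, threshold_gbps: float = 5.0) -> str:
    """Pick the tensor wire format for a link: FP16 on fast links,
    INT8 on slow ones."""
    return "fp16" if link.bandwidth_gbps >= threshold_gbps else "int8"


def wire_bytes(num_elements: int, dtype: str) -> int:
    """Bytes needed to ship a tensor of num_elements in the chosen format."""
    return num_elements * (2 if dtype == "fp16" else 1)
```

A 10 Gbps link keeps FP16; a congested 2.5 Gbps link negotiates down to INT8 and ships half the bytes.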
Zero-config node discovery using mDNS on LAN with a Kademlia DHT fallback for multi-subnet clusters. A new node joins with a single command — no IPs, no config files.
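The core decision — mDNS for peers on the same LAN, DHT otherwise — can be sketched as follows (a simplification with assumed names; the real mDNS and Kademlia transports are out of scope here):

```python
# Illustrative discovery-fallback decision: prefer mDNS when the peer
# shares our subnet, fall back to the Kademlia DHT across subnets.
import ipaddress


def discovery_method(local_ip: str, peer_ip: str, prefix: int = 24) -> str:
    """Return which discovery plane reaches a peer (assumed /24 LAN)."""
    local_net = ipaddress.ip_network(f"{local_ip}/{prefix}", strict=False)
    return "mdns" if ipaddress.ip_address(peer_ip) in local_net else "dht"
```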
Node failure triggers automatic shard reassignment within seconds. In-flight requests checkpoint at pipeline stage boundaries and retry transparently. The client never sees an error.
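A minimal sketch of stage-boundary checkpointing (all names here are illustrative): each request records the last completed pipeline stage, so after a node dies the retry resumes from that boundary instead of restarting from stage 0.

```python
# Hedged sketch: checkpoint at pipeline stage boundaries, resume on failure.
from dataclasses import dataclass


@dataclass
class Request:
    stages_done: int = 0
    activations: bytes = b""  # checkpointed boundary activations


def run_pipeline(req, stages, fail_at=None):
    """Run remaining stages; raise to simulate a node dying mid-pipeline."""
    for i in range(req.stages_done, len(stages)):
        if i == fail_at:
            raise RuntimeError(f"node for stage {i} failed")
        req.activations = stages[i](req.activations)
        req.stages_done = i + 1  # checkpoint at the stage boundary
    return req.activations


def run_with_retry(req, stages, fail_once_at=None):
    try:
        return run_pipeline(req, stages, fail_at=fail_once_at)
    except RuntimeError:
        # shard reassigned; resume from the last checkpointed boundary
        return run_pipeline(req, stages)
```

Stages 0 and 1 are never re-run after a failure at stage 2 — only the remaining stages execute on the reassigned shard, and the client sees a completed response, not an error.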
Requests are classified on arrival — long prompts route to compute-heavy prefill pipelines, streaming requests go to low-latency decode pipelines, priority requests skip the queue entirely.
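The routing policy reduces to a small classifier on arrival. A sketch under assumed names and thresholds (the 2048-token cutoff is illustrative):

```python
# Illustrative request classifier: priority skips the queue, long prompts
# go to compute-heavy prefill pipelines, streaming goes to low-latency decode.
def classify(prompt_tokens: int, stream: bool, priority: bool,
             long_prompt_threshold: int = 2048) -> str:
    if priority:
        return "priority"            # skips the queue entirely
    if prompt_tokens >= long_prompt_threshold:
        return "prefill-heavy"       # compute-bound: long prompt ingestion
    if stream:
        return "low-latency-decode"  # latency-bound: token streaming
    return "default"
```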
Exponential moving average detects request rate spikes 30 seconds before they peak. Dormant nodes are pre-warmed automatically, eliminating cold-start latency when traffic surges arrive.
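One common way to build such a detector — shown here as an assumption-laden sketch, not PanGalactic's exact algorithm — is to compare a fast EMA of request rate against a slow one; the fast EMA pulling well ahead signals a surge building, giving the scheduler lead time to pre-warm nodes.

```python
# Illustrative dual-EMA spike detector (alphas and ratio are assumed tunables).
class SpikeDetector:
    def __init__(self, fast_alpha=0.5, slow_alpha=0.05, ratio=1.5):
        self.fast_alpha, self.slow_alpha, self.ratio = fast_alpha, slow_alpha, ratio
        self.fast_ema = self.slow_ema = None

    def observe(self, requests_per_sec: float) -> bool:
        """Feed one rate sample; return True when a surge is building."""
        if self.fast_ema is None:
            self.fast_ema = self.slow_ema = requests_per_sec
            return False
        self.fast_ema += self.fast_alpha * (requests_per_sec - self.fast_ema)
        self.slow_ema += self.slow_alpha * (requests_per_sec - self.slow_ema)
        return self.fast_ema > self.ratio * self.slow_ema
```

On a steady 10 req/s baseline the detector stays quiet; the first samples of a jump to 50 req/s trip it, well before the surge peaks.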
Upgrade a model shard on a live node without dropping in-flight requests. The manager drains active requests, loads the new weights, and resumes — zero downtime model updates.
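The drain-swap-resume cycle can be sketched like this (a simplified, single-node model with assumed names, not the manager's real API): stop admitting new work, wait for in-flight requests to finish, swap weights, resume.

```python
# Hedged sketch of a zero-downtime shard swap via drain-and-resume.
import threading


class ShardManager:
    def __init__(self, weights: str):
        self.weights = weights
        self.in_flight = 0
        self.admitting = True
        self._cv = threading.Condition()

    def start_request(self) -> bool:
        with self._cv:
            if not self.admitting:
                return False  # caller re-routes to another replica
            self.in_flight += 1
            return True

    def finish_request(self):
        with self._cv:
            self.in_flight -= 1
            self._cv.notify_all()

    def upgrade(self, new_weights: str):
        with self._cv:
            self.admitting = False        # drain: no new admissions
            while self.in_flight > 0:     # wait for active requests
                self._cv.wait()
            self.weights = new_weights    # load the new shard weights
            self.admitting = True         # resume serving
```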
Three processes. Two network planes. One OpenAI-compatible API.
Connect to your PanGalactic cluster and stream tokens in real time.
Measured on TinyLlama 1.1B across 2 nodes with 10 Gbps interconnect.
No GPU required for the dev stack. Simulated VRAM, mock inference engine.
# Start the full stack (Hub + 3 nodes + Gateway + Prometheus + Grafana)
cd deploy/docker
docker compose up -d
# Download TinyLlama (~670 MB GGUF)
pangalactic model download tinyllama-1.1b
# Check cluster topology
pangalactic cluster topology
# Stream tokens
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer dev-key-not-for-production" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
# Non-streaming completion
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is pipeline parallelism?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
# Streaming (SSE)
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hi"}],"stream":true}'
# List available models
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY"
from openai import OpenAI

# PanGalactic is fully OpenAI-compatible
client = OpenAI(
    api_key="dev-key-not-for-production",
    base_url="http://localhost:8080/v1",
)

# Non-streaming
response = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Explain pipeline parallelism."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Create namespace
kubectl create namespace pangalactic
# Label your GPU nodes
kubectl label node <node-name> pangalactic.io/gpu=true
# Deploy Hub + Gateway + Node DaemonSet
kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic
# Apply the HTTPRoute (Envoy Gateway)
kubectl apply -f deploy/kubernetes/gateway/httproute.yaml -n pangalactic
# Check cluster health
pangalactic cluster status --hub-addr <hub-service-ip>:7100