Open Source · Consumer Hardware · Zero Config

PanGalactic

Run 70B models on RTX cards. Together.

Pipeline-parallel LLM inference across your consumer GPU cluster over standard 10 Gbps Ethernet. OpenAI-compatible API. Self-organizing cluster.

24 GB × N nodes VRAM scales linearly
70B+ Parameters per cluster
10 Gbps Standard Ethernet only
Core Innovations

Built Different

Six novel systems not found in any existing distributed inference framework.

Nebula Protocol

Per-link adaptive compression. Automatically negotiates INT8 vs FP16 tensor transfer based on measured bandwidth, cutting wire usage by 50% on slower links with less than 0.5% accuracy impact.

~50% bandwidth savings
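The negotiation logic can be sketched in a few lines. The function names and the 5 Gbps cutoff below are illustrative assumptions, not PanGalactic's actual API; the byte math, though, matches the 8 MB/hop prefill figure in the architecture section.

```python
# Hypothetical sketch of Nebula's per-link codec negotiation.
# negotiate_codec and wire_bytes are illustrative names, not the real API.
def negotiate_codec(measured_gbps: float, threshold_gbps: float = 5.0) -> str:
    """Pick the wire format for one link: INT8 halves the bytes of FP16,
    so slow links trade a little precision for bandwidth."""
    return "int8" if measured_gbps < threshold_gbps else "fp16"

def wire_bytes(num_elements: int, codec: str) -> int:
    bytes_per_elem = {"fp16": 2, "int8": 1}[codec]
    return num_elements * bytes_per_elem

# Hidden states for a 512-token prompt at 8192 dims:
elems = 512 * 8192
fast = wire_bytes(elems, negotiate_codec(9.4))   # healthy 10 GbE link -> FP16, 8 MB
slow = wire_bytes(elems, negotiate_codec(2.5))   # congested link -> INT8, 4 MB
assert slow * 2 == fast                          # the advertised ~50% savings
```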

Star Formation

Zero-config node discovery using mDNS on LAN with a Kademlia DHT fallback for multi-subnet clusters. A new node joins with a single command — no IPs, no config files.

Zero-config clustering
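The join decision reduces to a simple preference order, sketched below. The `discover_hub` function and the address formats are assumptions for illustration; the real discovery runs inside the node daemon.

```python
# Illustrative join logic: prefer an mDNS answer from the local subnet,
# fall back to the Kademlia DHT for multi-subnet clusters.
def discover_hub(mdns_results: list[str], dht_results: list[str]) -> str:
    """Return the hub address a joining node should dial."""
    if mdns_results:            # a same-LAN hub answered the multicast query
        return mdns_results[0]
    if dht_results:             # no LAN hub: ask the DHT across subnets
        return dht_results[0]
    raise RuntimeError("no hub found via mDNS or DHT")

assert discover_hub(["10.0.0.5:7100"], []) == "10.0.0.5:7100"
assert discover_hub([], ["192.168.8.2:7100"]) == "192.168.8.2:7100"
```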

Supernova Failover

Node failure triggers automatic shard reassignment within seconds. In-flight requests checkpoint at pipeline stage boundaries and retry transparently. The client never sees an error.

Sub-15s recovery
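The transparent retry hinges on checkpointing at stage boundaries, which can be sketched like this. The dict-based checkpoint and all names are illustrative, not PanGalactic's wire format.

```python
# Sketch of stage-boundary checkpointing: a request resumes from the last
# completed pipeline stage instead of restarting from the prompt.
def run_pipeline(stages, ckpt):
    """ckpt: {'stage': last completed index, 'hidden': latest hidden state}."""
    hidden = ckpt["hidden"]
    for i in range(ckpt["stage"] + 1, len(stages)):
        hidden = stages[i](hidden)
        ckpt["stage"], ckpt["hidden"] = i, hidden   # checkpoint at boundary
    return hidden

calls = {"s2": 0}
def s0(h): return h + "->s0"
def s1(h): return h + "->s1"
def s2(h):
    calls["s2"] += 1
    if calls["s2"] == 1:
        raise ConnectionError("node died mid-request")  # simulated failure
    return h + "->s2"

ckpt = {"stage": -1, "hidden": "prompt"}
try:
    run_pipeline([s0, s1, s2], ckpt)
except ConnectionError:
    pass                                  # shard reassigned, then retried
out = run_pipeline([s0, s1, s2], ckpt)    # resumes at stage 2, not stage 0
assert out == "prompt->s0->s1->s2"
assert calls["s2"] == 2                   # only the failed stage ran twice
```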

Quasar Routing

Requests are classified on arrival — long prompts route to compute-heavy prefill pipelines, streaming requests go to low-latency decode pipelines, priority requests skip the queue entirely.

Intelligent routing
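The classification rule can be sketched as a three-way branch. The pipeline names, the 512-token "long prompt" cutoff, and the precedence among the rules are assumptions for illustration, not the shipped defaults.

```python
# Hedged sketch of Quasar's arrival-time request classification.
def classify(prompt_tokens: int, stream: bool, priority: bool) -> str:
    if priority:
        return "priority"     # skips the queue entirely
    if stream:
        return "decode"       # low-latency streaming pipeline
    if prompt_tokens > 512:
        return "prefill"      # compute-heavy long-prompt pipeline
    return "decode"

assert classify(2048, stream=False, priority=False) == "prefill"
assert classify(32, stream=True, priority=False) == "decode"
assert classify(32, stream=True, priority=True) == "priority"
```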

Pulsar Scheduling

An exponential moving average over the request rate detects spikes up to 30 seconds before they peak. Dormant nodes are pre-warmed automatically, eliminating cold-start latency when the surge arrives.

Predictive pre-warming
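The detector itself is small; here is a minimal sketch, assuming a plain EMA with a fixed spike ratio. The smoothing factor and threshold below are illustrative, not Pulsar's tuned values.

```python
# Illustrative sketch of Pulsar's EMA-based spike detection.
class SpikeDetector:
    def __init__(self, alpha: float = 0.2, spike_ratio: float = 1.5):
        self.alpha = alpha              # EMA smoothing factor
        self.spike_ratio = spike_ratio  # how far above trend counts as a spike
        self.ema = None                 # smoothed requests/sec

    def observe(self, rate: float) -> bool:
        """Feed one requests/sec sample; True means pre-warm dormant nodes."""
        if self.ema is None:
            self.ema = rate
            return False
        spike = rate > self.ema * self.spike_ratio
        self.ema = self.alpha * rate + (1 - self.alpha) * self.ema
        return spike

d = SpikeDetector()
calm = [d.observe(r) for r in [10, 11, 10, 12]]  # steady traffic: no alarm
surge = d.observe(40)                            # sudden jump over the EMA
assert not any(calm) and surge
```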

Hot-Swap Shards

Upgrade a model shard on a live node without dropping in-flight requests. The manager drains active requests, loads the new weights, and resumes — zero downtime model updates.

Zero-downtime updates
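The drain-load-resume sequence can be sketched as a small state machine. The `ShardManager` class and its method names are illustrative, not PanGalactic's actual manager API.

```python
# Minimal sketch of the hot-swap sequence: drain, load, resume.
class ShardManager:
    def __init__(self, weights: str):
        self.weights = weights
        self.accepting = True
        self.in_flight = 0

    def finish_one(self):
        self.in_flight -= 1

    def drain(self):
        self.accepting = False          # stop taking new requests
        while self.in_flight:           # let active requests complete
            self.finish_one()

    def hot_swap(self, new_weights: str):
        self.drain()
        self.weights = new_weights      # load the new shard weights
        self.accepting = True           # resume serving

m = ShardManager("llama-70b-v1/shard-2")
m.in_flight = 3                         # three requests mid-flight
m.hot_swap("llama-70b-v2/shard-2")      # none of them are dropped
assert m.weights == "llama-70b-v2/shard-2" and m.accepting
```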
Architecture

How It Works

Three processes. Two network planes. One OpenAI-compatible API.

Gateway
FastAPI · OpenAI API
gRPC
Hub
Orchestrator · Router
gRPC
Node 0
Layers 0–19 · RTX 4090
StarStream
Node 1
Layers 20–39 · RTX 3090
StarStream
Node 2
Layers 40–59 · RTX 4090
Prefill — prompt tokens embed → attention × N layers → hidden states flow node-to-node via StarStream (8 MB/hop @ ~6 ms on 10 Gbps)
Decode — one token per step, KV cache reused → only 64 KB/hop (negligible) → TOKEN_READY frames stream back to client as SSE
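The per-hop figures above follow from serialization time alone. A back-of-the-envelope check, ignoring propagation delay and protocol overhead (so real hops will be slightly slower):

```python
# Serialization time for one pipeline hop: bytes * 8 / link rate.
def hop_ms(payload_bytes: int, link_gbps: float) -> float:
    return payload_bytes * 8 / (link_gbps * 1e9) * 1e3

prefill = hop_ms(8 * 1024**2, 10)   # 8 MB of hidden states per prefill hop
decode  = hop_ms(64 * 1024, 10)     # 64 KB: one token's hidden state
assert 6 < prefill < 7              # ~6.7 ms, close to the ~6 ms figure above
assert decode < 0.1                 # ~0.05 ms: negligible, as stated
```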
Interactive

Try It Live

Connect to your PanGalactic cluster and stream tokens in real time.


Streaming via SSE · Pipeline parallelism · OpenAI-compatible
Performance

Numbers That Matter

Measured on TinyLlama 1.1B across 2 nodes with 10 Gbps interconnect.

Decode throughput
RTX 4090 × 2 nodes
~6 ms
Prefill hop latency
8 MB tensor @ 10 Gbps
~0.05 ms
Decode hop latency
64 KB tensor (1 token)
<15 s
Supernova failover
Node death → shard reassigned
85%
VRAM utilization target
15% reserved for KV cache
2×
INT8 bandwidth saving
Nebula on <5 Gbps links
Network overhead is less than 1% of total request latency during decode. Prefill overhead is typically 0.3–1.3% for 512-token prompts on Llama 70B.
Get Started

Up in 5 Minutes

No GPU required for the dev stack. Simulated VRAM, mock inference engine.

# Start the full stack (Hub + 3 nodes + Gateway + Prometheus + Grafana)
cd deploy/docker
docker compose up -d

# Download TinyLlama (~670 MB GGUF)
pangalactic model download tinyllama-1.1b

# Check cluster topology
pangalactic cluster topology

# Stream tokens
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer dev-key-not-for-production" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Non-streaming completion
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is pipeline parallelism?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

# Streaming (SSE)
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hi"}],"stream":true}'

# List available models
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY"

from openai import OpenAI

# PanGalactic is fully OpenAI-compatible
client = OpenAI(
    api_key="dev-key-not-for-production",
    base_url="http://localhost:8080/v1",
)

# Non-streaming
response = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Explain pipeline parallelism."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Create namespace
kubectl create namespace pangalactic

# Label your GPU nodes
kubectl label node <node-name> pangalactic.io/gpu=true

# Deploy Hub + Gateway + Node DaemonSet
kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic

# Apply the HTTPRoute (Envoy Gateway)
kubectl apply -f deploy/kubernetes/gateway/httproute.yaml -n pangalactic

# Check cluster health
pangalactic cluster status --hub-addr <hub-service-ip>:7100