Run 70B models on RTX cards. Together.
Pipeline-parallel LLM inference across your consumer GPU cluster over standard 10 Gbps Ethernet. OpenAI-compatible API. Self-organizing cluster.
Six novel systems not found in any existing distributed inference framework.
Per-link adaptive compression. Automatically negotiates INT8 vs. FP16 tensor transfer per link based on measured bandwidth, halving wire traffic on slower links with under 0.5% accuracy impact.
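As a rough sketch of the idea (the names and the bandwidth threshold below are illustrative, not PanGalactic's actual internals): fast links keep FP16, slow links drop to INT8, which halves bytes on the wire.

```python
# Hypothetical sketch of per-link compression negotiation.
# The 5 Gbps cutoff is an assumed tuning parameter, not the real default.
from dataclasses import dataclass


@dataclass
class LinkProfile:
    bandwidth_gbps: float  # measured, e.g. via periodic probes


def negotiate_dtype(link: LinkProfile, threshold_gbps: float = 5.0) -> str:
    """Pick the tensor wire format for a link: FP16 on fast links,
    INT8 on slow ones."""
    return "fp16" if link.bandwidth_gbps >= threshold_gbps else "int8"


def wire_bytes(num_elements: int, dtype: str) -> int:
    """Bytes needed to ship a tensor of num_elements in the chosen format."""
    return num_elements * (2 if dtype == "fp16" else 1)
```

A 10 Gbps link keeps FP16; a congested 2.5 Gbps link negotiates down to INT8 and ships half the bytes.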
Zero-config node discovery using mDNS on LAN with a Kademlia DHT fallback for multi-subnet clusters. A new node joins with a single command — no IPs, no config files.
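The core decision — mDNS for peers on the same LAN, DHT otherwise — can be sketched as follows (a simplification with assumed names; the real mDNS and Kademlia transports are out of scope here):

```python
# Illustrative discovery-fallback decision: prefer mDNS when the peer
# shares our subnet, fall back to the Kademlia DHT across subnets.
import ipaddress


def discovery_method(local_ip: str, peer_ip: str, prefix: int = 24) -> str:
    """Return which discovery plane reaches a peer (assumed /24 LAN)."""
    local_net = ipaddress.ip_network(f"{local_ip}/{prefix}", strict=False)
    return "mdns" if ipaddress.ip_address(peer_ip) in local_net else "dht"
```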
Node failure triggers automatic shard reassignment within seconds. In-flight requests checkpoint at pipeline stage boundaries and retry transparently. The client never sees an error.
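A minimal sketch of stage-boundary checkpointing (all names here are illustrative): each request records the last completed pipeline stage, so after a node dies the retry resumes from that boundary instead of restarting from stage 0.

```python
# Hedged sketch: checkpoint at pipeline stage boundaries, resume on failure.
from dataclasses import dataclass


@dataclass
class Request:
    stages_done: int = 0
    activations: bytes = b""  # checkpointed boundary activations


def run_pipeline(req, stages, fail_at=None):
    """Run remaining stages; raise to simulate a node dying mid-pipeline."""
    for i in range(req.stages_done, len(stages)):
        if i == fail_at:
            raise RuntimeError(f"node for stage {i} failed")
        req.activations = stages[i](req.activations)
        req.stages_done = i + 1  # checkpoint at the stage boundary
    return req.activations


def run_with_retry(req, stages, fail_once_at=None):
    try:
        return run_pipeline(req, stages, fail_at=fail_once_at)
    except RuntimeError:
        # shard reassigned; resume from the last checkpointed boundary
        return run_pipeline(req, stages)
```

Stages 0 and 1 are never re-run after a failure at stage 2 — only the remaining stages execute on the reassigned shard, and the client sees a completed response, not an error.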
Requests are classified on arrival — long prompts route to compute-heavy prefill pipelines, streaming requests go to low-latency decode pipelines, priority requests skip the queue entirely.
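The routing policy reduces to a small classifier on arrival. A sketch under assumed names and thresholds (the 2048-token cutoff is illustrative):

```python
# Illustrative request classifier: priority skips the queue, long prompts
# go to compute-heavy prefill pipelines, streaming goes to low-latency decode.
def classify(prompt_tokens: int, stream: bool, priority: bool,
             long_prompt_threshold: int = 2048) -> str:
    if priority:
        return "priority"            # skips the queue entirely
    if prompt_tokens >= long_prompt_threshold:
        return "prefill-heavy"       # compute-bound: long prompt ingestion
    if stream:
        return "low-latency-decode"  # latency-bound: token streaming
    return "default"
```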
Exponential moving average detects request rate spikes 30 seconds before they peak. Dormant nodes are pre-warmed automatically, eliminating cold-start latency when traffic surges arrive.
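One common way to build such a detector — shown here as an assumption-laden sketch, not PanGalactic's exact algorithm — is to compare a fast EMA of request rate against a slow one; the fast EMA pulling well ahead signals a surge building, giving the scheduler lead time to pre-warm nodes.

```python
# Illustrative dual-EMA spike detector (alphas and ratio are assumed tunables).
class SpikeDetector:
    def __init__(self, fast_alpha=0.5, slow_alpha=0.05, ratio=1.5):
        self.fast_alpha, self.slow_alpha, self.ratio = fast_alpha, slow_alpha, ratio
        self.fast_ema = self.slow_ema = None

    def observe(self, requests_per_sec: float) -> bool:
        """Feed one rate sample; return True when a surge is building."""
        if self.fast_ema is None:
            self.fast_ema = self.slow_ema = requests_per_sec
            return False
        self.fast_ema += self.fast_alpha * (requests_per_sec - self.fast_ema)
        self.slow_ema += self.slow_alpha * (requests_per_sec - self.slow_ema)
        return self.fast_ema > self.ratio * self.slow_ema
```

On a steady 10 req/s baseline the detector stays quiet; the first samples of a jump to 50 req/s trip it, well before the surge peaks.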
Upgrade a model shard on a live node without dropping in-flight requests. The manager drains active requests, loads the new weights, and resumes — zero downtime model updates.
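The drain-swap-resume cycle can be sketched like this (a simplified, single-node model with assumed names, not the manager's real API): stop admitting new work, wait for in-flight requests to finish, swap weights, resume.

```python
# Hedged sketch of a zero-downtime shard swap via drain-and-resume.
import threading


class ShardManager:
    def __init__(self, weights: str):
        self.weights = weights
        self.in_flight = 0
        self.admitting = True
        self._cv = threading.Condition()

    def start_request(self) -> bool:
        with self._cv:
            if not self.admitting:
                return False  # caller re-routes to another replica
            self.in_flight += 1
            return True

    def finish_request(self):
        with self._cv:
            self.in_flight -= 1
            self._cv.notify_all()

    def upgrade(self, new_weights: str):
        with self._cv:
            self.admitting = False        # drain: no new admissions
            while self.in_flight > 0:     # wait for active requests
                self._cv.wait()
            self.weights = new_weights    # load the new shard weights
            self.admitting = True         # resume serving
```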
Three processes. Two network planes. One OpenAI-compatible API.
Connect to your PanGalactic cluster and stream tokens in real time.
Measured on TinyLlama 1.1B across 2 nodes with 10 Gbps interconnect.
No GPU required for the dev stack. Simulated VRAM, mock inference engine.
# Start the full stack (Hub + 3 nodes + Gateway + Prometheus + Grafana)
cd deploy/docker
docker compose up -d
# Download TinyLlama (~670 MB GGUF)
pangalactic model download tinyllama-1.1b
# Check cluster topology
pangalactic cluster topology
# Stream tokens
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer dev-key-not-for-production" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
# Non-streaming completion
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyllama-1.1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is pipeline parallelism?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
# Streaming (SSE)
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hi"}],"stream":true}'
# List available models
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer $PANGALACTIC_API_KEY"
from openai import OpenAI

# PanGalactic is fully OpenAI-compatible
client = OpenAI(
    api_key="dev-key-not-for-production",
    base_url="http://localhost:8080/v1",
)

# Non-streaming
response = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Explain pipeline parallelism."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="tinyllama-1.1b",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Create namespace
kubectl create namespace pangalactic
# Label your GPU nodes
kubectl label node <node-name> pangalactic.io/gpu=true
# Deploy Hub + Gateway + Node DaemonSet
kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic
# Apply the HTTPRoute (Envoy Gateway)
kubectl apply -f deploy/kubernetes/gateway/httproute.yaml -n pangalactic
# Check cluster health
pangalactic cluster status --hub-addr <hub-service-ip>:7100