Quickstart
Stand up a local PanGalactic cluster in under 5 minutes using Docker Compose. No GPU required — the dev stack uses simulated VRAM and a mock inference engine.
Prerequisites
- Docker 24+ and Docker Compose V2
- 4 GB free disk space (for the TinyLlama model)
- Ports 8080, 7100, 7200–7205, 9090, 3000 available
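If you want to confirm the ports are free before starting, a quick check can be scripted. This is an illustrative helper, not part of the PanGalactic tooling:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.2)
        # connect_ex returns 0 when something accepts the connection,
        # i.e. the port is already in use.
        return s.connect_ex((host, port)) != 0

required = [8080, 7100, *range(7200, 7206), 9090, 3000]
busy = [p for p in required if not port_is_free(p)]
print(f"Ports in use: {busy}" if busy else "All required ports are free.")
```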
Step 1: Start the cluster
cd deploy/docker
docker compose up -d
This starts:
- 1 Hub on port 7100
- 3 CPU-mode nodes on ports 7200, 7202, 7204
- 1 Gateway on port 8080
- Prometheus on port 9090
- Grafana on port 3000
Wait ~15 seconds for all services to become healthy:
docker compose ps
All services should show a status of healthy (or running, for services without a health check).
Step 2: Download TinyLlama
pangalactic model download tinyllama-1.1b
This downloads the Q4_K_M GGUF file (~670 MB) to /data/models/ on the model-data volume.
To verify the download:
pangalactic model list
Step 3: Check cluster topology
pangalactic cluster topology
Expected output:
Pipeline: dev-pipeline-tinyllama-1.1b
Model: tinyllama-1.1b
Status: HEALTHY
Stage 0: node-0 layers 0-10 [first]
Stage 1: node-1 layers 11-21
Stage 2: node-2 layers 22-32 [last]
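The contiguous layer ranges above come from an even split of the model's layers across stages. A minimal sketch of that arithmetic, assuming 33 transformer layers and a plain even split (the real shard planner may weight stages differently):

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Split n_layers into n_stages contiguous, near-even ranges."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        # The first `extra` stages absorb the remainder, one layer each.
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

for i, r in enumerate(split_layers(33, 3)):
    print(f"Stage {i}: layers {r.start}-{r.stop - 1}")
# Stage 0: layers 0-10
# Stage 1: layers 11-21
# Stage 2: layers 22-32
```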
Step 4: Send your first request
Non-streaming:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama-1.1b",
"messages": [{"role": "user", "content": "Hello! What can you do?"}],
"max_tokens": 100
}'
Streaming (SSE):
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{
"model": "tinyllama-1.1b",
"messages": [{"role": "user", "content": "Tell me a joke."}],
"max_tokens": 150,
"stream": true
}'
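With stream: true, each SSE event carries one JSON chunk in the standard OpenAI chat.completion.chunk shape, and the stream ends with data: [DONE]. A minimal parser sketch, using a made-up sample stream:

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE 'data:' lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Why did"}}]}',
    'data: {"choices":[{"delta":{"content":" the GPU cross the road?"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))
# Why did the GPU cross the road?
```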
Step 5: Verify monitoring
Open Grafana at http://localhost:3000 (admin / pangalactic).
The default dashboard shows:
- Token generation rate (tokens/sec)
- Node VRAM usage
- Active in-flight requests
- Pipeline hop latency
Raw Prometheus metrics: http://localhost:9090
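Panels like tokens/sec are typically computed from a monotonically increasing counter using PromQL's rate(). The underlying arithmetic, in miniature (the metric name and sample values here are invented for illustration):

```python
def counter_rate(samples):
    """Per-second rate from (timestamp, counter_value) samples,
    analogous to PromQL rate() without counter-reset handling."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two scrapes of a hypothetical tokens-generated counter, 15 s apart:
samples = [(1700000000, 1200.0), (1700000015, 1620.0)]
print(f"{counter_rate(samples):.1f} tokens/sec")
# 28.0 tokens/sec
```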
Step 6: Test failover (optional)
Kill node-1 while a request is streaming:
# Terminal 1: start a streaming request
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Count from 1 to 100 slowly."}],"stream":true}' &
# Terminal 2: kill node-1 after a few tokens appear
docker compose kill node-1
Supernova failover detects the failure within 15 seconds and reassigns node-1's shards to node-0 or node-2; the streaming request then resumes.
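The detect-and-reassign behavior can be sketched as a heartbeat timeout check. This is an illustrative model, not Supernova's actual implementation: the node names and 15-second threshold come from this guide, everything else is assumed.

```python
TIMEOUT = 15.0  # seconds without a heartbeat before a node is declared dead

def find_failed(last_heartbeat: dict[str, float], now: float) -> list[str]:
    """Nodes whose last heartbeat is older than TIMEOUT."""
    return [n for n, t in last_heartbeat.items() if now - t > TIMEOUT]

def reassign(shards: dict[str, range], failed: str) -> dict[str, range]:
    """Hand the failed node's layer range to a surviving node."""
    new_shards = {n: r for n, r in shards.items() if n != failed}
    survivor = next(iter(new_shards))
    # Widen the survivor's range to absorb the failed node's layers.
    new_shards[survivor] = range(
        min(new_shards[survivor].start, shards[failed].start),
        max(new_shards[survivor].stop, shards[failed].stop),
    )
    return new_shards

heartbeats = {"node-0": 100.0, "node-1": 80.0, "node-2": 100.0}
shards = {"node-0": range(0, 11), "node-1": range(11, 22), "node-2": range(22, 33)}
for node in find_failed(heartbeats, now=100.0):
    shards = reassign(shards, node)
print(shards)  # node-1's layers absorbed by node-0
```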
Tear Down
docker compose down # Stop and remove containers
docker compose down -v # Also remove volumes (clears downloaded models)
Using an Existing Client Library
Because PanGalactic speaks the OpenAI API, existing libraries work out of the box:
Python (openai-python):
from openai import OpenAI
client = OpenAI(
api_key="dev-key-not-for-production",
base_url="http://localhost:8080/v1",
)
response = client.chat.completions.create(
model="tinyllama-1.1b",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
LangChain:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="tinyllama-1.1b",
openai_api_key="dev-key-not-for-production",
openai_api_base="http://localhost:8080/v1",
)
print(llm.invoke("Hello!"))
Next Steps
- Node Sizing — run real models on real GPUs
- Production Deployment — Kubernetes + etcd HA
- Troubleshooting — if something went wrong