Quickstart
Stand up a local PanGalactic cluster in under 5 minutes using Docker Compose. No GPU required — the dev stack uses simulated VRAM and a mock inference engine.
Prerequisites
- Docker 24+ and Docker Compose V2
- 4 GB free disk space (for the TinyLlama model)
- Ports 8080, 7100, 7200–7205, 9090, 3000 available
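If you want to confirm the ports are free before starting, a quick check can be scripted. This is an illustrative helper, not part of the PanGalactic tooling:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is listening on the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.2)
        # connect_ex returns 0 when something accepts the connection,
        # i.e. the port is already in use.
        return s.connect_ex((host, port)) != 0

required = [8080, 7100, *range(7200, 7206), 9090, 3000]
busy = [p for p in required if not port_is_free(p)]
print(f"Ports in use: {busy}" if busy else "All required ports are free.")
```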
Step 1: Start the cluster
cd deploy/docker
docker compose up -d
This starts:
- 1 Hub on port 7100
- 3 CPU-mode nodes on ports 7200, 7202, 7204
- 1 Gateway on port 8080
- Prometheus on port 9090
- Grafana on port 3000
Wait ~15 seconds for all services to become healthy:
docker compose ps
All services should show a status of healthy (or running, for services without a health check).
Step 2: Download TinyLlama
pangalactic model download tinyllama-1.1b
This downloads the Q4_K_M GGUF file (~670 MB) to /data/models/ on the model-data volume.
To verify the download:
pangalactic model list
Step 3: Check cluster topology
pangalactic cluster topology
Expected output:
Pipeline: dev-pipeline-tinyllama-1.1b
Model: tinyllama-1.1b
Status: HEALTHY
Stage 0: node-0 layers 0-10 [first]
Stage 1: node-1 layers 11-21
Stage 2: node-2 layers 22-32 [last]
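The contiguous layer ranges above come from an even split of the model's layers across stages. A minimal sketch of that arithmetic, assuming 33 transformer layers and a plain even split (the real shard planner may weight stages differently):

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Split n_layers into n_stages contiguous, near-even ranges."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        # The first `extra` stages absorb the remainder, one layer each.
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

for i, r in enumerate(split_layers(33, 3)):
    print(f"Stage {i}: layers {r.start}-{r.stop - 1}")
# Stage 0: layers 0-10
# Stage 1: layers 11-21
# Stage 2: layers 22-32
```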
Step 4: Send your first request
Non-streaming:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama-1.1b",
"messages": [{"role": "user", "content": "Hello! What can you do?"}],
"max_tokens": 100
}'
Streaming (SSE):
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{
"model": "tinyllama-1.1b",
"messages": [{"role": "user", "content": "Tell me a joke."}],
"max_tokens": 150,
"stream": true
}'
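With stream: true, each SSE event carries one JSON chunk in the standard OpenAI chat.completion.chunk shape, and the stream ends with data: [DONE]. A minimal parser sketch, using a made-up sample stream:

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE 'data:' lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Why did"}}]}',
    'data: {"choices":[{"delta":{"content":" the GPU cross the road?"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))
# Why did the GPU cross the road?
```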
Step 5: Verify monitoring
Open Grafana at http://localhost:3000 (admin / pangalactic).
The default dashboard shows:
- Token generation rate (tokens/sec)
- Node VRAM usage
- Active in-flight requests
- Pipeline hop latency
Raw Prometheus metrics: http://localhost:9090
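Panels like tokens/sec are typically computed from a monotonically increasing counter using PromQL's rate(). The underlying arithmetic, in miniature (the metric name and sample values here are invented for illustration):

```python
def counter_rate(samples):
    """Per-second rate from (timestamp, counter_value) samples,
    analogous to PromQL rate() without counter-reset handling."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two scrapes of a hypothetical tokens-generated counter, 15 s apart:
samples = [(1700000000, 1200.0), (1700000015, 1620.0)]
print(f"{counter_rate(samples):.1f} tokens/sec")
# 28.0 tokens/sec
```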
Step 6: Test failover (optional)
Kill node-1 while a request is streaming:
# Terminal 1: start a streaming request
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer dev-key-not-for-production" \
-H "Content-Type: application/json" \
--no-buffer \
-d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Count from 1 to 100 slowly."}],"stream":true}' &
# Terminal 2: kill node-1 after a few tokens appear
docker compose kill node-1
Supernova failover detects the failure within 15 seconds and reassigns node-1's shards to node-0 or node-2; the streaming request then resumes.
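The detect-and-reassign behavior can be sketched as a heartbeat timeout check. This is an illustrative model, not Supernova's actual implementation: the node names and 15-second threshold come from this guide, everything else is assumed.

```python
TIMEOUT = 15.0  # seconds without a heartbeat before a node is declared dead

def find_failed(last_heartbeat: dict[str, float], now: float) -> list[str]:
    """Nodes whose last heartbeat is older than TIMEOUT."""
    return [n for n, t in last_heartbeat.items() if now - t > TIMEOUT]

def reassign(shards: dict[str, range], failed: str) -> dict[str, range]:
    """Hand the failed node's layer range to a surviving node."""
    new_shards = {n: r for n, r in shards.items() if n != failed}
    survivor = next(iter(new_shards))
    # Widen the survivor's range to absorb the failed node's layers.
    new_shards[survivor] = range(
        min(new_shards[survivor].start, shards[failed].start),
        max(new_shards[survivor].stop, shards[failed].stop),
    )
    return new_shards

heartbeats = {"node-0": 100.0, "node-1": 80.0, "node-2": 100.0}
shards = {"node-0": range(0, 11), "node-1": range(11, 22), "node-2": range(22, 33)}
for node in find_failed(heartbeats, now=100.0):
    shards = reassign(shards, node)
print(shards)  # node-1's layers absorbed by node-0
```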
Tear Down
docker compose down # Stop and remove containers
docker compose down -v # Also remove volumes (clears downloaded models)
Using an Existing Client Library
Because PanGalactic speaks the OpenAI API, existing libraries work out of the box:
Python (openai-python):
from openai import OpenAI
client = OpenAI(
api_key="dev-key-not-for-production",
base_url="http://localhost:8080/v1",
)
response = client.chat.completions.create(
model="tinyllama-1.1b",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
LangChain:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="tinyllama-1.1b",
openai_api_key="dev-key-not-for-production",
openai_api_base="http://localhost:8080/v1",
)
print(llm.invoke("Hello!"))
Next Steps
- Node Sizing — run real models on real GPUs
- Production Deployment — Kubernetes + etcd HA
- Troubleshooting — if something went wrong