# Production Deployment
This guide covers deploying PanGalactic to a Kubernetes cluster with a highly available Hub backed by etcd.
## Architecture Overview

```text
            [ Load Balancer ]
                   │
       ┌───────────┴───────────┐
  Gateway Pod            Gateway Pod          (replicas: 2+)
       └───────────┬───────────┘
                   │ gRPC
              [ Hub Pod ]                     (replicas: 1, etcd-backed)
                   │ gRPC
    ┌──────────────┼──────────────┐
Node DaemonSet  Node DaemonSet  Node DaemonSet   (one per GPU node)
                   │
          [ etcd StatefulSet ]                (3 replicas for HA)
```
## Prerequisites

- Kubernetes 1.28+
- `kubectl` configured for your cluster
- NVIDIA GPU operator installed (for CUDA nodes)
- Persistent storage class available (for Hub state)
## Step 1: Create the Namespace

```shell
kubectl create namespace pangalactic
```
## Step 2: Deploy etcd

For production, the Hub uses etcd instead of SQLite. The etcd StatefulSet provides consensus and survives Hub Pod restarts.
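The manifest referenced below boils down to something like this sketch (the image version, probes, and storage settings are placeholders; only the names and the replica count come from this guide):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd        # headless service => etcd-0.etcd, etcd-1.etcd, etcd-2.etcd
  replicas: 3              # three members keep quorum through a single failure
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.9   # placeholder version
          ports:
            - containerPort: 2379   # client API
            - containerPort: 2380   # peer traffic
```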
```shell
kubectl apply -f deploy/kubernetes/etcd/statefulset.yaml -n pangalactic
kubectl rollout status statefulset/etcd -n pangalactic
```

Verify etcd is healthy:

```shell
kubectl exec -it etcd-0 -n pangalactic -- etcdctl endpoint health
```
## Step 3: Configure Secrets

Create the Gateway API key secret:

```shell
kubectl create secret generic pangalactic-gateway-secret \
  --from-literal=api-key="$(openssl rand -hex 32)" \
  -n pangalactic
```

Store the key securely; you will need it for all client requests.
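The key produced by `openssl rand -hex 32` is 32 random bytes hex-encoded, i.e. a 64-character lowercase hex string; a quick sanity check of its shape:

```shell
# Generate a key the same way the secret command above does
API_KEY="$(openssl rand -hex 32)"

# 32 bytes hex-encoded -> 64 characters
echo "${#API_KEY}"
# 64
```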
## Step 4: Deploy the Hub

```shell
kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-hub -n pangalactic
```

Set the Hub to use etcd:

```shell
kubectl set env deployment/pangalactic-hub \
  HUB_STATE_BACKEND=etcd \
  HUB_ETCD_ENDPOINTS=http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379 \
  -n pangalactic
```
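If you run etcd with a different replica count, the endpoint list can be generated instead of typed out, assuming the headless-service DNS pattern `etcd-<i>.etcd` used throughout this guide:

```shell
REPLICAS=3
ENDPOINTS=""
for i in $(seq 0 $((REPLICAS - 1))); do
  # Comma-separate every entry after the first
  ENDPOINTS="${ENDPOINTS:+$ENDPOINTS,}http://etcd-$i.etcd:2379"
done
echo "$ENDPOINTS"
# http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379
```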
## Step 5: Label GPU Nodes

For each node that has a GPU:

```shell
kubectl label node <node-name> pangalactic.io/gpu=true
```

CUDA nodes need nothing beyond this label (the default Node image requires CUDA); the NVIDIA GPU operator handles `nvidia.com/gpu` resource allocation automatically.

For ROCm nodes, use the ROCm image by editing the DaemonSet:

```shell
kubectl patch daemonset pangalactic-node -n pangalactic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"node","image":"pangalactic/node-rocm:latest"}]}}}}'
```
## Step 6: Deploy the Node DaemonSet

```shell
kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic
```

The DaemonSet will schedule one Node Pod per labelled GPU node. Verify:

```shell
kubectl get pods -n pangalactic -l app=pangalactic-node
```

Each pod should reach Running status. Initial startup may take 30–60 seconds as the node registers with the Hub and awaits shard assignment.
## Step 7: Deploy the Gateway

```shell
kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic
```

The Gateway Deployment includes an HPA that scales replicas 2–10 based on CPU utilization (70% target).
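From the numbers above, the HPA looks roughly like this (a sketch reconstructed from this guide, not the exact manifest shipped in `deploy/kubernetes/gateway/`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pangalactic-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pangalactic-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```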
## Step 8: Deploy Monitoring

```shell
kubectl apply -f deploy/kubernetes/monitoring/prometheus.yaml -n pangalactic
```

For Grafana, install the Grafana Helm chart pointed at the Prometheus service:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace pangalactic \
  --set adminPassword=changeme \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus:9090 \
  --set datasources."datasources\.yaml".datasources[0].isDefault=true
```
## Step 9: Load a Model

Get the Gateway service external IP:

```shell
kubectl get service pangalactic-gateway -n pangalactic
# NAME                  TYPE          EXTERNAL-IP    PORT(S)
# pangalactic-gateway   LoadBalancer  203.0.113.42   80:32000/TCP
```

Download and load a model (run from a node with volume access, or use a Job):

```shell
export GATEWAY_URL=http://203.0.113.42
export API_KEY=<key from Step 3>
pangalactic model download tinyllama-1.1b --output-dir /data/models
pangalactic cluster topology --hub-addr <hub-cluster-ip>:7100
```
## Verification

```shell
curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hello"}]}'
```

Check readiness:

```shell
curl http://<GATEWAY_IP>/readyz
# {"ready": true}
```
## Upgrading

Rolling update of the Gateway (zero-downtime):

```shell
kubectl set image deployment/pangalactic-gateway gateway=pangalactic/gateway:v0.2.0 -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic
```

Upgrading a Node image causes the DaemonSet to roll out one node at a time. In-flight requests on each node drain before the pod is terminated (via `terminationGracePeriodSeconds: 120`).
The Hub upgrade requires brief downtime (single replica). Schedule it during a low-traffic window and ensure etcd has a recent snapshot.
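A snapshot can be taken with `etcdctl` before the upgrade; the pod name matches the StatefulSet deployed in Step 2, while the paths here are assumptions:

```shell
# Write a snapshot inside the etcd-0 pod, then copy it off the cluster
kubectl exec etcd-0 -n pangalactic -- \
  etcdctl snapshot save /var/lib/etcd/pre-upgrade.db
kubectl cp pangalactic/etcd-0:/var/lib/etcd/pre-upgrade.db ./pre-upgrade.db
```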
## Configuration Reference

All configuration is via environment variables. Key values:

| Variable | Component | Default | Description |
|---|---|---|---|
| `HUB_BIND_ADDR` | Hub | `0.0.0.0:7100` | gRPC listen address |
| `HUB_STATE_BACKEND` | Hub | `sqlite` | `sqlite` or `etcd` |
| `HUB_ETCD_ENDPOINTS` | Hub | `http://localhost:2379` | Comma-separated etcd endpoints |
| `HUB_HEARTBEAT_TIMEOUT_S` | Hub | `15` | Seconds before node marked FAILED |
| `NODE_HUB_ADDR` | Node | `localhost:7100` | Hub gRPC address |
| `NODE_GPU_COMPUTE_TYPE` | Node | `cuda` | `cuda`, `rocm`, `metal`, `cpu` |
| `NODE_SIMULATED_VRAM_GB` | Node | `0` | Non-zero enables simulated GPU (dev) |
| `NODE_INFERENCE_BACKEND` | Node | `transformers` | `transformers`, `mock` |
| `GATEWAY_BIND_ADDR` | Gateway | `0.0.0.0:8080` | HTTP listen address |
| `GATEWAY_HUB_ADDR` | Gateway | `localhost:7100` | Hub gRPC address |
| `GATEWAY_API_KEY` | Gateway | (empty; no auth) | Bearer token for API authentication |
| `GATEWAY_RATE_LIMIT_RPM` | Gateway | `0` (disabled) | Requests per minute per key |
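As a worked example of the Node variables in the table, a local development configuration with a simulated GPU and the mock backend (the specific values are illustrative, not recommendations):

```shell
# Dev-only settings: CPU compute, simulated VRAM, mock inference backend
export NODE_HUB_ADDR=localhost:7100
export NODE_GPU_COMPUTE_TYPE=cpu
export NODE_SIMULATED_VRAM_GB=24
export NODE_INFERENCE_BACKEND=mock

env | grep '^NODE_' | sort
```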
## See Also
- Node Sizing — GPU requirements and labelling
- Troubleshooting — common deployment issues
- Quickstart — local dev stack