Production Deployment

This guide covers deploying PanGalactic to a Kubernetes cluster with a highly available Hub backed by etcd.


Architecture Overview

                    [ Load Balancer ]
                          |
              ┌───────────┴───────────┐
       Gateway Pod               Gateway Pod    (replicas: 2+)
              └───────────┬───────────┘
                          │ gRPC
                    [ Hub Pod ] ───── [ etcd StatefulSet ]
                  (replicas: 1)       (3 replicas for HA)
                          │ gRPC
           ┌──────────────┼──────────────┐
      Node DaemonSet  Node DaemonSet  Node DaemonSet   (one per GPU node)

Prerequisites

  • Kubernetes 1.28+
  • kubectl configured for your cluster
  • NVIDIA GPU operator installed (for CUDA nodes)
  • Persistent storage class available (for Hub state)

Step 1: Create the Namespace

kubectl create namespace pangalactic

Step 2: Deploy etcd

For production, the Hub uses etcd instead of SQLite. The etcd StatefulSet provides consensus and survives Hub Pod restarts.

kubectl apply -f deploy/kubernetes/etcd/statefulset.yaml -n pangalactic
kubectl rollout status statefulset/etcd -n pangalactic
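The endpoint addresses used later (etcd-0.etcd:2379 and so on) rely on a headless Service named etcd that gives each StatefulSet pod a stable DNS name. The actual manifest lives at deploy/kubernetes/etcd/statefulset.yaml; a sketch of the Service half it typically pairs with:

```yaml
# Sketch only: a headless Service named "etcd" so each pod gets a
# stable DNS record (etcd-0.etcd, etcd-1.etcd, etcd-2.etcd).
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  clusterIP: None        # headless: per-pod DNS, no virtual IP
  selector:
    app: etcd
  ports:
    - name: client
      port: 2379
    - name: peer
      port: 2380
```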

Verify etcd is healthy:

kubectl exec -it etcd-0 -n pangalactic -- etcdctl endpoint health

Step 3: Configure Secrets

Create the Gateway API key secret:

kubectl create secret generic pangalactic-gateway-secret \
  --from-literal=api-key="$(openssl rand -hex 32)" \
  -n pangalactic

Store the key securely — you will need it for all client requests.
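The Gateway reads the key from GATEWAY_API_KEY (see the Configuration Reference below). A sketch of wiring the secret into the Gateway Deployment's container spec, assuming the container is named gateway:

```yaml
# Sketch: inject the API key from the secret created above.
env:
  - name: GATEWAY_API_KEY
    valueFrom:
      secretKeyRef:
        name: pangalactic-gateway-secret
        key: api-key
```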


Step 4: Deploy the Hub

kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-hub -n pangalactic

Set the Hub to use etcd:

kubectl set env deployment/pangalactic-hub \
  HUB_STATE_BACKEND=etcd \
  HUB_ETCD_ENDPOINTS=http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379 \
  -n pangalactic
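`kubectl set env` patches the live Deployment. To keep the change in source control instead, the equivalent env block can be added to deploy/kubernetes/hub/deployment.yaml (a sketch, assuming the container reads these variables as documented in the Configuration Reference):

```yaml
env:
  - name: HUB_STATE_BACKEND
    value: etcd
  - name: HUB_ETCD_ENDPOINTS
    value: http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379
```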

Step 5: Label GPU Nodes

For each node that has a GPU:

kubectl label node <node-name> pangalactic.io/gpu=true

The default Node image requires CUDA. No additional labelling is needed for CUDA nodes: the NVIDIA GPU operator handles nvidia.com/gpu resource allocation automatically.
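The Node DaemonSet (Step 6) consumes this label via a nodeSelector. A sketch of the relevant pod-spec fields, assuming the image name pangalactic/node and a single GPU per pod (the repo manifest may also add tolerations for GPU taints):

```yaml
# Sketch: schedule Node pods only on labelled GPU nodes
# and request one GPU from the NVIDIA GPU operator.
nodeSelector:
  pangalactic.io/gpu: "true"
containers:
  - name: node
    image: pangalactic/node:latest   # assumed image name
    resources:
      limits:
        nvidia.com/gpu: 1
```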

For ROCm nodes, use the ROCm image by editing the DaemonSet:

kubectl patch daemonset pangalactic-node -n pangalactic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"node","image":"pangalactic/node-rocm:latest"}]}}}}'

Step 6: Deploy the Node DaemonSet

kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic

The DaemonSet will schedule one Node Pod per labelled GPU node. Verify:

kubectl get pods -n pangalactic -l app=pangalactic-node

Each pod should reach Running status. Initial startup may take 30–60 seconds as the node registers with the Hub and awaits shard assignment.


Step 7: Deploy the Gateway

kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic

The Gateway Deployment includes an HPA that scales replicas 2–10 based on CPU utilization (70% target).
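That HPA corresponds roughly to the following manifest (a sketch; the actual resource ships with the Gateway manifests in deploy/kubernetes/gateway/):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pangalactic-gateway
  namespace: pangalactic
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pangalactic-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```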


Step 8: Deploy Monitoring

kubectl apply -f deploy/kubernetes/monitoring/prometheus.yaml -n pangalactic

For Grafana, install the Grafana Helm chart pointed at the Prometheus service:

helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace pangalactic \
  --set adminPassword=changeme \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus:9090 \
  --set datasources."datasources\.yaml".datasources[0].isDefault=true

Step 9: Load a Model

Get the Gateway service external IP:

kubectl get service pangalactic-gateway -n pangalactic
# NAME                   TYPE           EXTERNAL-IP   PORT(S)
# pangalactic-gateway    LoadBalancer   203.0.113.42  80:32000/TCP

Download a model, then confirm the cluster topology (run from a node with volume access, or use a Job):

export GATEWAY_URL=http://203.0.113.42
export API_KEY=<key from Step 3>

pangalactic model download tinyllama-1.1b --output-dir /data/models
pangalactic cluster topology --hub-addr <hub-cluster-ip>:7100
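A sketch of a one-shot Job for the download, assuming a models PVC named pangalactic-models and a CLI image named pangalactic/cli (both hypothetical names; substitute your own):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: pangalactic
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: download
          image: pangalactic/cli:latest    # assumed image name
          command: ["pangalactic", "model", "download", "tinyllama-1.1b",
                    "--output-dir", "/data/models"]
          volumeMounts:
            - name: models
              mountPath: /data/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: pangalactic-models  # assumed PVC name
```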

Verification

curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hello"}]}'

Check readiness:

curl http://<GATEWAY_IP>/readyz
# {"ready": true}
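The same /readyz endpoint can back the Gateway's readiness probe, so the LoadBalancer only routes to pods that can reach the Hub. A sketch, assuming the container port 8080 from the default GATEWAY_BIND_ADDR:

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```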

Upgrading

Rolling update of the Gateway (zero-downtime):

kubectl set image deployment/pangalactic-gateway gateway=pangalactic/gateway:v0.2.0 -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic

Upgrading a Node image causes the DaemonSet to roll out one node at a time. In-flight requests on each node drain before the pod is terminated (uses terminationGracePeriodSeconds: 120).
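The drain behaviour depends on the pod spec allowing time for in-flight requests. A sketch of the relevant DaemonSet template fields:

```yaml
# Sketch: on SIGTERM the Node stops taking new work and finishes
# in-flight requests; Kubernetes waits up to 120s before SIGKILL.
spec:
  terminationGracePeriodSeconds: 120
```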

The Hub upgrade requires a brief downtime (single replica). Schedule during low-traffic windows and ensure etcd has a recent snapshot.
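One way to keep snapshots current is a scheduled job that runs etcdctl snapshot save against the cluster. A sketch, assuming any image that ships etcdctl; replace the emptyDir with a PVC so snapshots survive the pod:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: pangalactic
spec:
  schedule: "0 3 * * *"          # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: bitnami/etcd:latest   # any image with etcdctl
              command: ["etcdctl", "--endpoints=http://etcd-0.etcd:2379",
                        "snapshot", "save", "/snapshots/etcd.db"]
              volumeMounts:
                - name: snapshots
                  mountPath: /snapshots
          volumes:
            - name: snapshots
              emptyDir: {}       # replace with a PVC in production
```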


Configuration Reference

All configuration is via environment variables. Key values:

Variable                  Component  Default                Description
HUB_BIND_ADDR             Hub        0.0.0.0:7100           gRPC listen address
HUB_STATE_BACKEND         Hub        sqlite                 sqlite or etcd
HUB_ETCD_ENDPOINTS        Hub        http://localhost:2379  Comma-separated etcd endpoints
HUB_HEARTBEAT_TIMEOUT_S   Hub        15                     Seconds before a node is marked FAILED
NODE_HUB_ADDR             Node       localhost:7100         Hub gRPC address
NODE_GPU_COMPUTE_TYPE     Node       cuda                   cuda, rocm, metal, or cpu
NODE_SIMULATED_VRAM_GB    Node       0                      Non-zero enables a simulated GPU (dev)
NODE_INFERENCE_BACKEND    Node       transformers           transformers or mock
GATEWAY_BIND_ADDR         Gateway    0.0.0.0:8080           HTTP listen address
GATEWAY_HUB_ADDR          Gateway    localhost:7100         Hub gRPC address
GATEWAY_API_KEY           Gateway    (empty; no auth)       Bearer token for API authentication
GATEWAY_RATE_LIMIT_RPM    Gateway    0 (disabled)           Requests per minute per key

See Also