# Production Deployment
This guide covers deploying PanGalactic to a Kubernetes cluster with a highly available Hub backed by etcd.
## Architecture Overview

```text
            [ Load Balancer ]
                   │
       ┌───────────┴───────────┐
  Gateway Pod            Gateway Pod          (replicas: 2+)
       └───────────┬───────────┘
                   │ gRPC
              [ Hub Pod ]                     (replicas: 1, etcd-backed)
                   │ gRPC
    ┌──────────────┼──────────────┐
Node DaemonSet  Node DaemonSet  Node DaemonSet   (one per GPU node)
                   │
          [ etcd StatefulSet ]                (3 replicas for HA)
```
## Prerequisites

- Kubernetes 1.28+
- `kubectl` configured for your cluster
- NVIDIA GPU operator installed (for CUDA nodes)
- Persistent storage class available (for Hub state)
## Step 1: Create the Namespace

```shell
kubectl create namespace pangalactic
```
## Step 2: Deploy etcd

For production, the Hub uses etcd instead of SQLite. The etcd StatefulSet provides consensus and survives Hub Pod restarts.
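The manifest referenced below boils down to something like this sketch (the image version, probes, and storage settings are placeholders; only the names and the replica count come from this guide):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd        # headless service => etcd-0.etcd, etcd-1.etcd, etcd-2.etcd
  replicas: 3              # three members keep quorum through a single failure
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.9   # placeholder version
          ports:
            - containerPort: 2379   # client API
            - containerPort: 2380   # peer traffic
```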
```shell
kubectl apply -f deploy/kubernetes/etcd/statefulset.yaml -n pangalactic
kubectl rollout status statefulset/etcd -n pangalactic
```

Verify etcd is healthy:

```shell
kubectl exec -it etcd-0 -n pangalactic -- etcdctl endpoint health
```
## Step 3: Configure Secrets

Create the Gateway API key secret:

```shell
kubectl create secret generic pangalactic-gateway-secret \
  --from-literal=api-key="$(openssl rand -hex 32)" \
  -n pangalactic
```

Store the key securely; you will need it for all client requests.
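The key produced by `openssl rand -hex 32` is 32 random bytes hex-encoded, i.e. a 64-character lowercase hex string; a quick sanity check of its shape:

```shell
# Generate a key the same way the secret command above does
API_KEY="$(openssl rand -hex 32)"

# 32 bytes hex-encoded -> 64 characters
echo "${#API_KEY}"
# 64
```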
## Step 4: Deploy the Hub

```shell
kubectl apply -f deploy/kubernetes/hub/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-hub -n pangalactic
```

Set the Hub to use etcd:

```shell
kubectl set env deployment/pangalactic-hub \
  HUB_STATE_BACKEND=etcd \
  HUB_ETCD_ENDPOINTS=http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379 \
  -n pangalactic
```
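If you run etcd with a different replica count, the endpoint list can be generated instead of typed out, assuming the headless-service DNS pattern `etcd-<i>.etcd` used throughout this guide:

```shell
REPLICAS=3
ENDPOINTS=""
for i in $(seq 0 $((REPLICAS - 1))); do
  # Comma-separate every entry after the first
  ENDPOINTS="${ENDPOINTS:+$ENDPOINTS,}http://etcd-$i.etcd:2379"
done
echo "$ENDPOINTS"
# http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379
```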
## Step 5: Label GPU Nodes

For each node that has a GPU:

```shell
kubectl label node <node-name> pangalactic.io/gpu=true
```

CUDA nodes need nothing beyond this label (the default Node image requires CUDA); the NVIDIA GPU operator handles `nvidia.com/gpu` resource allocation automatically.

For ROCm nodes, use the ROCm image by editing the DaemonSet:

```shell
kubectl patch daemonset pangalactic-node -n pangalactic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"node","image":"pangalactic/node-rocm:latest"}]}}}}'
```
## Step 6: Deploy the Node DaemonSet

```shell
kubectl apply -f deploy/kubernetes/node/daemonset.yaml -n pangalactic
```

The DaemonSet will schedule one Node Pod per labelled GPU node. Verify:

```shell
kubectl get pods -n pangalactic -l app=pangalactic-node
```

Each pod should reach Running status. Initial startup may take 30–60 seconds as the node registers with the Hub and awaits shard assignment.
## Step 7: Deploy the Gateway

```shell
kubectl apply -f deploy/kubernetes/gateway/deployment.yaml -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic
```

The Gateway Deployment includes an HPA that scales replicas 2–10 based on CPU utilization (70% target).
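From the numbers above, the HPA looks roughly like this (a sketch reconstructed from this guide, not the exact manifest shipped in `deploy/kubernetes/gateway/`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pangalactic-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pangalactic-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```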
## Step 8: Deploy Monitoring

```shell
kubectl apply -f deploy/kubernetes/monitoring/prometheus.yaml -n pangalactic
```

For Grafana, install the Grafana Helm chart pointed at the Prometheus service:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace pangalactic \
  --set adminPassword=changeme \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus:9090 \
  --set datasources."datasources\.yaml".datasources[0].isDefault=true
```
## Step 9: Load a Model

Get the Gateway service external IP:

```shell
kubectl get service pangalactic-gateway -n pangalactic
# NAME                  TYPE          EXTERNAL-IP    PORT(S)
# pangalactic-gateway   LoadBalancer  203.0.113.42   80:32000/TCP
```

Download and load a model (run from a node with volume access, or use a Job):

```shell
export GATEWAY_URL=http://203.0.113.42
export API_KEY=<key from Step 3>
pangalactic model download tinyllama-1.1b --output-dir /data/models
pangalactic cluster topology --hub-addr <hub-cluster-ip>:7100
```
## Verification

```shell
curl http://<GATEWAY_IP>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tinyllama-1.1b","messages":[{"role":"user","content":"Hello"}]}'
```

Check readiness:

```shell
curl http://<GATEWAY_IP>/readyz
# {"ready": true}
```
## Upgrading

Rolling update of the Gateway (zero-downtime):

```shell
kubectl set image deployment/pangalactic-gateway gateway=pangalactic/gateway:v0.2.0 -n pangalactic
kubectl rollout status deployment/pangalactic-gateway -n pangalactic
```

Upgrading a Node image causes the DaemonSet to roll out one node at a time. In-flight requests on each node drain before the pod is terminated (via `terminationGracePeriodSeconds: 120`).
The Hub upgrade requires brief downtime (single replica). Schedule it during a low-traffic window and ensure etcd has a recent snapshot.
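A snapshot can be taken with `etcdctl` before the upgrade; the pod name matches the StatefulSet deployed in Step 2, while the paths here are assumptions:

```shell
# Write a snapshot inside the etcd-0 pod, then copy it off the cluster
kubectl exec etcd-0 -n pangalactic -- \
  etcdctl snapshot save /var/lib/etcd/pre-upgrade.db
kubectl cp pangalactic/etcd-0:/var/lib/etcd/pre-upgrade.db ./pre-upgrade.db
```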
## Configuration Reference

All configuration is via environment variables. Key values:

| Variable | Component | Default | Description |
|---|---|---|---|
| `HUB_BIND_ADDR` | Hub | `0.0.0.0:7100` | gRPC listen address |
| `HUB_STATE_BACKEND` | Hub | `sqlite` | `sqlite` or `etcd` |
| `HUB_ETCD_ENDPOINTS` | Hub | `http://localhost:2379` | Comma-separated etcd endpoints |
| `HUB_HEARTBEAT_TIMEOUT_S` | Hub | `15` | Seconds before node marked FAILED |
| `NODE_HUB_ADDR` | Node | `localhost:7100` | Hub gRPC address |
| `NODE_GPU_COMPUTE_TYPE` | Node | `cuda` | `cuda`, `rocm`, `metal`, `cpu` |
| `NODE_SIMULATED_VRAM_GB` | Node | `0` | Non-zero enables simulated GPU (dev) |
| `NODE_INFERENCE_BACKEND` | Node | `transformers` | `transformers`, `mock` |
| `GATEWAY_BIND_ADDR` | Gateway | `0.0.0.0:8080` | HTTP listen address |
| `GATEWAY_HUB_ADDR` | Gateway | `localhost:7100` | Hub gRPC address |
| `GATEWAY_API_KEY` | Gateway | (empty; no auth) | Bearer token for API authentication |
| `GATEWAY_RATE_LIMIT_RPM` | Gateway | `0` (disabled) | Requests per minute per key |
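As a worked example of the Node variables in the table, a local development configuration with a simulated GPU and the mock backend (the specific values are illustrative, not recommendations):

```shell
# Dev-only settings: CPU compute, simulated VRAM, mock inference backend
export NODE_HUB_ADDR=localhost:7100
export NODE_GPU_COMPUTE_TYPE=cpu
export NODE_SIMULATED_VRAM_GB=24
export NODE_INFERENCE_BACKEND=mock

env | grep '^NODE_' | sort
```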
## See Also
- Node Sizing — GPU requirements and labelling
- Troubleshooting — common deployment issues
- Quickstart — local dev stack