---
title: "Cubie Day-2 Operations Runbook"
subtitle: "SLA, support, incident playbook, telemetry, upgrades, DR"
date: "2026-05-30"
---

# Cubie Day-2 Operations Runbook

**Purpose.** This runbook covers what an ops team needs to run Cubie in production from year 1 onward. It is drafted with WWT managed-services teaming in mind: Centillion engineering owns the product; WWT can own customer-facing managed-services delivery, using this runbook as the playbook.

---

## Service-level objectives

| Component | SLO | Measurement |
|---|---|---|
| Admit-gate availability | 99.95% | per-instance uptime, monthly window |
| Admit-decision p99 latency | < 1 ms | Prometheus histogram |
| Dual-ledger write availability | 99.95% | PostgreSQL + OpenBao HA |
| Telemetry pipeline freshness | < 30 sec lag | scrape timestamp − exporter timestamp |
| Critical-bug response | < 30 min (Platinum); < 1 hour (Gold); < 4 hours (Silver); < 1 business day (Bronze) | incident-create to engineer-acknowledged |
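
To make the availability targets concrete, the sketch below converts an SLO percentage into the downtime it allows over a measurement window. The function name `downtime_budget_minutes` and the 30-day window are illustrative assumptions; the table above specifies only "monthly window".

```bash
# Downtime budget implied by an availability SLO over a window.
# Usage: downtime_budget_minutes <slo_percent> <window_days>
downtime_budget_minutes() {
    awk -v slo="$1" -v days="$2" 'BEGIN {
        # minutes in the window, times the fraction of time allowed to be down
        printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100
    }'
}

downtime_budget_minutes 99.95 30   # prints 21.6 (minutes per 30-day month)
```

At 99.95%, a single admit-gate instance can be unavailable for roughly 21.6 minutes in a 30-day month before the SLO is missed.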

## Support tier matrix

| Tier | Availability | Response | Annual price class |
|---|---|---|---|
| Bronze | Business hours US-Eastern | < 1 business day | Included with license |
| Silver | 24/5 (Mon-Fri global) | < 4 hours critical | $15K-$30K |
| Gold | 24/7 + named engineer | < 1 hour critical | $40K-$80K |
| Platinum (gov/regulated) | 24/7 + on-call + quarterly on-site | < 30 min critical | $100K+ |

---

## Incident playbook

### Severity classification

| Sev | Trigger | Initial response |
|---|---|---|
| **Sev 1 (critical)** | Cubie admit-gate down on production cluster; customer reports denials cascading; data-corruption suspected | All hands; named engineer on call within the SLA window. Customer may page the on-call engineer directly. |
| **Sev 2 (major)** | Cubie active but degraded (latency >5 ms p99, or denial rate >2× baseline); shadow-mode-only fallback active | Engineer assigned within 2× SLA; daily standups until resolution. |
| **Sev 3 (minor)** | Single denial-class anomaly; telemetry gap; non-functional issue | Ticketed; addressed in next sprint. |
| **Sev 4 (info)** | Feature request, documentation gap, edge-case question | Backlog. |
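
The two numeric Sev 2 triggers in the table can be checked mechanically. The following is a minimal sketch, not part of the Cubie tooling: the function name `classify_degradation` and its argument order are hypothetical, and it covers only the latency and denial-rate thresholds (Sev 1 conditions still need human judgment).

```bash
# Returns "sev2" if either numeric Sev 2 trigger from the table is met, else "ok".
# Usage: classify_degradation <p99_latency_ms> <denial_rate> <baseline_denial_rate>
classify_degradation() {
    awk -v p99="$1" -v rate="$2" -v base="$3" 'BEGIN {
        # Sev 2: p99 latency above 5 ms, or denial rate above 2x baseline
        if (p99 > 5 || rate > 2 * base) print "sev2"; else print "ok"
    }'
}

# Example: p99 is healthy, but denials are running at 3x baseline.
classify_degradation 0.8 0.03 0.01
```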

### Sev 1 playbook (sample)

```
T+0       Page received from customer; on-call engineer acks within SLA
T+5 min   Engineer joins Slack/Teams bridge; reviews customer-visible symptoms
T+10 min  Pull last 5 minutes of:
            - cubie-admit-gate logs
            - dual-ledger write success rate
            - DCGM telemetry on the protected cluster
            - Prometheus histograms (latency, decisions)
T+15 min  Triage: is the issue (a) admit-gate code, (b) gateway integration, (c) cluster fabric, (d) customer config?
T+30 min  If admit-gate code:
            - failover to standby admit-gate instance (if HA pair)
            - OR engage bypass mode (operator-pulled, audit-logged)
            - file engineering ticket
          If gateway/customer config:
            - guide customer to corrective action; cite playbook section
          If cluster fabric:
            - hand off to customer DC team
T+60 min  Customer comms: incident status, ETA to resolution
T+...     Resolution; post-incident review scheduled within 5 business days
```

### Bypass mode (always available)

```bash
# Pause Cubie enforcement; observation continues
curl -X POST http://<cubie-control>:8910/v1/bypass \
    -H "X-Operator-ID: <op-name>" \
    -H "Content-Type: application/json" \
    -d '{"reason":"<incident-id>","duration":"60m"}'

# Resume enforcement
curl -X POST http://<cubie-control>:8910/v1/resume \
    -H "X-Operator-ID: <op-name>"
```

Bypass events are written to the dual ledger, so the audit team can replay the timeline. Bypass is reversible in seconds because no state migration is involved.
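
Because the bypass reason ends up in the audit trail, it can help to validate the request body before sending it. The sketch below is an illustration under assumptions: the helper name `bypass_payload`, the allowed incident-ID characters, and the minutes-only duration format are not specified by Cubie and are inferred from the `"60m"` example above.

```bash
# Build the JSON body for POST /v1/bypass, refusing empty or malformed input.
# Usage: bypass_payload <incident-id> <duration, e.g. 60m>
bypass_payload() {
    incident="$1"
    duration="$2"
    # Assumption: incident IDs are alphanumeric plus "-" and "_".
    case "$incident" in
        ""|*[!A-Za-z0-9_-]*) echo "invalid incident id" >&2; return 1 ;;
    esac
    # Assumption: durations are whole minutes with an "m" suffix, as in "60m".
    minutes="${duration%m}"
    if [ "$minutes" = "$duration" ]; then
        echo "duration must end in m" >&2; return 1
    fi
    case "$minutes" in
        ""|*[!0-9]*) echo "invalid duration" >&2; return 1 ;;
    esac
    printf '{"reason":"%s","duration":"%s"}\n' "$incident" "$duration"
}

# Usage with the bypass call shown above:
#   curl -X POST http://<cubie-control>:8910/v1/bypass \
#       -H "X-Operator-ID: <op-name>" \
#       -d "$(bypass_payload INC-1234 60m)"
```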

---

## Telemetry pipeline

### Standard exports

Cubie exports OpenTelemetry-compatible metrics, traces, and logs.

**Metrics (Prometheus):**
- `cubie_admit_decisions_total{verdict,denial_class,deployment}`
- `cubie_admit_latency_seconds` (histogram)
- `cubie_dual_ledger_writes_total{ledger,status}`
- `cubie_attestation_refresh_total{source}`
- `cubie_bypass_active{operator}` (gauge, 0/1)
- `cubie_uptime_seconds`
- `cubie_capability_hmac_validation_total{status}`
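
As a quick check during triage, the decision counter can be reduced to a denial ratio straight from the Prometheus text exposition output. This is a sketch under assumptions: the label value `verdict="deny"` and the scrape path in the usage comment are guesses, and the parser expects samples without trailing timestamps.

```bash
# Read Prometheus text-format samples on stdin and print denied / total
# for cubie_admit_decisions_total, summed across all label sets.
denial_ratio() {
    awk '
        index($0, "cubie_admit_decisions_total{") == 1 {
            total += $NF                                 # sample value is the last field
            if ($0 ~ /verdict="deny"/) denied += $NF     # assumed label value
        }
        END {
            if (total == 0) { print "no-data"; exit }
            printf "%.4f\n", denied / total
        }'
}

# Usage (the /metrics path is an assumption):
#   curl -s http://<cubie-control>:8910/metrics | denial_ratio
```

The ratio is computed over lifetime counter totals, so it suits a quick sanity check; rate-over-window views belong in the dashboards.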

**Logs (Loki / generic):**
- Structured JSON; one per admit decision (sampled when traffic > 10K req/sec)
- Mandatory fields: `timestamp`, `request_id`, `verdict`, `denial_class`, `latency_us`, `attestation_status`
- Sampling: 100% of denials; 1% of passes (configurable)
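
The sampling rule can be illustrated with a deterministic sampler keyed on the request ID, so the same request is always either kept or dropped. This is a sketch of one possible approach, not Cubie's implementation: the function name `should_log`, the verdict string `pass`, and the use of `cksum` for bucketing are assumptions, and the 10K req/sec threshold is not modeled.

```bash
# Exit status 0 means "emit a log line" for this decision.
# Usage: should_log <verdict> <request_id> [pass_sample_percent]
should_log() {
    verdict="$1"
    request_id="$2"
    pass_percent="${3:-1}"          # default from the text: 1% of passes
    # Anything that is not a pass (i.e. every denial) is always logged.
    if [ "$verdict" != "pass" ]; then
        return 0
    fi
    # Map the request ID to a stable bucket in 0..99 using its CRC checksum.
    crc=$(printf '%s' "$request_id" | cksum | cut -d' ' -f1)
    bucket=$((crc % 100))
    [ "$bucket" -lt "$pass_percent" ]
}

# Example: a denial is always kept, whatever the request ID.
should_log deny req-0001 && echo "log it"
```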

**Traces (OTLP):**
- Per-request span: admit decision + ledger writes
- Parent-span carries customer's request-trace context

### Dashboard set

Out-of-box Grafana dashboards (importable via JSON):

1. **Admission overview** — decisions/sec by verdict, latency histogram, denial-class pie
2. **Latency drill-down** — p50/p95/p99 per deployment, per-class
3. **Denial pattern** — top denial classes over time, denial reasons table
4. **Capacity recovery** — wasted-cycle ratio (estimated from denial volume)
5. **SLA compliance** — uptime, decision-rate, latency-vs-SLO over rolling windows
6. **Attestation status** — TDX/SEV-SNP attestation freshness; expiry runway
7. **Dual-ledger health** — write rate, write failures, replication lag

---

## Version upgrade procedure

**Cubie semver discipline:**
- **Major** = breaking wire-format change (~2 year cadence; e.g. CubeObjectV1 → V2)
- **Minor** = new admit gate, new theorem family, new policy capability (~quarterly)
- **Patch** = bug fix, telemetry improvement, performance tune (~monthly)

### Upgrade sequence (sidecar deployment)

```bash
# 1. Pull new image
docker pull centillion/cubie-admit-gate:v1.2.1

# 2. Start standby instance side-by-side on different port
docker run -d --name cubie-admit-gate-v1.2.1 \
    -p 8911:8910 \
    -e CUBIE_MODE=standby \
    centillion/cubie-admit-gate:v1.2.1

# 3. Verify standby is healthy
curl http://localhost:8911/health
curl http://localhost:8911/v1/version  # should report v1.2.1

# 4. Shadow-mode comparison (10 min)
curl -X POST http://localhost:8911/v1/policy \
    -H "Content-Type: application/json" \
    -d '{"mode":"shadow","compare_to":"http://localhost:8910"}'
# Watch the dashboard: standby's decisions should match active for 10 min

# 5. Cut over
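docker stop cubie-admit-gate
# A stopped container still holds its name, so move it aside first (example suffix);
# this also keeps the old version available for rollback.
docker rename cubie-admit-gate cubie-admit-gate-previous
docker rename cubie-admit-gate-v1.2.1 cubie-admit-gate
docker run ... -p 8910:8910 ...  # promote standby to active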

# 6. Verify production traffic flowing through new version
curl http://localhost:8910/v1/version  # v1.2.1
```

If standby's decisions diverge from active during shadow comparison → halt, investigate, file bug.
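
If decision logs from both instances are available, divergence can also be counted offline rather than judged by eye on the dashboard. The sketch below assumes a hypothetical export format of one `<request_id> <verdict>` pair per line; Cubie's actual log export may differ, and the function name `count_divergence` is illustrative.

```bash
# Count requests on which the standby's verdict differs from the active's.
# Usage: count_divergence <active_decisions_file> <standby_decisions_file>
# Note: the active file must be non-empty for the two-file idiom to work.
count_divergence() {
    awk '
        NR == FNR { active[$1] = $2; next }          # first file: remember active verdicts
        ($1 in active) && active[$1] != $2 { n++ }   # second file: count disagreements
        END { print n + 0 }
    ' "$1" "$2"
}

# Usage: any non-zero result means halt the upgrade and investigate.
#   count_divergence active.log standby.log
```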

### Wire-format upgrade (major version)

During a major-version transition, Cubie accepts both `CubeObjectV1` and `CubeObjectV2`. The customer's gateway can emit either format, and the admit gate handles both. The standard transition window is 90 days; the customer cuts over at their own pace.

---

## Disaster recovery

### Dual-ledger replication

The dual ledger is the only stateful component. It relies on standard PostgreSQL streaming replication plus OpenBao HA.

**Recovery procedure (region failure):**
1. PostgreSQL: promote standby in alternate region (Patroni/auto-failover or manual)
2. OpenBao: leader election to alternate region
3. Cubie admit gates: re-point to alternate region's ledger (config update)
4. Customer traffic: unchanged (admit gates were not the durable state)

**RPO (Recovery Point Objective):** < 5 sec for sync replication; < 1 min for async.
**RTO (Recovery Time Objective):** < 5 min for automated failover; < 30 min for manual.

### Key rotation

Capability HMAC keys rotate per epoch (default: 24-hour epoch). Customer ops team can force-rotate via:

```bash
curl -X POST http://<cubie-control>:8910/v1/keys/rotate \
    -H "Content-Type: application/json" \
    -d '{"reason":"scheduled","operator":"<op>"}'
```

HMACs from the previous epoch remain valid for a grace period (default 60 min) after rotation, to cover in-flight requests. Once the grace period ends, requests carrying old HMACs are denied with class `LOCK` and a clear remediation message (refresh the capability).
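
The grace-period rule can be expressed as a small decision function, which is handy when reasoning about a burst of `LOCK` denials after a rotation. This is a sketch of the rule as described here, not the admit gate's code: the function name, the integer epoch numbering, and the treatment of the exact boundary (a capability is rejected at precisely 3600 seconds) are assumptions.

```bash
# Decide how a capability HMAC is treated, given its epoch and the time since rotation.
# Usage: hmac_epoch_verdict <cap_epoch> <current_epoch> <seconds_since_rotation> [grace_seconds]
hmac_epoch_verdict() {
    cap_epoch="$1"
    current_epoch="$2"
    since_rotation="$3"
    grace="${4:-3600}"              # default grace period from the text: 60 min
    if [ "$cap_epoch" -eq "$current_epoch" ]; then
        echo "valid"
    elif [ "$cap_epoch" -eq $((current_epoch - 1)) ] && [ "$since_rotation" -lt "$grace" ]; then
        echo "valid-grace"          # previous epoch, still inside the grace period
    else
        echo "deny:LOCK"            # older epoch, or grace period has ended
    fi
}

# Example: a capability from the previous epoch, 30 minutes after rotation.
hmac_epoch_verdict 41 42 1800
```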

### Configuration backup

All policy config is YAML in version control (customer's GitOps repo or Centillion-provided). Recovery = clone, apply.

---

## Routine maintenance

### Daily

- Review Sev 3+ incidents in dashboard
- Check dual-ledger write success rate (target: 100%)
- Confirm attestation refresh successful (last refresh < epoch window)

### Weekly

- Rotate Grafana dashboard ownership
- Review denial-class distribution; flag anomalies for customer
- Validate backup recoverability (test restore in staging)

### Monthly

- Engineering call to review SLOs vs SLA
- Customer business review (denial trends, capacity recovery vs ROI projection)
- Patch upgrade if Centillion has released a new patch version (e.g. v1.2.x)

### Quarterly

- Minor version upgrade rehearsal in staging
- Pen-test rerun (year 1) or smaller assessment (year 2+)
- Customer satisfaction survey (Net Promoter Score)
- Cost-of-Cubie / Cost-of-protected-GPU-cycles ROI refresh

### Annually

- Major version upgrade window (if applicable)
- Compliance recertification (SOC 2 Type II annual)
- Pen-test full rerun

---

## Common troubleshooting

| Symptom | First check | Second check |
|---|---|---|
| All requests denying with LOCK | Capability HMAC keys expired | Attestation refresh window |
| Latency spike on p99 only | Autovacuum or checkpoint I/O stall (dual-ledger PG) | OpenBao seal/unseal pulse |
| Denial-class distribution shifted unexpectedly | New attacker pattern in customer traffic | Customer recently rolled out new app config |
| Dual-ledger write failures | PG/OpenBao connectivity | Disk/network |
| Telemetry gap | OTLP exporter restart needed | Customer's Prometheus scrape config |
| Bypass mode stuck active | Operator forgot to resume | Audit log review |

---

## What to do when you don't know what to do

1. Engage bypass mode (reversible, audit-logged, customer-safe)
2. Page Centillion via support tier path
3. Capture last 5 minutes of full telemetry (`curl http://localhost:8910/v1/debug/snapshot`) and attach to ticket
4. Customer comms: "Cubie is in bypass; cluster behavior unchanged from pre-Cubie baseline; under investigation"

This is the unstuck-button. It exists because we know not every incident is one we've seen before. Use it.

---

## Handoff to WWT managed-services (optional)

This runbook is structured so WWT's managed-services team can own Day-2 operations as a packaged offering. Centillion provides:
- Software (the admit gate + control plane + projector + ledger images)
- Updates (signed binaries; published version stream)
- L3 escalation engineering (named engineer on Gold/Platinum)
- Quarterly product review (roadmap, and how customer feedback feeds into it)

WWT provides (in managed-services model):
- L1 customer-facing on-call (using this runbook)
- L2 deployment + integration support
- Customer reporting (using the metrics + dashboards above)
- Project management for upgrades + DR exercises

Pricing split is per the standard WWT ISV-managed-services template.
