---
title: "Cubie Day-2 Operations Runbook"
subtitle: "SLA, support, incident playbook, telemetry, upgrades, DR"
date: "2026-05-30"
---

# Cubie Day-2 Operations Runbook

**Purpose.** What an ops team needs to run Cubie in production from year 1 onward. Drafted with WWT managed-services teaming in mind: Centillion engineering owns the product; WWT can own customer-facing managed-services delivery using this runbook as the playbook.

---

## Service-level objectives

| Component | SLO | Measurement |
|---|---|---|
| Admit-gate availability | 99.95% | per-instance uptime, monthly window |
| Admit-decision p99 latency | < 1 ms | Prometheus histogram |
| Dual-ledger write availability | 99.95% | PostgreSQL + OpenBao HA |
| Telemetry pipeline freshness | < 30 sec lag | scrape timestamp − exporter timestamp |
| Critical-bug response | < 1 hour (Gold tier); < 4 hours (Silver); < 1 business day (Bronze) | incident-create to engineer-acknowledged |

## Support tier matrix

| Tier | Availability | Response | Annual price class |
|---|---|---|---|
| Bronze | Business hours US-Eastern | < 1 business day | Included with license |
| Silver | 24/5 (Mon-Fri global) | < 4 hours critical | $15K-$30K |
| Gold | 24/7 + named engineer | < 1 hour critical | $40K-$80K |
| Platinum (gov/regulated) | 24/7 + on-call + quarterly on-site | < 30 min critical | $100K+ |

---

## Incident playbook

### Severity classification

| Sev | Trigger | Initial response |
|---|---|---|
| **Sev 1 (critical)** | Cubie admit-gate down on production cluster; customer reports denials cascading; data-corruption suspected | All hands, named engineer on call within SLA window. Customer pages directly. |
| **Sev 2 (major)** | Cubie active but degraded (latency >5 ms p99, or denial rate >2× baseline); shadow-mode-only fallback active | Engineer assigned within 2× SLA; daily standups until resolution. |
| **Sev 3 (minor)** | Single denial-class anomaly; telemetry gap; non-functional issue | Ticketed; addressed in next sprint. |
| **Sev 4 (info)** | Feature request, documentation gap, edge-case question | Backlog. |

### Sev 1 playbook (sample)

```
T+0       Page received from customer; on-call engineer acks within SLA
T+5 min   Engineer joins Slack/Teams bridge; reviews customer-visible symptoms
T+10 min  Pull last 5 minutes of:
            - cubie-admit-gate logs
            - dual-ledger write success rate
            - DCGM telemetry on the protected cluster
            - Prometheus histograms (latency, decisions)
T+15 min  Triage: is the issue (a) admit-gate code, (b) gateway integration,
          (c) cluster fabric, (d) customer config?
T+30 min  If admit-gate code:
            - failover to standby admit-gate instance (if HA pair)
            - OR engage bypass mode (operator-pulled, audit-logged)
            - file engineering ticket
          If gateway/customer config:
            - guide customer to corrective action; cite playbook section
          If cluster fabric:
            - hand off to customer DC team
T+60 min  Customer comms: incident status, ETA to resolution
T+...     Resolution; post-incident review scheduled within 5 business days
```

### Bypass mode (always available)

```bash
# CUBIE_HOST and OPERATOR_ID are placeholders; set them for your deployment.

# Pause Cubie enforcement; observation continues
curl -X POST "http://${CUBIE_HOST}:8910/v1/bypass" \
  -H "X-Operator-ID: ${OPERATOR_ID}" \
  -d '{"reason":"<reason>","duration":"60m"}'

# Resume enforcement
curl -X POST "http://${CUBIE_HOST}:8910/v1/resume" \
  -H "X-Operator-ID: ${OPERATOR_ID}"
```

Bypass events are written to the dual ledger, so the audit team can replay the timeline. Bypass is reversible in seconds (no state migration).

---

## Telemetry pipeline

### Standard exports

Cubie exports OpenTelemetry-compatible metrics, traces, and logs.

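
When chasing a telemetry gap, it can help to confirm that the expected metric families are actually being exposed. The sketch below is a hypothetical helper, not part of Cubie: it reads Prometheus text-format output on stdin and prints every expected metric name that has no sample line. The `/metrics` path in the usage comment is an assumption (this runbook does not state where the admit gate serves metrics); the metric names come from the list in this section.

```bash
# missing_metrics NAME...
# Hypothetical helper: reads Prometheus text exposition on stdin and prints
# each NAME that has no sample line ("NAME value" or "NAME{labels} value").
# Histograms expose _bucket/_sum/_count series, so pass e.g. NAME_count.
missing_metrics() (
  body=$(cat)
  for name in "$@"; do
    printf '%s\n' "$body" | grep -Eq "^${name}(\{| )" || printf '%s\n' "$name"
  done
)

# Usage (endpoint path is an assumption; CUBIE_HOST is a placeholder):
#   curl -fsS "http://${CUBIE_HOST}:8910/metrics" \
#     | missing_metrics cubie_uptime_seconds cubie_admit_latency_seconds_count
```

Empty output means every listed name was found; any printed name is a candidate for the "Telemetry gap" row in the troubleshooting table.
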

**Metrics (Prometheus):**
- `cubie_admit_decisions_total{verdict,denial_class,deployment}`
- `cubie_admit_latency_seconds` (histogram)
- `cubie_dual_ledger_writes_total{ledger,status}`
- `cubie_attestation_refresh_total{source}`
- `cubie_bypass_active{operator}` (gauge, 0/1)
- `cubie_uptime_seconds`
- `cubie_capability_hmac_validation_total{status}`

**Logs (Loki / generic):**
- Structured JSON; one per admit decision (sampled when traffic > 10K req/sec)
- Mandatory fields: `timestamp`, `request_id`, `verdict`, `denial_class`, `latency_us`, `attestation_status`
- Sampling: 100% of denials; 1% of passes (configurable)

**Traces (OTLP):**
- Per-request span: admit decision + ledger writes
- Parent span carries the customer's request-trace context

### Dashboard set

Out-of-box Grafana dashboards (importable via JSON):

1. **Admission overview**: decisions/sec by verdict, latency histogram, denial-class pie
2. **Latency drill-down**: p50/p95/p99 per deployment, per-class
3. **Denial pattern**: top denial classes over time, denial reasons table
4. **Capacity recovery**: wasted-cycle ratio (estimated from denial volume)
5. **SLA compliance**: uptime, decision-rate, latency-vs-SLO over rolling windows
6. **Attestation status**: TDX/SEV-SNP attestation freshness; expiry runway
7. **Dual-ledger health**: write rate, write failures, replication lag

---

## Version upgrade procedure

**Cubie semver discipline:**
- **Major** = breaking wire-format change (~2 year cadence; e.g. CubeObjectV1 → V2)
- **Minor** = new admit gate, new theorem family, new policy capability (~quarterly)
- **Patch** = bug fix, telemetry improvement, performance tune (~monthly)

### Upgrade sequence (sidecar deployment)

```bash
# 1. Pull new image
docker pull centillion/cubie-admit-gate:v1.2.1

# 2. Start standby instance side-by-side on different port
docker run -d --name cubie-admit-gate-v1.2.1 \
  -p 8911:8910 \
  -e CUBIE_MODE=standby \
  centillion/cubie-admit-gate:v1.2.1

# 3. Verify standby is healthy
curl http://localhost:8911/health
curl http://localhost:8911/v1/version   # should report v1.2.1

# 4. Shadow-mode comparison (10 min)
curl -X POST http://localhost:8911/v1/policy \
  -d '{"mode":"shadow","compare_to":"http://localhost:8910"}'
# Watch the dashboard: standby's decisions should match active for 10 min

# 5. Cut over
docker stop cubie-admit-gate
# Free the container name first, or the next rename fails with a name conflict.
# The "-previous" name is illustrative; keeping the old container allows rollback.
docker rename cubie-admit-gate cubie-admit-gate-previous
docker rename cubie-admit-gate-v1.2.1 cubie-admit-gate
docker run ... -p 8910:8910 ...   # promote standby to active

# 6. Verify production traffic flowing through new version
curl http://localhost:8910/v1/version   # v1.2.1
```

If standby's decisions diverge from active during shadow comparison → halt, investigate, file bug.

### Wire-format upgrade (major version)

Cubie supports `CubeObjectV1` and `CubeObjectV2` simultaneously during transition. The customer's gateway can emit either; the admit gate handles both. Standard transition window: 90 days. The customer cuts over at their own pace.

---

## Disaster recovery

### Dual-ledger replication

The dual ledger is the only stateful component. It uses standard PostgreSQL streaming replication + OpenBao HA.

**Recovery procedure (region failure):**

1. PostgreSQL: promote standby in alternate region (Patroni/auto-failover or manual)
2. OpenBao: leader election to alternate region
3. Cubie admit gates: re-point to alternate region's ledger (config update)
4. Customer traffic: unchanged (admit gates were not the durable state)

**RPO (Recovery Point Objective):** < 5 sec for sync replication; < 1 min for async.

**RTO (Recovery Time Objective):** < 5 min for automated failover; < 30 min for manual.

### Key rotation

Capability HMAC keys rotate per epoch (default: 24-hour epoch). The customer ops team can force a rotation via:

```bash
# CUBIE_HOST and <operator-id> are placeholders; set them for your deployment.
curl -X POST "http://${CUBIE_HOST}:8910/v1/keys/rotate" \
  -d '{"reason":"scheduled","operator":"<operator-id>"}'
```

The old epoch's HMACs remain valid for one grace period (default 60 min) so that in-flight requests can complete.
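
As a rough sketch of the grace-window rule described above: a capability signed under the previous epoch's key is accepted only while less than one grace period has elapsed since rotation. The function name, its arguments, and the exclusive upper boundary are illustrative assumptions, not Cubie's actual implementation; timestamps are Unix seconds.

```bash
# old_epoch_hmac_valid ROTATED_AT NOW [GRACE_SECS]
# Hypothetical check: succeeds (exit 0) while NOW is less than GRACE_SECS
# past the rotation time; GRACE_SECS defaults to 3600 (the 60 min default).
# Treating the boundary itself as expired is an assumption.
old_epoch_hmac_valid() (
  rotated_at=$1
  now=$2
  grace_secs=${3:-3600}
  [ $((now - rotated_at)) -lt "$grace_secs" ]
)

# Example: key rotated at t=1000
#   old_epoch_hmac_valid 1000 4599   -> success (59m59s after rotation)
#   old_epoch_hmac_valid 1000 4600   -> failure (grace period elapsed)
```
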
After the grace period, requests with old HMACs are denied with class `LOCK` and a clear remediation message (refresh capability).

### Configuration backup

All policy config is YAML in version control (customer's GitOps repo or Centillion-provided). Recovery = clone, apply.

---

## Routine maintenance

### Daily
- Review Sev 3+ incidents in dashboard
- Check dual-ledger write success rate (target: 100%)
- Confirm attestation refresh successful (last refresh < epoch window)

### Weekly
- Rotate Grafana dashboard ownership
- Review denial-class distribution; flag anomalies for customer
- Validate backup recoverability (test restore in staging)

### Monthly
- Engineering call to review SLOs vs SLA
- Customer business review (denial trends, capacity recovery vs ROI projection)
- Patch upgrade if Centillion released a v1.2.x

### Quarterly
- Minor version upgrade rehearsal in staging
- Pen-test rerun (year 1) or smaller assessment (year 2+)
- Customer satisfaction survey (Net Promoter Score)
- Cost-of-Cubie / Cost-of-protected-GPU-cycles ROI refresh

### Annually
- Major version upgrade window (if applicable)
- Compliance recertification (SOC 2 Type II annual)
- Pen-test full rerun

---

## Common troubleshooting

| Symptom | First check | Second check |
|---|---|---|
| All requests denying with LOCK | Capability HMAC keys expired | Attestation refresh window |
| Latency spike on p99 only | Garbage-collection pause (dual-ledger PG) | OpenBao seal/unseal pulse |
| Denial-class distribution shifted unexpectedly | New attacker pattern in customer traffic | Customer recently rolled out new app config |
| Dual-ledger write failures | PG/OpenBao connectivity | Disk/network |
| Telemetry gap | OTLP exporter restart needed | Customer's Prometheus scrape config |
| Bypass mode stuck active | Operator forgot to resume | Audit log review |

---

## What to do when you don't know what to do

1. Engage bypass mode (reversible, audit-logged, customer-safe)
2. Page Centillion via support tier path
3. Capture last 5 minutes of full telemetry (`curl http://localhost:8910/v1/debug/snapshot`) and attach to ticket
4. Customer comms: "Cubie is in bypass; cluster behavior unchanged from pre-Cubie baseline; under investigation"

This is the unstuck-button. It exists because we know not every incident is one we've seen before. Use it.

---

## Handoff to WWT managed-services (optional)

This runbook is structured so WWT's managed-services team can own Day-2 operations as a packaged offering. Centillion provides:

- Software (the admit gate + control plane + projector + ledger images)
- Updates (signed binaries; published version stream)
- L3 escalation engineering (named engineer on Gold/Platinum)
- Quarterly product review (roadmap and its dependencies on customer feedback)

WWT provides (in managed-services model):

- L1 customer-facing on-call (using this runbook)
- L2 deployment + integration support
- Customer reporting (using the metrics + dashboards above)
- Project management for upgrades + DR exercises

Pricing split is per the standard WWT ISV-managed-services template.