Runbooks

Overview

This module contains runbooks — case-specific procedures for recovery, data correction, and incident response in the Event and Membership Administration system. Each runbook captures the full workflow of a specific operational scenario: scoping the damage, accessing the necessary environments, applying the correction safely, and verifying the result.

Runbooks complement the more general Operations module. Where Operations documents the day-to-day steady-state procedures (cache management, imports, management API access), Runbooks documents the exceptional operations: the surgical restores, the data migrations, the cross-system reconciliations.

Runbooks are project-specific. The generic skills that underpin them (for example, restoring a MySQL Operator backup into a target schema, or accessing the idealogic-prod cluster) live in the global skills repository at ~/dev/ai-skills-develop/skills/.

Deployment

  • ArgoCD Deployment Patterns — GitOps deployment via ArgoCD Application manifests: structure, environment promotion pattern (dev → stage → prod), secret management, image-pull secrets, and infrastructure Applications (OTel Collector, MySQL Operator, cert-manager). Reference for onboarding new services like admin-portal.

Data Correction & Restore

  • wp_users / wp_usermeta surgical restore for the WPCA UID-collision incident — Surgical restore of wp_users + wp_usermeta rows that were overwritten when an EP import shipped the legacy UID column and wrote onto unrelated new-system user records. Covers the full 11-step workflow: scoping, backup extract into an isolated schema, purging EPs/Results, transactional restore, re-importing, verification, and cleanup.
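The transactional restore and verification steps named in the runbook above might look roughly like the following. This is a minimal sketch, not the runbook's actual procedure: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table names (live_users, snapshot_users), the columns, and the helpers make_demo_db and restore_rows are all hypothetical. It illustrates a blanket replace of the affected rows inside one transaction, with a row-count gate that rolls everything back on a mismatch.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a damaged live table and a clean snapshot.

    Hypothetical stand-in: the real incident involved MySQL and the
    wp_users / wp_usermeta tables, not these simplified ones.
    """
    # isolation_level=None puts the connection in autocommit mode, so
    # transactions are controlled explicitly with BEGIN / COMMIT / ROLLBACK.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE live_users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("CREATE TABLE snapshot_users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO live_users VALUES (?, ?)",
        [(1, "overwritten@example.com"),
         (2, "overwritten@example.com"),
         (3, "carol@example.com")],
    )
    conn.executemany(
        "INSERT INTO snapshot_users VALUES (?, ?)",
        [(1, "alice@example.com"),
         (2, "bob@example.com"),
         (3, "carol@example.com")],
    )
    return conn


def restore_rows(conn, affected_ids, expected_count):
    """Replace the affected live rows with their snapshot copies, atomically.

    Raises RuntimeError (after rolling back) if the number of restored rows
    does not match expected_count, which acts as the verification gate.
    """
    ids = list(affected_ids)
    if not ids:
        raise ValueError("affected_ids must not be empty")
    marks = ",".join("?" for _ in ids)
    try:
        conn.execute("BEGIN")
        # Blanket replace: drop the damaged rows, then copy the snapshot rows in.
        conn.execute(f"DELETE FROM live_users WHERE id IN ({marks})", ids)
        conn.execute(
            f"INSERT INTO live_users SELECT id, email FROM snapshot_users "
            f"WHERE id IN ({marks})",
            ids,
        )
        # Verification gate: count what actually landed before committing.
        restored = conn.execute(
            f"SELECT COUNT(*) FROM live_users WHERE id IN ({marks})", ids
        ).fetchone()[0]
        if restored != expected_count:
            raise RuntimeError(
                f"expected {expected_count} restored rows, found {restored}"
            )
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    return restored
```

Calling restore_rows(make_demo_db(), [1, 2], 2) restores the two damaged rows and leaves row 3 alone. Passing an ID that is missing from the snapshot makes the count check fail, and the live table is left exactly as it was before the call.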

Writing a new runbook

When you write a runbook, split the generic reusable procedure from the case-specific application:

  • Generic procedure — goes into ~/dev/ai-skills-develop/skills/ as a SKILL.md. Applies across projects / environments / incidents.

  • Case-specific application — goes here, in docs-event/modules/runbooks/. It references the skill for the workflow and captures the decisions, data volumes, SQL, and outcomes specific to this project and this incident.

A good runbook answers:

  1. Context — what happened, why the procedure is needed, and what the constraints are (live system, disallowed operations, etc.)

  2. Scope — exactly which rows / files / environments are affected; what is in and out of scope

  3. Execution — the step-by-step sequence, with the concrete SQL / kubectl / curl commands, expected row counts, and verification gates

  4. Decisions made — the design choices that the procedure embodies (e.g. blanket replace vs field-level allow-list, same-cluster vs standalone target)

  5. Retention — which snapshot tables / schemas / logs stay, and when they should be cleaned up

  6. References — links to the source code inspected, related ADO items, design journal entries, skills used
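The checklist above can be turned into a quick self-check before a runbook is committed. The following is a minimal sketch under stated assumptions: it assumes each of the six items appears in the runbook as a heading on its own line, optionally prefixed with "=" or "#" markers, which this page does not actually prescribe. The names REQUIRED_SECTIONS and missing_sections are hypothetical.

```python
# Section names taken from the checklist; treating them as literal headings
# is an assumption, not a documented convention of this module.
REQUIRED_SECTIONS = (
    "Context",
    "Scope",
    "Execution",
    "Decisions made",
    "Retention",
    "References",
)


def missing_sections(runbook_text):
    """Return the required section names that have no heading line in the text."""
    headings = set()
    for line in runbook_text.splitlines():
        # Strip surrounding whitespace and leading '=' or '#' heading markers.
        candidate = line.strip().lstrip("#= ").strip().lower()
        if candidate:
            headings.add(candidate)
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


draft = """\
== Context
An import overwrote unrelated user rows.

== Scope
Only the overwritten rows are restored.
"""

# Lists the sections the draft still needs.
still_needed = missing_sections(draft)
```

For the draft above, still_needed holds the four sections that are not yet written, in checklist order. A heading match only shows that a section exists; it says nothing about whether the section actually answers its question.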