Runbooks
Overview
This module contains runbooks — case-specific procedures for recovery, data correction, and incident response in the Event and Membership Administration system. Each runbook captures the full workflow of a specific operational scenario: scoping the damage, accessing the necessary environments, applying the correction safely, and verifying the result.
Runbooks complement the more general Operations module. Where Operations documents the day-to-day steady-state procedures (cache management, imports, management API access), Runbooks documents the exceptional operations: the surgical restores, the data migrations, the cross-system reconciliations.
Runbooks are project-specific. The generic skills that underpin them (for example, restoring a MySQL Operator backup into a target schema, or accessing the idealogic-prod cluster) live in the global skills repository at
Deployment
- ArgoCD Deployment Patterns: GitOps deployment via ArgoCD Application manifests, covering structure, the environment promotion pattern (dev → stage → prod), secret management, image-pull secrets, and infrastructure Applications (OTel Collector, MySQL Operator, cert-manager). Use it as a reference when onboarding new services such as admin-portal.
Data Correction & Restore
- wp_users / wp_usermeta surgical restore for the WPCA UID-collision incident: restores the wp_users and wp_usermeta rows that were overwritten when an EP import shipped the legacy UID column and wrote onto unrelated new-system user records. The runbook covers the full 11-step workflow: scoping, backup extract into an isolated schema, purging EPs/Results, transactional restore, re-importing, verification, and cleanup.
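The "transactional restore" and "verification" steps named above follow a common shape: copy the scoped rows back from a snapshot inside one transaction, and commit only if the restored row count matches what scoping predicted. The sketch below illustrates that shape only; it is not the SQL from the runbook itself. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the `restore_snapshot` table, the two-column `wp_users` layout, and the sample rows are all hypothetical.

```python
import sqlite3


def restore_rows(conn, ids, expected_count):
    """Restore the given IDs from the snapshot table, or change nothing.

    Returns True if exactly `expected_count` rows were restored and committed,
    False if the count gate failed and the transaction was rolled back.
    """
    placeholders = ",".join("?" for _ in ids)
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            cur = conn.execute(
                "INSERT OR REPLACE INTO wp_users (ID, user_login) "
                "SELECT ID, user_login FROM restore_snapshot "
                f"WHERE ID IN ({placeholders})",
                list(ids),
            )
            # Verification gate: abort unless the row count matches the scope.
            if cur.rowcount != expected_count:
                raise RuntimeError(
                    f"expected {expected_count} rows, restored {cur.rowcount}"
                )
    except RuntimeError:
        return False
    return True


def demo_connection():
    """Build a throwaway in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wp_users (ID INTEGER PRIMARY KEY, user_login TEXT);
        CREATE TABLE restore_snapshot (ID INTEGER PRIMARY KEY, user_login TEXT);
        INSERT INTO wp_users VALUES
            (1, 'overwritten-a'), (2, 'overwritten-b'), (3, 'untouched');
        INSERT INTO restore_snapshot VALUES (1, 'alice'), (2, 'bob');
        """
    )
    return conn
```

Calling `restore_rows(demo_connection(), [1, 2], 2)` restores both rows and commits. Asking for IDs 1, 2, and 3 with an expected count of 3 fails the gate, because the snapshot holds only two of them, and the live table is left unchanged. This sketch does a blanket row replace; a field-level allow-list would need an UPDATE over selected columns instead.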
Writing a new runbook
When you write a runbook, split the generic reusable procedure from the case-specific application:
- Generic procedure: goes into ~/dev/ai-skills-develop/skills/ as a SKILL.md. Applies across projects, environments, and incidents.
- Case-specific application: goes here, in docs-event/modules/runbooks/. References the skill for the workflow and captures the decisions, data volumes, SQL, and outcomes specific to this project and this incident.
A good runbook answers:
- Context: what happened, why the procedure is needed, and what the constraints are (live system, disallowed operations, etc.)
- Scope: exactly which rows, files, and environments are affected; what is in and out of scope
- Execution: the step-by-step sequence, with the concrete SQL / kubectl / curl commands, expected row counts, and verification gates
- Decisions made: the design choices that the procedure embodies (e.g. blanket replace vs field-level allow-list, same-cluster vs standalone target)
- Retention: which snapshot tables, schemas, and logs stay, and when they should be cleaned up
- References: links to the source code inspected, related ADO items, design journal entries, and skills used
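The checklist above can be turned into a quick self-check for a runbook draft. The following is a minimal sketch, assuming each of the six items appears as its own heading and that heading lines start with `=` (AsciiDoc) or `#` (Markdown). The helper name `missing_sections` and the heading convention are assumptions for illustration, not an existing tool in this repository.

```python
# The six questions a runbook should answer, as listed above.
REQUIRED_SECTIONS = (
    "Context",
    "Scope",
    "Execution",
    "Decisions made",
    "Retention",
    "References",
)


def missing_sections(text):
    """Return the required section names that have no heading line in `text`."""
    headings = {
        # Drop the leading heading markers and compare case-insensitively.
        line.lstrip("#= ").strip().lower()
        for line in text.splitlines()
        if line.startswith(("#", "="))
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

For a draft that so far contains only `== Context` and `== Scope` headings, the function reports the remaining four sections in checklist order.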