Runbooks
Overview
This module contains runbooks — case-specific procedures for recovery, data correction, and incident response in the Event and Membership Administration system. Each runbook captures the full workflow of a specific operational scenario: scoping the damage, accessing the necessary environments, applying the correction safely, and verifying the result.
Runbooks sit on the technical, exceptional side of the operational map (see How This Documentation Is Structured). Where Operations documents the technical day-to-day steady-state of keeping the system healthy (cache management, management API), and Operational Procedures documents the recurring tasks operators run to produce a business outcome (imports, reporting), Runbooks document the exceptional operations: the surgical restores, the data migrations, the cross-system reconciliations.
Runbooks are project-specific. The generic skills that underpin them (for example, restoring a MySQL Operator backup into a target schema, or accessing the idealogic-prod cluster) live in the global skills repository.
Deployment
- ArgoCD Deployment Patterns: GitOps deployment via ArgoCD Application manifests, covering structure, the environment promotion pattern (dev → stage → prod), secret management, image-pull secrets, and infrastructure Applications (OTel Collector, MySQL Operator, cert-manager). Reference for onboarding new services such as admin-portal.
Data Correction & Restore
- wp_users / wp_usermeta surgical restore for the WPCA UID-collision incident: surgical restore of wp_users + wp_usermeta rows that were overwritten when an EP import shipped the legacy UID column and wrote onto unrelated new-system user records. Covers the full 11-step workflow: scoping, backup extract into an isolated schema, purging EPs/Results, transactional restore, re-importing, verification, and cleanup.
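The "transactional restore" step above can be illustrated with a minimal sketch. It is not the runbook's actual SQL: it uses an in-memory SQLite database as a stand-in for MySQL, a hypothetical snapshot table named restore_wp_users in place of the isolated schema, an invented three-column layout for wp_users, and a blanket-replace strategy (delete, then re-insert from the snapshot). The point it shows is the shape of the operation: the replace and the row-count gate run inside one transaction, so a mismatch rolls everything back.

```python
import sqlite3


def make_demo_db():
    """Build a toy database; table layout and the snapshot table name are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wp_users (ID INTEGER PRIMARY KEY, user_login TEXT, user_email TEXT);
        CREATE TABLE restore_wp_users (ID INTEGER PRIMARY KEY, user_login TEXT, user_email TEXT);
        INSERT INTO wp_users VALUES
            (1, 'overwritten', 'wrong@example.com'),
            (2, 'bob', 'bob@example.com');
        INSERT INTO restore_wp_users VALUES (1, 'alice', 'alice@example.com');
        """
    )
    return conn


def restore_rows(conn, user_ids, expected_count):
    """Replace the given live rows with their snapshot copies, or roll back.

    The row-count check acts as the verification gate: if the snapshot does
    not yield exactly `expected_count` rows, nothing is committed.
    """
    placeholders = ",".join("?" for _ in user_ids)
    # `with conn:` commits on success and rolls back if an exception escapes.
    with conn:
        cur = conn.cursor()
        cur.execute(f"DELETE FROM wp_users WHERE ID IN ({placeholders})", user_ids)
        cur.execute(
            f"INSERT INTO wp_users SELECT * FROM restore_wp_users "
            f"WHERE ID IN ({placeholders})",
            user_ids,
        )
        restored = cur.rowcount
        if restored != expected_count:
            raise RuntimeError(
                f"expected {expected_count} restored rows, got {restored}; rolled back"
            )
    return restored


conn = make_demo_db()
print(restore_rows(conn, [1], expected_count=1))
```

Passing the expected count in from the scoping step, rather than computing it inside the function, keeps the gate independent of the data it is checking. A field-level allow-list restore would replace the DELETE/INSERT pair with targeted UPDATE statements but could keep the same gate.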
Writing a new runbook
When you write a runbook, split the generic reusable procedure from the case-specific application:
- Generic procedure: goes into ~/dev/ai-skills-develop/skills/ as a SKILL.md. Applies across projects / environments / incidents.
- Case-specific application: goes here, in docs-event/modules/runbooks/. References the skill for the workflow, and captures the decisions, data volumes, SQL, and outcomes specific to this project / this incident.
A good runbook answers:
- Context: what happened, why the procedure is needed, and what the constraints are (live system, disallowed operations, etc.)
- Scope: exactly which rows / files / environments are affected; what is in and out of scope
- Execution: the step-by-step sequence, with the concrete SQL / kubectl / curl commands, expected row counts, and verification gates
- Decisions made: the design choices that the procedure embodies (e.g. blanket replace vs field-level allow-list, same-cluster vs standalone target)
- Retention: which snapshot tables / schemas / logs stay, and when they should be cleaned up
- References: links to the source code inspected, related ADO items, design journal entries, skills used
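The checklist above lends itself to a quick self-check before a runbook is committed. The following is a hypothetical helper, not part of this repository: it assumes each of the six items appears as a heading line (AsciiDoc `==` or Markdown `#` markers are both tolerated) and reports the ones a draft is missing.

```python
REQUIRED_SECTIONS = (
    "Context",
    "Scope",
    "Execution",
    "Decisions made",
    "Retention",
    "References",
)


def missing_sections(runbook_text):
    """Return the required section names that no line of the draft starts with."""
    # Normalise each line: trim whitespace, drop leading heading markers, lowercase.
    lines = [line.strip().lstrip("=# ").lower() for line in runbook_text.splitlines()]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(line.startswith(name.lower()) for line in lines)
    ]


draft = """\
= Example runbook
== Context
== Scope
== Execution
"""
print(missing_sections(draft))
```

A check like this only confirms that the headings exist; whether each section actually answers its question still needs a human review.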