Creating a Runbooks Module

A runbook captures the full workflow of an exceptional operational scenario — a surgical restore, a data migration, a cross-system reconciliation, an incident response. It is engineer-facing and one-off in nature, even when the pattern recurs.

1. When to use this type

Create a Runbooks module when engineers occasionally need to perform non-routine, high-stakes operations that must be done carefully and the same way each time: scoping the damage, accessing sensitive environments, applying a correction safely, and verifying the result.

It IS a runbook when…                                   It is NOT a runbook when…
------------------------------------------------------  ----------------------------------------------------------
It recovers from or corrects an exceptional situation   It is routine system upkeep → Operations
An engineer (with DB/cluster access) runs it            A staff member runs it for a business outcome → Procedures
Getting it wrong is costly (data loss, outage)          It is low-stakes and frequent

Split the generic from the specific. The reusable technique (e.g. "restore a backup into a temporary schema") belongs in a skill or how-to that any incident can draw on. The runbook captures the case-specific application: the actual scoping, data volumes, SQL, decisions, and outcome.

2. Standard structure

modules/runbooks/
├── nav.adoc
└── pages/
    ├── index.adoc                  # Catalogue, grouped by scenario class
    ├── <scenario-recovery>.adoc    # One runbook per scenario
    └── <data-correction>.adoc

3. Page template

A runbook answers six questions. Copy templates/runbook-template.adoc, or use this skeleton:

= Runbook: <Scenario>
:description: <one line — what this recovers/corrects and the trigger>

== Context
<What happened / why this is needed; the constraints (live system, disallowed ops).>

== Scope
<Exactly which rows / files / environments are affected; what is in and out of scope.>

== Execution
. <Numbered step with concrete SQL / kubectl / curl, expected counts, verification gate>

== Decisions
<The design choices the procedure embodies (blanket vs field-level, same-cluster vs standalone).>

== Retention
<Which snapshot tables / schemas / logs stay, and when to clean them up.>

== References
<Source code inspected, related work items, design-journal entries, skills used.>

Runbooks operate on production. Every destructive step must state its expected result (row counts, affected objects), and that result is a verification gate: the operator confirms it before starting the next step. A runbook with no gates is unsafe.
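The gate rule can be sketched in code. The following is a minimal illustration, not a prescribed implementation: it assumes a database connection that follows Python's DB-API (here the standard-library sqlite3 module stands in for the real target), and the names run_gated_step and GateFailed are hypothetical. The destructive statement runs inside a transaction and is committed only if the number of affected rows equals the count the runbook states; any mismatch rolls the change back.

```python
import sqlite3


class GateFailed(Exception):
    """Raised when a step's actual result differs from the expected result."""


def run_gated_step(conn, sql, params, expected_rows):
    """Run one destructive statement; commit only if the row count matches.

    Hypothetical helper: `expected_rows` is the count written in the runbook.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise GateFailed(
                f"expected {expected_rows} rows, statement affected {cur.rowcount}"
            )
        conn.commit()
    except Exception:
        # Any failure, including a failed gate, leaves the data untouched.
        conn.rollback()
        raise
    return cur.rowcount


if __name__ == "__main__":
    # Illustrative data only; the table and values are made up for the demo.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        [(1, "corrupt"), (2, "corrupt"), (3, "ok")],
    )
    conn.commit()
    fixed = run_gated_step(
        conn, "UPDATE orders SET status = 'fixed' WHERE status = ?", ("corrupt",), 2
    )
    print(f"gate passed, {fixed} rows corrected")
```

In a written runbook the same idea appears as prose: the step lists the statement, the expected count, and the instruction to stop and roll back if the actual count differs.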

4. Flexibility

  • No runbooks yet: omit the module entirely until the first real incident produces one — do not pre-create it.

  • Pattern recurs across cases: keep one runbook per incident, and extract the shared technique into a reusable skill/how-to that each runbook references.

  • Deployment/GitOps references sometimes land here; if they are routine rather than exceptional, they belong in Operations instead.

5. Quality checklist

  • Context explains why and the constraints.

  • Scope is exact — what is in and out.

  • Every destructive step has an expected result / verification gate.

  • Decisions made are recorded (so a future operator understands the trade-offs).

  • Retention/cleanup of snapshots and temporary schemas is stated.

  • Generic technique is referenced, not inlined.
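Parts of this checklist can be checked mechanically. The sketch below is one possible approach, assuming runbook pages follow the skeleton from section 3 with its six "==" section headings. The function name missing_sections is hypothetical, and the check only confirms that each section exists and has some body text, not that the content is adequate.

```python
import re

# Section titles taken from the page template skeleton.
REQUIRED_SECTIONS = [
    "Context", "Scope", "Execution", "Decisions", "Retention", "References",
]


def missing_sections(page_text):
    """Return the required sections that are absent or have an empty body."""
    # Split at "== Title" headings; the capture group keeps each title.
    parts = re.split(r"^==[ \t]+(.+?)[ \t]*$", page_text, flags=re.MULTILINE)
    # parts is [preamble, title1, body1, title2, body2, ...]
    bodies = {}
    for title, body in zip(parts[1::2], parts[2::2]):
        bodies[title] = body.strip()
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name)]
```

A review script could call missing_sections on each file under pages/ and fail when the returned list is non-empty. Whether gates and decisions are actually sound still needs a human reader.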

6. Common pitfalls

  • No verification gates: destructive steps without expected counts are how incidents compound.

  • Generic + specific tangled: the reusable technique and this incident's specifics blur together, so the result is neither reusable nor precise.

  • Stale runbooks: an undated runbook written against a schema that has since changed is a trap; date each runbook and note the environment it was validated against.

  • Misfiled routine ops: recurring upkeep is Operations, not a Runbook.
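The stale-runbook pitfall lends itself to a simple automated check. This is a sketch under assumptions: it supposes each runbook header carries a ":validated-on:" attribute holding an ISO date, which is a hypothetical convention and not part of the template in section 3. The function name is_stale and the 180-day threshold are likewise illustrative.

```python
import re
from datetime import date, timedelta


def is_stale(page_text, today, max_age_days=180):
    """Return True if the page has no validation date or the date is too old."""
    match = re.search(
        r"^:validated-on:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$",
        page_text,
        flags=re.MULTILINE,
    )
    if match is None:
        # An undated runbook is treated as stale.
        return True
    validated = date.fromisoformat(match.group(1))
    return today - validated > timedelta(days=max_age_days)
```

Passing today explicitly (for example date.today()) keeps the check deterministic. A stale result means the runbook should be re-validated against the current schema before anyone follows it, not that it should be deleted.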