Creating a Runbooks Module
A runbook captures the full workflow of an exceptional operational scenario — a surgical restore, a data migration, a cross-system reconciliation, an incident response. It is engineer-facing and one-off in nature, even when the pattern recurs.
1. When to use this type
Create a Runbooks module when engineers periodically need to perform non-routine, high-stakes operations that must be done carefully and identically each time: scoping damage, accessing sensitive environments, applying a correction safely, and verifying it.
| It IS a runbook when… | It is NOT a runbook when… |
|---|---|
| It recovers from or corrects an exceptional situation | It is routine system upkeep → Operations |
| An engineer (with DB/cluster access) runs it | A staff member runs it for a business outcome → Procedures |
| Getting it wrong is costly (data loss, outage) | It is low-stakes and frequent |
Split the generic from the specific. The reusable technique (e.g. "restore a backup into a temporary schema") belongs in a skill / how-to, reusable across incidents. The runbook captures the case-specific application: the actual scoping, data volumes, SQL, decisions, and outcome.
2. Standard structure
modules/runbooks/
├── nav.adoc
└── pages/
    ├── index.adoc                  # Catalogue, grouped by scenario class
    ├── <scenario-recovery>.adoc    # One runbook per scenario
    └── <data-correction>.adoc
3. Page template
A runbook answers six questions. Copy templates/runbook-template.adoc, or use this skeleton:
= Runbook: <Scenario>
:description: <one line: what this recovers/corrects and the trigger>

== Context
<What happened / why this is needed; the constraints (live system, disallowed ops).>

== Scope
<Exactly which rows / files / environments are affected; what is in and out of scope.>

== Execution
. <Numbered step with concrete SQL / kubectl / curl, expected counts, verification gate>

== Decisions
<The design choices the procedure embodies (blanket vs field-level, same-cluster vs standalone).>

== Retention
<Which snapshot tables / schemas / logs stay, and when to clean them up.>

== References
<Source code inspected, related work items, design-journal entries, skills used.>
Runbooks operate on production. Every destructive step must state its expected result (row counts, affected objects) as a verification gate before the next step. A runbook with no gates is unsafe.
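To make the idea of a verification gate concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is an illustration under stated assumptions, not a prescribed implementation: the helper `run_gated_step`, the exception `GateFailed`, and the `orders` table are all hypothetical, and a real runbook would run against its own database with its own expected counts.

```python
import sqlite3


class GateFailed(Exception):
    """Raised when a step's actual row count differs from the expected one."""


def run_gated_step(conn, sql, params, expected_rows):
    """Run one destructive statement; commit only if the row count matches."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the statement; the operator must stop here
        raise GateFailed(f"expected {expected_rows} rows, got {cur.rowcount}")
    conn.commit()
    return cur.rowcount


# Hypothetical demo data: three orders, two of them wrongly marked 'void'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, ?)",
    [(1, "void"), (2, "void"), (3, "paid")],
)
conn.commit()

# The gate: the correction is expected to touch exactly two rows.
fixed = run_gated_step(
    conn,
    "UPDATE orders SET status = 'paid' WHERE status = ?",
    ("void",),
    expected_rows=2,
)
```

If the count does not match, the statement is rolled back and the exception stops the run, which mirrors the rule that no step proceeds past a failed gate.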
4. Flexibility
- No runbooks yet: omit the module entirely until the first real incident produces one; do not pre-create it.
- Pattern recurs across cases: keep one runbook per incident, and extract the shared technique into a reusable skill/how-to that each runbook references.
- Deployment/GitOps references sometimes land here; if they are routine rather than exceptional, they belong in Operations instead.
5. Quality checklist
- Context explains why and the constraints.
- Scope is exact: what is in and what is out.
- Every destructive step has an expected result / verification gate.
- Decisions made are recorded (so a future operator understands the trade-offs).
- Retention/cleanup of snapshots and temporary schemas is stated.
- Generic technique is referenced, not inlined.
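Part of this checklist can be checked mechanically. The sketch below, in Python, only verifies that a page contains the six section headings from the page template; it cannot judge whether gates or decisions are adequate. The function name `missing_sections` and the draft page are hypothetical, and the check assumes sections are written as `== Title` lines as in the skeleton.

```python
import re

REQUIRED_SECTIONS = [
    "Context", "Scope", "Execution", "Decisions", "Retention", "References",
]


def missing_sections(adoc_text):
    """Return the required '== Title' sections that are absent from the page."""
    found = set(re.findall(r"^== +(.+?)[ \t]*$", adoc_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical draft page that lacks its Retention and References sections.
draft = """= Runbook: Example restore
:description: example only

== Context
...

== Scope
...

== Execution
. step

== Decisions
...
"""

print(missing_sections(draft))  # prints ['Retention', 'References']
```

A check like this could run in review or CI to catch incomplete pages early; the remaining checklist items still need a human reader.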
6. Common pitfalls
|
7. Related
- Documentation Architecture: where this type fits.
- Operations Guide: the steady-state, technical sibling.
- Procedures Guide: the operator-facing sibling.