Creating an Operations Module

An operations document describes how to keep the system itself healthy at runtime: cache eviction, log levels, observability, configuration, outbound mail. It is engineer-facing and steady-state, covering the day-to-day upkeep that does not require a deployment.

1. When to use this type

Create an Operations module when engineers/SREs need documented, repeatable runtime control over a deployed system: inspecting health, changing log levels, evicting caches, reading traces, configuring integrations.

It IS operations when…                | It is NOT operations when…
--------------------------------------+-----------------------------------------------------------
It keeps the running system healthy   | It produces a business outcome → Procedures
It is routine / steady-state          | It recovers from an exceptional event → Runbooks
The audience is an engineer / SRE     | The audience is an operator learning a feature → User Guide
It is runtime (no redeploy)           | It is build/release/infra provisioning → a Deployment module

2. Standard structure

modules/operations/
├── nav.adoc
└── pages/
    ├── index.adoc                    # Overview + boundary with Procedures/Runbooks/Deployment
    ├── management-api-access.adoc     # One page per operational concern
    ├── cache-management.adoc
    ├── observability.adoc
    └── <integration>-configuration.adoc

The index.adoc should explicitly state the boundary: "this module keeps the system healthy; for business outcomes see Procedures, for exceptional recovery see Runbooks, for build/deploy see Deployment."
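The layout above can be scaffolded with a short script. This is a minimal sketch, not a prescribed tool: the `scaffold` function, the page titles, and the `* xref:` navigation entries are assumptions made for illustration, and the `<integration>-configuration.adoc` page is left out because its name depends on the integration.

```python
from pathlib import Path

# Hypothetical page list: file names mirror the layout above,
# titles are placeholders to adapt.
PAGES = {
    "index.adoc": "Overview",
    "management-api-access.adoc": "Management API access",
    "cache-management.adoc": "Cache management",
    "observability.adoc": "Observability",
}

# Boundary statement for index.adoc, taken from the guidance above.
BOUNDARY = (
    "This module keeps the system healthy; for business outcomes see Procedures, "
    "for exceptional recovery see Runbooks, for build/deploy see Deployment."
)


def write_if_absent(path: Path, text: str) -> bool:
    """Write text to path unless the file already exists; return True if written."""
    if path.exists():
        return False
    path.write_text(text, encoding="utf-8")
    return True


def scaffold(root: Path) -> list[Path]:
    """Create modules/operations with nav.adoc and stub pages; return new files."""
    module = root / "modules" / "operations"
    pages = module / "pages"
    pages.mkdir(parents=True, exist_ok=True)

    created = []
    nav_text = "".join(f"* xref:{name}[{title}]\n" for name, title in PAGES.items())
    if write_if_absent(module / "nav.adoc", nav_text):
        created.append(module / "nav.adoc")

    for name, title in PAGES.items():
        body = f"= {title}\n"
        if name == "index.adoc":
            body += f"\n{BOUNDARY}\n"
        if write_if_absent(pages / name, body):
            created.append(pages / name)
    return created
```

Existing files are never overwritten, so the script can be re-run safely after pages have been filled in.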

3. Page template

Operations pages vary more than procedures, but each should cover: purpose, prerequisites / access, how to do it (commands/endpoints), how to verify, and cautions. Copy templates/operations-template.adoc, or start from this skeleton:

= <Operational concern>
:description: <one line — what runtime capability this covers>

== Overview
<What this controls and when an engineer needs it.>

== Access
<Authentication, endpoints, tunnels, required roles.>

== Procedure
<Commands / API calls / config, with expected output.>

== Verification
<How to confirm the change took effect (e.g. cluster-wide).>

== Cautions
<Blast radius, timing windows, what NOT to do in production.>
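As an illustration of what a Verification section might ask the engineer to confirm, the sketch below checks a cache eviction across every expected node. It assumes the per-node entry counts have already been collected by whatever management interface the system exposes; `NodeStatus`, `unverified_nodes`, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    node: str           # hypothetical node identifier
    cache_entries: int  # entry count the node reported after eviction


def unverified_nodes(statuses: list[NodeStatus], expected_nodes: set[str]) -> list[str]:
    """Return expected nodes that did not report or still hold cache entries."""
    reported = {status.node: status for status in statuses}
    problems = []
    for node in sorted(expected_nodes):
        status = reported.get(node)
        if status is None or status.cache_entries > 0:
            problems.append(node)
    return problems


if __name__ == "__main__":
    statuses = [NodeStatus("node-a", 0), NodeStatus("node-b", 12)]
    # node-b still holds entries and node-c never reported.
    print(unverified_nodes(statuses, {"node-a", "node-b", "node-c"}))
    # prints ['node-b', 'node-c']
```

Treating a missing report as a failure is a deliberate choice here: a node that did not answer cannot be assumed clean.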

4. Flexibility

  • Library / SDK: usually no Operations module — there is no running system to keep healthy.

  • Single service: a few pages (health/management API, cache, observability) typically suffice.

  • Platform / multi-service: split by concern and consider a dedicated Observability or Configuration sub-area.

5. Quality checklist

  • Each page states access/prerequisites and the verification step.

  • Cautions cover blast radius and production timing.

  • The index draws the boundary with Procedures, Runbooks, and Deployment.

  • No business-outcome content (that is Procedures); no incident recovery (that is Runbooks).
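Part of this checklist can be automated. The following sketch, under the assumption that pages follow the section titles of the template in section 3, reports which required sections a page lacks; `REQUIRED_SECTIONS` and `missing_sections` are illustrative names, not part of any existing tooling.

```python
import re

# Section titles taken from the page template; adjust if a page uses other names.
REQUIRED_SECTIONS = ("Overview", "Access", "Procedure", "Verification", "Cautions")

# Matches "== Title" headings only; deeper levels such as "=== Title" are ignored.
SECTION_HEADING = re.compile(r"^==[ \t]+(.+)$", re.MULTILINE)


def missing_sections(page_text: str) -> list[str]:
    """Return the required section titles that do not appear in the page."""
    found = {match.group(1).strip() for match in SECTION_HEADING.finditer(page_text)}
    return [title for title in REQUIRED_SECTIONS if title not in found]


if __name__ == "__main__":
    page = "= Cache management\n\n== Overview\ntext\n\n== Access\n\n== Procedure\n"
    print(missing_sections(page))  # prints ['Verification', 'Cautions']
```

A check like this only confirms that the headings exist; whether the Cautions section really covers blast radius and production timing still needs a human review.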

6. Common pitfalls

  • Becoming a dumping ground: operator procedures and incident runbooks drift in because "operations" sounds generic. Keep it to system health, and test each candidate page against the two axes, using the criteria in section 1.

  • Missing verification — "evict the cache" without "confirm cluster-wide" leaves stale state.

  • Secrets in steps — reference where credentials live; never inline them.
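The "secrets in steps" pitfall lends itself to a rough automated check. The sketch below is a heuristic only, not a complete secret scanner: the patterns, the placeholder rules, and the `find_inline_secrets` name are all assumptions chosen for illustration. It flags lines that appear to assign a literal credential and allows values that are placeholders or environment-variable references.

```python
import re

# Heuristic, illustrative patterns; real secret scanning needs a dedicated tool.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key|token)\b\s*[=:]\s*(<[^>]+>|\S+)"),
    re.compile(r"(?i)\bbearer\s+([A-Za-z0-9._~+/-]{16,}=*)"),
]

# Allowed values: <placeholder>, $VAR, ${VAR}, or {attribute} references.
PLACEHOLDER = re.compile(r"^(<[^>]+>|\$\{?[A-Za-z_][A-Za-z0-9_]*\}?|\{[^}]+\})$")


def find_inline_secrets(page_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that appear to inline a credential."""
    hits = []
    for number, line in enumerate(page_text.splitlines(), start=1):
        for pattern in SECRET_PATTERNS:
            match = pattern.search(line)
            if match is None:
                continue
            # The last capture group of each pattern holds the candidate value.
            value = match.group(match.lastindex).strip("\"'")
            if not PLACEHOLDER.match(value):
                hits.append((number, line.strip()))
                break
    return hits
```

For example, a step containing `export API_KEY=abc123def456` would be flagged, while `password: ${DB_PASSWORD}` or `Bearer $MGMT_TOKEN` would pass because the value points at where the credential lives instead of inlining it.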