Recipes

Six worked YAML orchestrations, lifted directly from a running Okesu control plane. Each is annotated with what it does, when to reach for it, and what to notice — copy, adapt, drop in your CP.

T1: first-line, runs without an approval gate · T2: escalation, includes a human-in-the-loop step (approval: required on at least one step)

EDR Critical Response

T2

Auto-investigate HIGH/CRITICAL EDR findings, hunt the fleet, and pause for operator approval before containment.

When to use: A trusted EDR daimon emits a high-severity finding and you want fast triage + cross-fleet hunt + a contain-or-rollback plan ready for human review.
What to notice
  • Auto-fires on a finding match — no human in the loop until the approval gate at step 4.
  • A when: gate skips binary analysis when the triage step didn't pinpoint a binary — keeps the run cheap when there's nothing to dissect.
  • Step 3 fans out across three host classes (threat-rocky-1, threat-fedora-2, threat-debian-2) to look for the same IOCs across the fleet in parallel.
  • Step 4 is approval: required — the run pauses until an operator clicks Approve in the dashboard.
edr-critical-response.yaml
---
name: edr-critical-response
description: Auto-investigate HIGH or CRITICAL EDR findings, hunt the fleet for related artifacts, and pause for operator approval before containment.

# Auto-fire whenever the eventpipeline projects a HIGH or CRITICAL
# finding from the `edr` daimon. The matching finding's fields land
# on `{{trigger.*}}` for the steps to template against.
trigger:
  on: finding
  filter: "finding.severity in ['HIGH', 'CRITICAL'] && finding.agent == 'edr'"

# Inputs let an operator run this manually for testing — supplying a
# host + finding_id by hand. When the auto-trigger fires, the matching
# finding's fields populate {{trigger.*}} and these defaults aren't
# consulted (auto-trigger payloads carry the real values).
inputs:
  host:
    type: string
    required: false
    default: "threat-rocky-1"
  finding_id:
    type: int
    required: false
    default: 0

defaults:
  timeout: 5m

steps:
  # 1. Triage on the affected host. We only need a single agent run
  #    here — the investigator agent reads the finding context and
  #    pulls relevant evidence (process tree, recent network, file
  #    changes).
  - id: triage
    agent: investigator
    node: "{{trigger.host}}"
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Investigate finding #{{trigger.finding_id}} on {{trigger.host}}.

      Severity: {{trigger.severity}}
      Title: {{trigger.title}}
      Resource: {{trigger.resource}}
      Dedup key: {{trigger.dedup_key}}

      Build a 2-minute-window timeline of process / network / file
      events. If the finding pinpoints a binary (path, sha256), emit
      an orchestration_result finding with attributes:
        sha256, path, cmdline, parent_pid, network_peers (array)

  # 2. Analyze the binary if triage extracted one. The `when` gate
  #    skips the step when there's nothing to analyze.
  - id: analyze
    when: "{{triage.result.sha256 != ''}}"
    agent: binary-analyzer
    node: "{{trigger.host}}"
    timeout: 10m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Analyze the binary at {{triage.result.path}} (sha256
      {{triage.result.sha256}}).

      Static analysis only — strings, imports, entropy, packers, IOCs.
      Emit an orchestration_result finding with attributes:
        family, confidence (low/medium/high), iocs (array of
        {kind, value} entries), persistence (string)

  # 3. Hunt the fleet. Multi-host fan-out: the same hunt prompt runs
  #    in parallel on three of our most representative hosts, and
  #    findings are merged. Operators editing this in the visual
  #    editor see "3 nodes (fan-out)" on the card.
  - id: hunt
    agent: threat-hunter
    nodes:
      - threat-rocky-1
      - threat-fedora-2
      - threat-debian-2
    timeout: 8m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    continue_on_error: true   # one host's hunt failing doesn't kill the chain
    prompt: |
      Hunt for the IOCs from the previous step on this host.

      Target sha256: {{triage.result.sha256}}
      Family: {{analyze.result.family}}
      Other IOCs: {{analyze.result.iocs | json}}

      Look at process listings, recent shell history, persistence
      mechanisms (systemd units, cron, .bashrc), open network
      connections, and the last 6 hours of relevant log entries.

      Emit an orchestration_result finding with attributes:
        matches (count), evidence (array of strings), additional_hosts (array)

  # 4. Containment plan, gated. The amber pulse on this card in the
  #    canvas tells the operator they need to click Approve before
  #    the incident-responder agent dispatches.
  - id: respond
    approval: required
    agent: incident-responder
    node: "{{trigger.host}}"
    timeout: 15m
    actions:
      - update_finding_status
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Draft a containment plan for finding {{trigger.finding_id}}.

      Triage summary (last 50 lines):
      {{triage.output | tail(50)}}

      Binary analysis result:
      {{analyze.result | json}}

      Fleet hunt found {{hunt.findings | length}} related artifact(s)
      across {{hunt.nodes | length}} host(s):
      {{hunt.findings | json}}

      Per-host hunt detail:
      - threat-rocky-1: {{hunt.byNode["threat-rocky-1"].findings | length}} matches
      - threat-fedora-2: {{hunt.byNode["threat-fedora-2"].findings | length}} matches
      - threat-debian-2: {{hunt.byNode["threat-debian-2"].findings | length}} matches

      Output a written plan covering:
      1. Immediate isolation (SG / firewall / kubectl cordon)
      2. Evidence preservation (hashes, paths, command list)
      3. Eradication (specific files / accounts / persistence to remove)
      4. Validation steps and rollback procedure if anything breaks

      Then emit an orchestration_result finding with attributes:
        plan (string — the markdown above)
        actions (array)

      Actions to request on finding {{trigger.finding_id}}:
        update_finding_status → investigating
          (reason: "containment plan ready: <one-line summary>")
        add_finding_tag → containment-planned
        link_run_to_finding
---

Finding Autotriage

T1

Tier-1 triage that runs on every new HIGH/CRITICAL finding, attaches an evidence summary, and decides whether to suppress or escalate.

When to use: You want every finding to land with a one-paragraph "what is this" pre-baked, so the human queue starts at "should I act?" instead of "what am I looking at?".
What to notice
  • The first step has no when: gate — it runs on everything matching the trigger filter; cheap by design.
  • Uses an actions: allowlist so the agent can update finding status, set severity overrides, and add tags without an approval gate.
  • Demonstrates the platform's default action-class policy: only what's explicitly listed is permitted (the sketch below shows the shape in isolation).
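Before the full recipe, the allowlist behaviour in isolation. A minimal sketch, not a runnable recipe on its own; the step id and prompt are illustrative, while the action-class names are the ones used throughout this page:

---
# Sketch: the CP permits only the action classes listed under `actions:`.
- id: tag_only
  agent: investigator
  node: "{{trigger.host}}"
  actions:
    - add_finding_tag        # listed, so the agent may request it
    - link_run_to_finding    # listed, so permitted
  # update_finding_status is NOT listed: if the agent requests it, the CP
  # rejects the action. Deny by default, no approval gate involved.
  prompt: |
    Tag finding #{{trigger.finding_id}} and link this run to it.
---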
t1-finding-autotriage.yaml
---
name: t1-finding-autotriage
description: Tier-1 auto-triage fast-lane for HIGH/CRITICAL findings. Decides real-issue vs known noise; suppresses noise, summarises the rest, and escalates when the finding survives both checks. INFO/LOW findings are handled in batches by `t1-finding-batch-triage` instead — per-finding triage there saturated the API under noise bursts.

# Fires only on HIGH/CRITICAL — the per-finding fast-lane. INFO/LOW
# go through the batched `t1-finding-batch-triage` (cron 5min) so we
# don't pay one LLM startup per noise event.
trigger:
  on: finding
  filter: "finding.severity in ['HIGH', 'CRITICAL']"

inputs:
  host:
    type: string
    required: false
    default: ""
  finding_id:
    type: int
    required: false
    default: 0

defaults:
  timeout: 5m

steps:
  # 1. Classify: noise vs real. The investigator agent reads the
  #    finding context, checks the recent fleet pattern (is this the
  #    same finding firing on every host? is the source a known
  #    scanner / monitoring system?), and emits a verdict +
  #    requests CP-side actions to mutate the finding accordingly.
  - id: classify
    agent: investigator
    node: "{{trigger.host}}"
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Classify finding #{{trigger.finding_id}} on {{trigger.host}}.

      Severity: {{trigger.severity}}
      Title: {{trigger.title}}
      Agent: {{trigger.agent}}
      Resource: {{trigger.resource}}
      Dedup key: {{trigger.dedup_key}}

      Decide:
        - verdict: one of `noise` | `confirmed` | `unknown`
        - reasoning: one short sentence
        - suppress_pattern: glob to suppress future identical findings, or empty
        - escalate: bool — set true ONLY when verdict=confirmed AND severity in (HIGH, CRITICAL)

      Heuristics for `noise`:
        - Same finding fired on >=5 hosts in last 60 min with identical title
        - Source attributes match a known internal scanner / monitor IP
        - The change recorded matches a sanctioned automation (ansible run id, package manager update)
        - Self-reported finding from the agent that itself deployed (collector seeing its own writes)

      Emit an orchestration_result finding with attributes:
        verdict (string), reasoning (string), suppress_pattern (string),
        escalate (bool), actions (array — see below).

      Actions to request:
        - verdict=noise:
            update_finding_status → false_positive (reason: short noise reason)
            set_finding_severity_override → INFO
            add_finding_tag → auto-triaged-noise
            link_run_to_finding
        - verdict=confirmed:
            add_finding_tag → auto-confirmed
            link_run_to_finding
        - verdict=unknown:
            add_finding_tag → needs-human
            link_run_to_finding

      The full action protocol is at agents/_orchestration-actions.md.

  # 2. Auto-suppress when classified as noise. The action runs in the
  #    same node as the finding came from so its scope is local; a
  #    fleet-wide suppression would be an explicit T2 step.
  - id: auto_suppress
    when: "{{classify.result.verdict == 'noise' && classify.result.suppress_pattern != ''}}"
    agent: investigator
    node: "{{trigger.host}}"
    timeout: 2m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Apply local suppression for finding #{{trigger.finding_id}}.

      Pattern: `{{classify.result.suppress_pattern}}`
      Reason:  {{classify.result.reasoning}}

      Add the pattern to /etc/okesu/suppressions.yml on this host (create if absent), under
      a `local:` block keyed by today's ISO date so we can audit later.

      Emit an orchestration_result finding with attributes:
        applied (bool), suppression_path (string), entries_added (int)

  # 3. Confirmed-real summary. When `escalate` is true, write a
  #    short operator-readable summary so the human opening the
  #    finding sees an executive answer rather than a raw event log.
  - id: summarize
    when: "{{classify.result.escalate == true}}"
    agent: investigator
    node: "{{trigger.host}}"
    timeout: 3m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Write a one-paragraph operator brief for finding #{{trigger.finding_id}}.

      Verdict: confirmed real
      Reasoning: {{classify.result.reasoning}}

      Keep it ≤120 words. Cover:
        - What happened (what changed / fired / observed)
        - Blast radius (this host? cluster? fleet?)
        - Recommended next action (1-2 bullets)
        - Confidence level

      Emit an orchestration_result finding with attributes:
        brief (string), blast_radius (string: host|cluster|fleet),
        confidence (string: low|medium|high)
---

Failed-Login Noise Dedup

T1

Bundle authentication-failure noise into one incident; only escalate when a source actually authenticated successfully or hit a sensitive account.

When to use: Public-internet hosts that get scanned constantly — you don't want each rejected SSH attempt to appear as its own finding.
What to notice
  • A precise trigger.filter: keeps the orchestration from firing on findings that don't mention auth.
  • Step 1 collects the 60-minute window of failed logins and aggregates by source IP — the orchestration moves the burden of "is this brute-force or background radiation" onto the agent.
  • Step 2 escalates only when conditions hold (success after a streak of failures, or a sensitive account targeted).
t1-failed-login-noise-dedup.yaml
---
name: t1-failed-login-noise-dedup
description: Tier-1 dedup of authentication-failure noise. Distinguishes brute-force attempts from internet background radiation (vuln scanners, mass credential-spraying sweeps), bundles repeats into a single SEV-3 incident, and only escalates when the same source actually authenticated successfully or hit a sensitive account.

trigger:
  on: finding
  filter: "finding.title contains 'auth' && (finding.title contains 'fail' || finding.title contains 'invalid')"

inputs:
  host:
    type: string
    required: false
    default: ""
  finding_id:
    type: int
    required: false
    default: 0

defaults:
  timeout: 4m

steps:
  # 1. Source attribution — collect the source IPs across recent
  #    failures, classify each, and decide if any pattern looks like
  #    real brute force vs background scanner traffic.
  - id: triage
    agent: investigator
    node: "{{trigger.host}}"
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Triage failed-login finding #{{trigger.finding_id}} on {{trigger.host}}.

      Gather:
        - `journalctl -u sshd --since "60 minutes ago" | grep -iE "failed|invalid"` (or /var/log/auth.log)
        - Aggregate by source IP: count, first seen, last seen, account targeted
        - Look up each source IP: is it on a known threat-intel feed?
          (Use only what's locally available — no external lookups.)
        - Check for any SUCCESSFUL login from those source IPs in the same window:
          `journalctl -u sshd --since "60 minutes ago" | grep -i "accepted"`

      Classify each source:
        - `scanner` — high-volume, low-effort, hits common usernames (root, admin, oracle), no success
        - `targeted` — focused on a specific real account, slower cadence, sometimes succeeds
        - `unknown` — needs human eyes

      Verdict:
        - severity: one of `noise` | `elevated` | `incident`
          - noise:     all sources are scanners, no successful logins
          - elevated:  at least one targeted source, no success
          - incident:  any successful login from a flagged source
        - sources_count, attempts_total, accounts_hit (array)

      Emit an orchestration_result finding with attributes:
        severity (string)
        sources_count (int)
        attempts_total (int)
        accounts_hit (array of strings)
        has_successful_login (bool)
        worst_source (string — IP)
        rollup_window (string — e.g. "60m")

  # 2. Auto-bundle scanner noise. Writes a single dedup'd finding
  #    in place of N raw ones, suppresses the rest of the bundle
  #    for 24h via local fail2ban-style hosts.deny entry.
  - id: deduplicate
    when: "{{triage.result.severity == 'noise'}}"
    agent: investigator
    node: "{{trigger.host}}"
    timeout: 3m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Bundle the noise on {{trigger.host}}.

      The triage classified the {{triage.result.sources_count}} sources as
      scanners. Apply automatic mitigation:
        - If iptables/nftables present and an "okesu-scanners" chain exists,
          add the source IPs to it with a 24h timeout
        - Otherwise, append to /etc/hosts.deny for the next 24h with comment
          `# okesu T1 scanner-noise YYYY-MM-DD`
        - Roll the {{triage.result.attempts_total}} raw findings into one
          summary finding tagged `noise-bundled`

      Emit an orchestration_result finding with attributes:
        ips_blocked (int)
        block_method (string: iptables|nftables|hosts.deny|none)
        bundle_id (string)
        actions (array)

      Actions to request on the source finding #{{trigger.finding_id}}:
        update_finding_status → false_positive
          (reason: "scanner noise; bundled into <bundle_id>")
        set_finding_severity_override → INFO
        add_finding_tag → auto-triaged-noise
        add_finding_tag → noise-bundled
        link_run_to_finding

  # 3. Escalate when something targeted the host. The on-call sees
  #    the brief inline rather than digging through 200 raw findings.
  - id: escalate_brief
    when: "{{triage.result.severity != 'noise'}}"
    agent: incident-responder
    node: "{{trigger.host}}"
    timeout: 4m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Write an on-call brief for the auth-failure incident on {{trigger.host}}.

      Triage said:
        severity={{triage.result.severity}}
        sources={{triage.result.sources_count}}
        attempts={{triage.result.attempts_total}}
        accounts={{triage.result.accounts_hit}}
        had_success={{triage.result.has_successful_login}}
        worst_source={{triage.result.worst_source}}

      Cover:
        - One-paragraph timeline of what happened
        - Whether containment is needed NOW or can wait until business hours
        - Recommended actions, ordered by impact
          (key rotation? fail2ban? service-account audit? IR playbook?)

      Keep it ≤200 words. Emit an orchestration_result finding with attributes:
        urgency (string: low|medium|high)
        recommended_actions (array of strings)
        suggested_severity (string: SEV-2|SEV-3|SEV-4)
---

Fleet IOC Hunt

T2

Fan out across every host in the fleet to hunt for a confirmed IOC, then aggregate.

When to use: You've confirmed a real threat on one host (via EDR Critical Response or manual analysis) and need to know "where else is this?" — fast.
What to notice
  • Pure fan-out shape: the hunt step lists its targets under nodes: [...], and each host runs the same prompt with the same IOC inputs in parallel.
  • continue_on_error: true — one host's hunt failing (offline, daimon stale) doesn't kill the run; the synthesizer step accepts partial results.
  • The post-fan-out summarizer reads {{stepN.byNode["host-1"].findings}} to attribute matches per host.
t2-fleet-ioc-hunt.yaml
---
name: t2-fleet-ioc-hunt
description: Tier-2 fleet-wide IOC hunt. Triggered when any finding surfaces a usable indicator (sha256, IP, domain, key fingerprint). Fans out to every reachable host of the same OS family, hunts the IOC, builds a heatmap, and pauses for operator approval before any containment action.

trigger:
  on: finding
  filter: "(finding.attributes.sha256 != '') || (finding.attributes.ioc != '')"

inputs:
  ioc:
    type: string
    required: false
    default: ""
  ioc_kind:
    type: string
    required: false
    default: "sha256"   # one of sha256|ipv4|domain|ssh_pubkey
  source_host:
    type: string
    required: false
    default: ""

defaults:
  timeout: 12m

steps:
  # 1. Confirm the IOC and pick the hunt scope. Different IOC kinds
  #    point at different host populations — a sha256 spreads via
  #    package/payload (so all hosts of the same OS), an SSH pubkey
  #    spreads via provisioning (so all hosts using the same key
  #    template), an IP/domain via outbound calls (any host).
  - id: scope
    agent: investigator
    node: "{{trigger.source_host}}"
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    data:
      ioc:
        query: iocs.lookup
        params: { kind: "{{trigger.ioc_kind}}", value: "{{trigger.ioc}}" }
    prompt: |
      Confirm IOC and define hunt scope.

      Source finding's host: {{trigger.source_host}}
      IOC: `{{data.ioc.normalized_value}}` (kind={{data.ioc.kind}})
      Catalog metadata: source={{data.ioc.source}} attribution={{data.ioc.attribution}} severity_floor={{data.ioc.severity_floor}}
      Attributes from triggering finding: {{trigger.attributes | json}}

      Decide hunt scope:
        - target_population: one of `same_os`, `all`, `web_tier`, `db_tier`
        - explanation: one sentence why
        - max_hosts: cap parallel fan-out (default 20)

      Emit an orchestration_result finding with attributes:
        valid (bool, copy from {{data.ioc.valid}})
        normalized_ioc (string, copy from {{data.ioc.normalized_value}})
        target_population (string)
        explanation (string)
        max_hosts (int)

  # 2. Fan out to representative hosts. The operator's lab uses
  #    these names; a real deployment would either inject node
  #    selectors via the inputs block or read them from a tag query
  #    (when the engine adds tag selectors).
  - id: hunt
    when: "{{scope.result.valid == true}}"
    agent: threat-hunter
    nodes:
      - threat-rocky-1
      - threat-rocky-2
      - threat-fedora-1
      - threat-fedora-2
      - threat-debian-1
      - threat-debian-2
      - edr-rocky-1
      - edr-fedora-1
      - edr-debian-1
      - fim-debian-1
      - fim-rocky-1
      - fim-fedora-1
      - sre-debian-1
      - sre-rocky-1
      - sre-fedora-1
      - mixed-east-1
      - mixed-west-1
    timeout: 8m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    continue_on_error: true
    prompt: |
      Hunt for IOC `{{scope.result.normalized_ioc}}` ({{trigger.ioc_kind}}) on this host.

      For sha256:
        - find / -type f -size +1k -exec sha256sum {} + 2>/dev/null | grep -F "{{scope.result.normalized_ioc}}"
        - `rpm -Va` (or `dpkg --verify` on Debian) to verify package file integrity
        - Check process memory of long-running daemons

      For ipv4 / domain:
        - `ss -tunap | grep {{scope.result.normalized_ioc}}` (live connections)
        - `journalctl --since "24 hours ago" | grep {{scope.result.normalized_ioc}}` (logs)
        - `ip route get {{scope.result.normalized_ioc}}` if v4

      For ssh_pubkey:
        - Search ~/.ssh/authorized_keys, /root/.ssh/authorized_keys
        - Check /etc/ssh/sshd_config TrustedUserCAKeys
        - grep -rF "{{scope.result.normalized_ioc}}" /etc/ssh /root/.ssh 2>/dev/null

      Emit an orchestration_result finding with attributes:
        host_match (bool)
        evidence (array of strings — paths / connections / lines)
        confidence (string: low|medium|high)
        first_seen (string — RFC3339 from filesystem mtime / log timestamp)

  # 3. Heatmap + recommendation. Aggregate the hunt results into a
  #    spread map and write the on-call brief.
  - id: heatmap
    when: "{{scope.result.valid == true}}"
    agent: incident-responder
    timeout: 5m
    actions:
      - add_finding_tag
      - link_run_to_finding
      - escalate
    prompt: |
      Build the fleet heatmap for IOC `{{scope.result.normalized_ioc}}`.

      Per-host hunt results:
        {{hunt.byNode | json}}

      Aggregate:
        - matched_hosts (array of names)
        - clean_hosts (array of names)
        - errored_hosts (array of names)
        - earliest_first_seen across matches
        - most_common_evidence_type

      Recommend a containment plan, scoped by confidence:
        - confidence=high  → quarantine matched hosts (firewall isolate, snapshot)
        - confidence=medium → snapshot + monitor, no isolation yet
        - confidence=low    → keep watching, ask the operator if they recognise it

      Emit an orchestration_result finding with attributes:
        matched_count (int)
        matched_hosts (array of strings)
        clean_count (int)
        errored_count (int)
        recommended_action (string: quarantine|snapshot|monitor|noop)
        recommended_severity (string: SEV-1|SEV-2|SEV-3)
        plan (string — multi-line markdown, ≤500 words)
        actions (array)

      Actions to request (always):
        add_finding_tag → ioc-hunted (on the source finding)
        link_run_to_finding (on the source finding)

      If matched_count > 0 AND recommended_severity in (SEV-1, SEV-2):
        escalate (reason: short summary, severity matches recommended_severity)

  # 4. Operator-gated containment. Approve to actually isolate the
  #    matched hosts. The action is intentionally explicit — even
  #    inside an automated orchestration, hard isolation needs a
  #    human "go".
  - id: contain
    approval: required
    when: "{{heatmap.result.recommended_action == 'quarantine'}}"
    agent: incident-responder
    timeout: 10m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Containment for IOC `{{scope.result.normalized_ioc}}`. Operator approved.

      Targets ({{heatmap.result.matched_count}} hosts): {{heatmap.result.matched_hosts | json}}

      For each target:
        1. Snapshot key state if a `snapshot.sh` is present at /usr/local/bin
        2. Apply network isolation:
           - iptables/nftables: DROP egress except to {{trigger.source_host}}'s management plane
           - macOS: pf rule via `pfctl -e`
        3. Pause auto-update on the host so the next deploy can't sneak in
        4. Note the action in /etc/okesu/incident-trail.log

      Emit an orchestration_result finding with attributes:
        contained_hosts (array of strings)
        failed_hosts (array of strings)
        actions_per_host (object: hostname → array of actions taken)
---

Config Drift Remediation

T2

Detect drift from baseline (file integrity, package versions, sysctl), attribute the change source, check for lateral spread, and gate the revert behind operator approval.

When to use: You have an authoritative baseline (e.g. CIS, your own gold image) and want drift to surface as actionable findings, not noise.
What to notice
  • Branches on the agent's orchestration_result — sanctioned: true ends the run quietly; sanctioned: false triggers the lateral-spread hunt and a gated revert.
  • Demonstrates {{stepN.result.field}} binding for downstream steps to read structured payloads.
  • Shows how a single playbook handles both "sanctioned change, close quietly" and "unknown drift, plan and page a human" — same spec, different paths.
t2-drift-remediation.yaml
---
name: t2-drift-remediation
description: Tier-2 configuration-drift response. When the FIM (file integrity monitor) flags an unsanctioned change, classifies the change source (sanctioned automation vs unknown), checks for lateral spread to neighbour hosts, and pauses for operator approval before reverting.

trigger:
  on: finding
  filter: "finding.agent == 'instance-integrity' && finding.severity in ['MEDIUM','HIGH','CRITICAL']"

inputs:
  host:
    type: string
    required: false
    default: ""
  finding_id:
    type: int
    required: false
    default: 0

defaults:
  timeout: 10m

steps:
  # 1. Source attribution. A FIM alert needs context: was this an
  #    ansible apply run, a package update from `unattended-upgrades`,
  #    a sysadmin SSH session, or something we don't recognise?
  - id: attribute
    agent: investigator
    node: "{{trigger.host}}"
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Attribute the FIM change for finding #{{trigger.finding_id}} on {{trigger.host}}.

      Pull from the finding:
        - changed_path (file or dir that drifted)
        - change_kind (created|modified|deleted|perms|owner)
        - sha256_before, sha256_after (if recorded)
        - mtime_before, mtime_after

      Cross-reference:
        1. /var/log/dpkg.log or /var/log/dnf.log for package operations
           in the same minute window
        2. /var/log/auth.log or journalctl for sshd accept events tied to
           the same window
        3. /etc/ansible/facts.d or /var/log/ansible.log for last-run id
        4. /var/log/cloud-init-output.log for image-bake activity

      Decide:
        - source: one of `package_manager` | `ansible` | `cloud_init` | `human_ssh` | `unknown`
        - sanctioned: bool — true when source is package_manager / ansible / cloud_init
        - actor: string — username or automation identifier when known
        - confidence: low|medium|high

      Emit an orchestration_result finding with attributes:
        source, sanctioned, actor, confidence,
        change_summary (string, ≤200 chars)

  # 2. Lateral check. If the change was unsanctioned, check whether
  #    the same path drifted on neighbour hosts in the last 24h —
  #    detects mass-config-poisoning attempts.
  - id: lateral
    when: "{{attribute.result.sanctioned == false}}"
    agent: threat-hunter
    nodes:
      - threat-rocky-1
      - threat-fedora-1
      - threat-debian-1
      - edr-rocky-1
      - edr-fedora-1
      - edr-debian-1
      - fim-rocky-1
      - fim-fedora-1
      - sre-rocky-1
      - sre-fedora-1
      - mixed-east-1
      - mixed-west-1
    timeout: 6m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    continue_on_error: true
    prompt: |
      Look for the same drift on this host.

      Reference change:
        path: {{trigger.attributes.changed_path}}
        sha256_after: {{trigger.attributes.sha256_after}}
        first observed on: {{trigger.host}}

      Check:
        - Does the file exist at the same path on this host?
        - Does its sha256 match the after-hash from the source host?
        - When was it last modified?
        - Are there matching entries in this host's local FIM state DB
          (/var/lib/okesu/instance-integrity/state)?

      Emit an orchestration_result finding with attributes:
        present (bool)
        sha256_matches (bool)
        modified_at (string — RFC3339)
        likely_lateral (bool — present + matches + modified within last 24h)

  # 3. Build the remediation plan. Operator gate enforced.
  - id: plan
    when: "{{attribute.result.sanctioned == false}}"
    agent: incident-responder
    timeout: 4m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Plan remediation for the unsanctioned drift.

      Source attribution: {{attribute.result | json}}
      Lateral spread:     {{lateral.byNode | json}}

      Lateral hits ({{lateral.findings | length}}): {{lateral.findings | json}}

      Build a plan covering:
        1. Revert: restore the original file from the FIM's pre-change state
           (FIM keeps the prior bytes when sha256_before is recorded).
        2. Account hygiene: rotate any credential plausibly tied to the
           unauthorised actor (SSH keys, sudoers, app secrets).
        3. Spread cleanup: same revert procedure on each lateral hit.
        4. Forensics: what to capture before reverting (stat output,
           xattrs, timestamps, parent-process tree if still running).

      Emit an orchestration_result finding with attributes:
        revert_targets (array of {host: string, path: string})
        forensics_capture (array of strings — commands to run pre-revert)
        cred_rotation_needed (array of strings — keys/secrets to rotate)
        risk_assessment (string — what could go wrong if revert breaks something)

  # 4. Operator-gated execution. The plan above is read-only;
  #    only approval moves us into making changes on the fleet.
  - id: revert
    approval: required
    when: "{{attribute.result.sanctioned == false}}"
    agent: incident-responder
    timeout: 15m
    actions:
      - update_finding_status
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Execute the approved revert plan.

      Plan: {{plan.result | json}}

      Procedure per target:
        1. Run forensics_capture commands; save output under
           /var/lib/okesu/incidents/{{trigger.finding_id}}/<host>/
        2. Restore the file from the FIM pre-change state (use the
           sha256_before to verify the restored bytes are correct).
        3. Reload any service that reads the file (systemctl reload <svc>).
        4. Re-trigger an immediate FIM tick on the host to confirm the
           revert closed the finding.

      Skip credential rotation in this orchestration — that's a
      separate, deliberately operator-driven flow.

      Emit an orchestration_result finding with attributes:
        reverted_hosts (array of string)
        failed_hosts (array of string)
        forensics_paths (array of string)
        operator_summary (string)
        actions (array)

      Actions to request on finding #{{trigger.finding_id}}:
        - if reverted_hosts contains the source host AND failed_hosts is empty:
            update_finding_status → resolved
              (reason: "drift reverted; FIM re-tick confirmed clean")
            add_finding_tag → auto-reverted
            link_run_to_finding
        - if any failed_hosts:
            add_finding_tag → revert-partial
            link_run_to_finding
            (do not change status — the operator owns the partial-failure call)
---

Cert Expiry Rotation

T2

A daily cron sweep audits cert expiry across every node, auto-rotates lab/dev certs nearing the threshold, and gates production rotation behind operator approval.

When to use: Cert expiry shouldn't live as silent `sre-health` tickets — this runs the discovery-plan-approve-rotate cycle as a daily sweep, closing it in hours, not weeks.
What to notice
  • Step 1 fans out a read-only expiry audit across the fleet (continue_on_error: true, so one unreachable host doesn't kill the sweep); step 2 partitions candidates into auto_rotate (lab/dev) and gated_rotate (production).
  • Step 3 auto-rotates the lab/dev set; step 4 (approval: required) blocks until an operator reviews the rotation brief; step 5 executes against the approved production hosts.
  • Production-shaped: the production rotation sits behind an approval gate and an explicit action allowlist — nothing in prod rotates without human consent.
t2-cert-expiry-rotation.yaml
---
name: t2-cert-expiry-rotation
description: Tier-2 cert lifecycle automation. Runs daily, sweeps every node's mTLS + webhook + jobs runtime certs, auto-rotates anything expiring in <30 days on lab/dev hosts, and pauses for operator approval before rotating production certs.

trigger:
  on: cron
  cron: "0 4 * * *"   # 04:00 UTC daily

inputs:
  rotate_threshold_days:
    type: int
    required: false
    default: 30
  prod_label:
    type: string
    required: false
    default: "production"

defaults:
  timeout: 15m

steps:
  # 1. Fan-out audit: every node reports its cert expiry windows.
  #    Read-only; no rotation here. continue_on_error so a single
  #    unreachable host doesn't kill the sweep.
  - id: audit
    agent: investigator
    nodes:
      - threat-rocky-1
      - threat-rocky-2
      - threat-fedora-1
      - threat-fedora-2
      - threat-debian-1
      - threat-debian-2
      - edr-rocky-1
      - edr-rocky-2
      - edr-fedora-1
      - edr-fedora-2
      - edr-fedora-3
      - edr-debian-1
      - edr-debian-2
      - edr-debian-3
      - edr-debian-4
      - fim-debian-1
      - fim-rocky-1
      - fim-rocky-2
      - fim-fedora-1
      - fim-fedora-2
      - sre-debian-1
      - sre-debian-2
      - sre-rocky-1
      - sre-rocky-2
      - sre-fedora-1
      - sre-fedora-2
      - sre-fedora-3
      - mixed-east-1
      - mixed-west-1
      - mixed-west-2
    continue_on_error: true
    timeout: 5m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Audit cert expiry on this host.

      Inspect (use openssl when possible):
        - /etc/okesu/*-mgmt-certs/client.crt — daimon mgmt-plane cert
        - /etc/okesu/node-certs/client.crt   — jobs / tunnel runtime cert
        - /etc/ssl/certs/* and /etc/letsencrypt/live/*/cert.pem — webhook cert if present

      For each cert: extract `notAfter`, compute days_until_expiry.

      Read /etc/okesu/labels for any `production=true` or `env=prod` line so the
      heatmap can flag prod hosts.

      Emit an orchestration_result finding with attributes:
        is_production (bool)
        certs (array of {path: string, days: int, subject: string})
        soonest_days (int)

  # 2. Build the rotation plan from the audit. Hosts with
  #    soonest_days < threshold are candidates; production hosts go
  #    behind the gate, lab hosts auto-rotate.
  - id: plan
    agent: investigator
    timeout: 3m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Build a rotation plan from the audit.

      Per-host: {{audit.byNode | json}}
      Threshold: {{trigger.rotate_threshold_days}} days

      Partition into:
        auto_rotate: lab/dev hosts (is_production=false) with soonest_days < threshold
        gated_rotate: production hosts with soonest_days < threshold
        skip:        soonest_days >= threshold

      For each rotation entry, list the cert paths needing renewal.

      Emit an orchestration_result finding with attributes:
        auto_rotate (array of {host: string, paths: array of string, days: int})
        gated_rotate (array of {host: string, paths: array of string, days: int})
        skip_count (int)

  # 3. Auto-rotate the lab/dev set. The CP issues a fresh
  #    node-cert (same flow the manual install endpoint uses).
  - id: rotate_lab
    when: "{{plan.result.auto_rotate | length > 0}}"
    agent: investigator
    timeout: 8m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Rotate certs on the auto-tier hosts.

      Targets: {{plan.result.auto_rotate | json}}

      For each entry:
        1. Call the CP's POST /api/nodes/{id}/issue-cert endpoint
           (auth via the orchestration's CP-side runner — same path
           the deploy flow uses).
        2. Drop the new cert + key into the recorded paths
           (atomic mv; preserve permissions).
        3. Reload the daimon: `systemctl reload okesu-agent-<name>` (or flavour equiv).
        4. Verify mgmt-plane connectivity within 60s of reload.

      Emit an orchestration_result finding with attributes:
        rotated_hosts (array of string)
        failed_hosts (array of string)
        elapsed_seconds (int)

  # 4. Operator-gated production rotation. Same procedure, just
  #    waits for the human "go". The brief gives the operator
  #    everything they need to approve confidently.
  - id: gate_prod
    approval: required
    when: "{{plan.result.gated_rotate | length > 0}}"
    agent: incident-responder
    timeout: 5m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Production cert rotation needs approval.

      Targets: {{plan.result.gated_rotate | json}}

      Write a one-page brief covering:
        - Window: pick a low-traffic 30m window in the next 24h based on the
          nodes' previously-recorded peak hours (read /var/log/okesu/access patterns
          if available; otherwise default to 02:00-02:30 host-local).
        - Rollback: restore from the .previous backup the rotation script
          will leave behind. Estimate restore time per host.
        - Blast radius: which services on each host depend on the cert.

      Emit an orchestration_result finding with attributes:
        proposed_window_utc (string — RFC3339)
        per_host_plan (array of {host: string, services: array of string, downtime_estimate_s: int})
        rollback_steps (array of string)

  # 5. Production rotation, post-approval.
  - id: rotate_prod
    when: "{{plan.result.gated_rotate | length > 0}}"
    agent: investigator
    timeout: 30m
    actions:
      - update_finding_status
      - set_finding_severity_override
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Execute the prod rotation plan.

      Plan: {{gate_prod.result.per_host_plan | json}}
      Window: {{gate_prod.result.proposed_window_utc}}

      For each host:
        1. Pre-snapshot existing cert dir
        2. Reissue + drop new cert (same procedure as rotate_lab)
        3. Reload service
        4. Verify mTLS connectivity restored
        5. If verify fails within 90s: roll back from snapshot, mark host failed

      Emit an orchestration_result finding with attributes:
        rotated_hosts (array of string)
        rolled_back_hosts (array of string)
        elapsed_seconds (int)
        operator_summary (string — one paragraph for the audit log)
---

Roll your own

All six of these started life as a regular orchestration draft in the visual editor. The YAML you see here is what the editor serializes on save. To build one yourself, head to Automation → Orchestrations → New in the platform, drop a handful of agents on the canvas, wire them together, and click Save.
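
If you'd rather start from text, the serialized shape is small. Here is a minimal skeleton built only from fields the six recipes above already use; every name and value is a placeholder to adapt, not a platform default:

---
name: my-orchestration
description: One-step starter. Swap in your own trigger, agent, and prompt.

trigger:
  on: finding                  # or `on: cron` with a `cron:` expression
  filter: "finding.severity in ['HIGH', 'CRITICAL']"

defaults:
  timeout: 5m

steps:
  - id: triage
    agent: investigator
    node: "{{trigger.host}}"   # single node; switch to `nodes: [...]` to fan out
    # when: "{{...}}"          # optional gate on a previous step's result
    # approval: required       # optional human-in-the-loop pause
    actions:
      - add_finding_tag
      - link_run_to_finding
    prompt: |
      Describe what the agent should do and which orchestration_result
      attributes to emit for downstream steps.
---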