EDR Critical Response
T2 — Auto-investigate HIGH/CRITICAL EDR findings, hunt the fleet, and pause for operator approval before containment.
- Auto-fires on a finding match — no human in the loop until the approval gate at step 4.
- A `when:` gate skips binary analysis when the triage step did not pinpoint a binary — keeps the run cheap when there's nothing to dissect.
- Step 3 fans out across three host classes (`threat-rocky-1`, `threat-fedora-2`, `threat-debian-2`) to look for the same IOCs across the fleet in parallel.
- Step 4 is `approval: required` — the run pauses until an operator clicks Approve in the dashboard.
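The `trigger.filter` above is the whole gate between "finding lands" and "run starts". A minimal Python sketch of its matching semantics (illustrative only — the engine evaluates its own expression language, not Python):

```python
def matches_trigger(finding: dict) -> bool:
    # Mirrors: finding.severity in ['HIGH', 'CRITICAL'] && finding.agent == 'edr'
    return (
        finding.get("severity") in ("HIGH", "CRITICAL")
        and finding.get("agent") == "edr"
    )

# A CRITICAL EDR finding fires; a HIGH finding from another daimon does not.
fires = matches_trigger({"severity": "CRITICAL", "agent": "edr"})
skipped = matches_trigger({"severity": "HIGH", "agent": "instance-integrity"})
```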
edr-critical-response.yaml
---
name: edr-critical-response
description: Auto-investigate HIGH or CRITICAL EDR findings, hunt the fleet for related artifacts, and pause for operator approval before containment.
# Auto-fire whenever the eventpipeline projects a HIGH or CRITICAL
# finding from the `edr` daimon. The matching finding's fields land
# on `{{trigger.*}}` for the steps to template against.
trigger:
on: finding
filter: "finding.severity in ['HIGH', 'CRITICAL'] && finding.agent == 'edr'"
# Inputs let an operator run this manually for testing — supplying a
# host + finding_id by hand. When the auto-trigger fires, the matching
# finding's fields populate {{trigger.*}} and these defaults aren't
# consulted (auto-trigger payloads carry the real values).
inputs:
host:
type: string
required: false
default: "threat-rocky-1"
finding_id:
type: int
required: false
default: 0
defaults:
timeout: 5m
steps:
# 1. Triage on the affected host. We only need a single agent run
# here — the investigator agent reads the finding context and
# pulls relevant evidence (process tree, recent network, file
# changes).
- id: triage
agent: investigator
node: "{{trigger.host}}"
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Investigate finding #{{trigger.finding_id}} on {{trigger.host}}.
Severity: {{trigger.severity}}
Title: {{trigger.title}}
Resource: {{trigger.resource}}
Dedup key: {{trigger.dedup_key}}
Build a 2-minute-window timeline of process / network / file
events. If the finding pinpoints a binary (path, sha256), emit
an orchestration_result finding with attributes:
sha256, path, cmdline, parent_pid, network_peers (array)
# 2. Analyze the binary if triage extracted one. The `when` gate
# skips the step when there's nothing to analyse.
- id: analyze
when: "{{triage.result.sha256 != ''}}"
agent: binary-analyzer
node: "{{trigger.host}}"
timeout: 10m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Analyze the binary at {{triage.result.path}} (sha256
{{triage.result.sha256}}).
Static analysis only — strings, imports, entropy, packers, IOCs.
Emit an orchestration_result finding with attributes:
family, confidence (low/medium/high), iocs (array of
{kind, value} entries), persistence (string)
# 3. Hunt the fleet. Multi-host fan-out: the same hunt prompt runs
# in parallel on three of our most representative hosts, and
# findings are merged. Operators editing this in the visual
# editor see "3 nodes (fan-out)" on the card.
- id: hunt
agent: threat-hunter
nodes:
- threat-rocky-1
- threat-fedora-2
- threat-debian-2
timeout: 8m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
continue_on_error: true # one host's hunt failing doesn't kill the chain
prompt: |
Hunt for the IOCs from the previous step on this host.
Target sha256: {{triage.result.sha256}}
Family: {{analyze.result.family}}
Other IOCs: {{analyze.result.iocs | json}}
Look at process listings, recent shell history, persistence
mechanisms (systemd units, cron, .bashrc), open network
connections, and the last 6 hours of relevant log entries.
Emit an orchestration_result finding with attributes:
matches (count), evidence (array of strings), additional_hosts (array)
# 4. Containment plan, gated. The amber pulse on this card in the
# canvas tells the operator they need to click Approve before
# the incident-responder agent dispatches.
- id: respond
approval: required
agent: incident-responder
node: "{{trigger.host}}"
timeout: 15m
actions:
- update_finding_status
- add_finding_tag
- link_run_to_finding
prompt: |
Draft a containment plan for finding {{trigger.finding_id}}.
Triage summary (last 50 lines):
{{triage.output | tail(50)}}
Binary analysis result:
{{analyze.result | json}}
Fleet hunt found {{hunt.findings | length}} related artifact(s)
across {{hunt.nodes | length}} host(s):
{{hunt.findings | json}}
Per-host hunt detail:
- threat-rocky-1: {{hunt.byNode["threat-rocky-1"].findings | length}} matches
- threat-fedora-2: {{hunt.byNode["threat-fedora-2"].findings | length}} matches
- threat-debian-2: {{hunt.byNode["threat-debian-2"].findings | length}} matches
Output a written plan covering:
1. Immediate isolation (SG / firewall / kubectl cordon)
2. Evidence preservation (hashes, paths, command list)
3. Eradication (specific files / accounts / persistence to remove)
4. Validation steps and rollback procedure if anything breaks
Then emit an orchestration_result finding with attributes:
plan (string — the markdown above)
actions (array)
Actions to request on finding {{trigger.finding_id}}:
update_finding_status → investigating
(reason: "containment plan ready: <one-line summary>")
add_finding_tag → containment-planned
link_run_to_finding
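Step 3's fan-out produces both a merged findings list and a per-host `byNode` map, which step 4 templates against. A rough Python sketch of that merge — hypothetical shapes, not the engine's actual data model:

```python
def merge_fanout(per_node_results: dict) -> dict:
    """Merge per-node step results into the shape the templates consume:
    a flat `findings` list plus a `byNode` map keyed by hostname."""
    merged = {"nodes": [], "findings": [], "byNode": {}}
    for host, result in per_node_results.items():
        merged["nodes"].append(host)
        # continue_on_error: an errored host contributes no findings
        findings = [] if result.get("error") else result.get("findings", [])
        merged["findings"].extend(findings)
        merged["byNode"][host] = {"findings": findings, "error": result.get("error")}
    return merged

hunt = merge_fanout({
    "threat-rocky-1": {"findings": [{"matches": 2}]},
    "threat-fedora-2": {"findings": []},
    "threat-debian-2": {"error": "agent offline"},
})
# hunt["findings"] holds the one real hit; byNode keeps per-host detail
```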
---
Finding Autotriage (T1)
T1 — Generic Tier-1 triage that runs on every HIGH/CRITICAL finding, attaches an evidence summary, and decides whether to escalate.
- No `when:` gate — runs on everything matching the trigger filter; cheap by design.
- Uses an `actions:` allowlist so the agent can update finding status, set severity overrides, and add tags without an approval gate.
- Demonstrates the platform's default action-class policy: only what's explicitly listed is permitted.
t1-finding-autotriage.yaml
---
name: t1-finding-autotriage
description: Tier-1 auto-triage fast-lane for HIGH/CRITICAL findings. Decides real-issue vs known noise; suppresses noise, summarises the rest, and escalates when the finding survives both checks. INFO/LOW findings are handled in batches by `t1-finding-batch-triage` instead — per-finding triage there saturated the API under noise bursts.
# Fires only on HIGH/CRITICAL — the per-finding fast-lane. INFO/LOW
# go through the batched `t1-finding-batch-triage` (cron 5min) so we
# don't pay one LLM startup per noise event.
trigger:
on: finding
filter: "finding.severity in ['HIGH', 'CRITICAL']"
inputs:
host:
type: string
required: false
default: ""
finding_id:
type: int
required: false
default: 0
defaults:
timeout: 5m
steps:
# 1. Classify: noise vs real. The investigator agent reads the
# finding context, checks the recent fleet pattern (is this the
# same finding firing on every host? is the source a known
# scanner / monitoring system?), and emits a verdict +
# requests CP-side actions to mutate the finding accordingly.
- id: classify
agent: investigator
node: "{{trigger.host}}"
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Classify finding #{{trigger.finding_id}} on {{trigger.host}}.
Severity: {{trigger.severity}}
Title: {{trigger.title}}
Agent: {{trigger.agent}}
Resource: {{trigger.resource}}
Dedup key: {{trigger.dedup_key}}
Decide:
- verdict: one of `noise` | `confirmed` | `unknown`
- reasoning: one short sentence
- suppress_pattern: glob to suppress future identical findings, or empty
- escalate: bool — set true ONLY when verdict=confirmed AND severity in (HIGH, CRITICAL)
Heuristics for `noise`:
- Same finding fired on >=5 hosts in last 60 min with identical title
- Source attributes match a known internal scanner / monitor IP
- The change recorded matches a sanctioned automation (ansible run id, package manager update)
- Self-reported finding from the agent that itself deployed (collector seeing its own writes)
Emit an orchestration_result finding with attributes:
verdict (string), reasoning (string), suppress_pattern (string),
escalate (bool), actions (array — see below).
Actions to request:
- verdict=noise:
update_finding_status → false_positive (reason: short noise reason)
set_finding_severity_override → INFO
add_finding_tag → auto-triaged-noise
link_run_to_finding
- verdict=confirmed:
add_finding_tag → auto-confirmed
link_run_to_finding
- verdict=unknown:
add_finding_tag → needs-human
link_run_to_finding
The full action protocol is at agents/_orchestration-actions.md.
# 2. Auto-suppress when classified as noise. The action runs in the
# same node as the finding came from so its scope is local; a
# fleet-wide suppression would be an explicit T2 step.
- id: auto_suppress
when: "{{classify.result.verdict == 'noise' && classify.result.suppress_pattern != ''}}"
agent: investigator
node: "{{trigger.host}}"
timeout: 2m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Apply local suppression for finding #{{trigger.finding_id}}.
Pattern: `{{classify.result.suppress_pattern}}`
Reason: {{classify.result.reasoning}}
Add the pattern to /etc/okesu/suppressions.yml on this host (create if absent), under
a `local:` block keyed by today's ISO date so we can audit later.
Emit an orchestration_result finding with attributes:
applied (bool), suppression_path (string), entries_added (int)
# 3. Confirmed-real summary. When `escalate` is true, write a
# short operator-readable summary so the human opening the
# finding sees an executive answer rather than a raw event log.
- id: summarize
when: "{{classify.result.escalate == true}}"
agent: investigator
node: "{{trigger.host}}"
timeout: 3m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Write a one-paragraph operator brief for finding #{{trigger.finding_id}}.
Verdict: confirmed real
Reasoning: {{classify.result.reasoning}}
Keep it ≤120 words. Cover:
- What happened (what changed / fired / observed)
- Blast radius (this host? cluster? fleet?)
- Recommended next action (1-2 bullets)
- Confidence level
Emit an orchestration_result finding with attributes:
brief (string), blast_radius (string: host|cluster|fleet),
confidence (string: low|medium|high)
---
Failed-Login Noise Dedup (T1)
T1 — Bundle authentication-failure noise into one incident; only escalate when a source actually authenticated successfully or hit a sensitive account.
- A precise `trigger.filter` keeps the orchestration from firing on findings that don't mention auth.
- Step 1 collects the 60-minute window of failed logins and aggregates by source IP — the orchestration moves the burden of "is this brute-force or background radiation" onto the agent.
- Step 2 escalates only when conditions hold (success after a streak of failures, or a sensitive account targeted).
t1-failed-login-noise-dedup.yaml
---
name: t1-failed-login-noise-dedup
description: Tier-1 dedup of authentication-failure noise. Distinguishes brute-force attempts from internet background radiation (vuln scanners, mass DDoS sweeps), bundles repeats into a single SEV-3 incident, and only escalates when the same source actually authenticated successfully or hit a sensitive account.
trigger:
on: finding
filter: "finding.title contains 'auth' && (finding.title contains 'fail' || finding.title contains 'invalid')"
inputs:
host:
type: string
required: false
default: ""
finding_id:
type: int
required: false
default: 0
defaults:
timeout: 4m
steps:
# 1. Source attribution — collect the source IPs across recent
# failures, classify each, and decide if any pattern looks like
# real brute force vs background scanner traffic.
- id: triage
agent: investigator
node: "{{trigger.host}}"
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Triage failed-login finding #{{trigger.finding_id}} on {{trigger.host}}.
Gather:
- `journalctl -u sshd --since "60 minutes ago" | grep -iE "failed|invalid"` (or /var/log/auth.log)
- Aggregate by source IP: count, first seen, last seen, account targeted
- Look up each source IP: is it on a known threat-intel feed?
(Use only what's locally available — no external lookups.)
- Check for any SUCCESSFUL login from those source IPs in the same window:
`journalctl -u sshd --since "60 minutes ago" | grep -i "accepted"`
Classify each source:
- `scanner` — high-volume, low-effort, hits common usernames (root, admin, oracle), no success
- `targeted` — focused on a specific real account, slower cadence, sometimes succeeds
- `unknown` — needs human eyes
Verdict:
- severity: one of `noise` | `elevated` | `incident`
- noise: all sources are scanners, no successful logins
- elevated: at least one targeted source, no success
- incident: any successful login from a flagged source
- sources_count, attempts_total, accounts_hit (array)
Emit an orchestration_result finding with attributes:
severity (string)
sources_count (int)
attempts_total (int)
accounts_hit (array of strings)
has_successful_login (bool)
worst_source (string — IP)
rollup_window (string — e.g. "60m")
# 2. Auto-bundle scanner noise. Writes a single dedup'd finding
# in place of N raw ones, suppresses the rest of the bundle
# for 24h via local fail2ban-style hosts.deny entry.
- id: deduplicate
when: "{{triage.result.severity == 'noise'}}"
agent: investigator
node: "{{trigger.host}}"
timeout: 3m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Bundle the noise on {{trigger.host}}.
The triage classified the {{triage.result.sources_count}} sources as
scanners. Apply automatic mitigation:
- If iptables/nftables is present and an "okesu-scanners" chain exists,
add the source IPs to it with a 24h timeout
- Otherwise, append to /etc/hosts.deny for the next 24h with comment
`# okesu T1 scanner-noise YYYY-MM-DD`
- Roll the {{triage.result.attempts_total}} raw findings into one
summary finding tagged `noise-bundled`
Emit an orchestration_result finding with attributes:
ips_blocked (int)
block_method (string: iptables|nftables|hosts.deny|none)
bundle_id (string)
actions (array)
Actions to request on the source finding #{{trigger.finding_id}}:
update_finding_status → false_positive
(reason: "scanner noise; bundled into <bundle_id>")
set_finding_severity_override → INFO
add_finding_tag → auto-triaged-noise
add_finding_tag → noise-bundled
link_run_to_finding
# 3. Escalate when something targeted the host. The on-call sees
# the brief inline rather than digging through 200 raw findings.
- id: escalate_brief
when: "{{triage.result.severity != 'noise'}}"
agent: incident-responder
node: "{{trigger.host}}"
timeout: 4m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Write an on-call brief for the auth-failure incident on {{trigger.host}}.
Triage said:
severity={{triage.result.severity}}
sources={{triage.result.sources_count}}
attempts={{triage.result.attempts_total}}
accounts={{triage.result.accounts_hit}}
had_success={{triage.result.has_successful_login}}
worst_source={{triage.result.worst_source}}
Cover:
- One-paragraph timeline of what happened
- Whether containment is needed NOW or can wait until business hours
- Recommended actions, ordered by impact
(key rotation? fail2ban? service-account audit? IR playbook?)
Keep it ≤200 words. Emit an orchestration_result finding with attributes:
urgency (string: low|medium|high)
recommended_actions (array of strings)
suggested_severity (string: SEV-2|SEV-3|SEV-4)
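Step 1's per-source aggregation can be sketched as a small log fold — the sshd line format here is typical but varies by distro, so treat the regex as an assumption:

```python
import re
from collections import defaultdict

# Matches both "Failed password for alice" and "Failed password for invalid user admin"
LINE = re.compile(r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+)")

def aggregate_failures(lines):
    """Group failed-login lines by source IP: attempt count + accounts hit."""
    per_ip = defaultdict(lambda: {"count": 0, "accounts": set()})
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        entry = per_ip[m.group("ip")]
        entry["count"] += 1
        entry["accounts"].add(m.group("user"))
    return dict(per_ip)

logs = [
    "sshd[912]: Failed password for invalid user admin from 203.0.113.7 port 40022",
    "sshd[913]: Failed password for invalid user oracle from 203.0.113.7 port 40024",
    "sshd[950]: Failed password for alice from 198.51.100.9 port 55010",
]
stats = aggregate_failures(logs)
# 203.0.113.7 sprays common usernames (scanner-shaped); 198.51.100.9 targets a real account
```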
---
Fleet IOC Hunt (T2)
T2 — Fan out across every host in the fleet to hunt for a confirmed IOC, then aggregate.
- Pure fan-out shape: one step with `nodes: [...]` — each host runs the same prompt with the same IOC inputs.
- `continue_on_error: true` — one host's hunt failing (offline, daimon stale) doesn't kill the run; the synthesizer step accepts partial results.
- The post-fan-out summarizer reads `{{stepN.byNode["host-1"].findings}}` to attribute matches per host.
t2-fleet-ioc-hunt.yaml
---
name: t2-fleet-ioc-hunt
description: Tier-2 fleet-wide IOC hunt. Triggered when any finding surfaces a usable indicator (sha256, IP, domain, key fingerprint). Fans out to every reachable host of the same OS family, hunts the IOC, builds a heatmap, and pauses for operator approval before any containment action.
trigger:
on: finding
filter: "(finding.attributes.sha256 != '') || (finding.attributes.ioc != '')"
inputs:
ioc:
type: string
required: false
default: ""
ioc_kind:
type: string
required: false
default: "sha256" # one of sha256|ipv4|domain|ssh_pubkey
source_host:
type: string
required: false
default: ""
defaults:
timeout: 12m
steps:
# 1. Confirm the IOC and pick the hunt scope. Different IOC kinds
# point at different host populations — a sha256 spreads via
# package/payload (so all hosts of the same OS), an SSH pubkey
# spreads via provisioning (so all hosts using the same key
# template), an IP/domain via outbound calls (any host).
- id: scope
agent: investigator
node: "{{trigger.source_host}}"
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
data:
ioc:
query: iocs.lookup
params: { kind: "{{trigger.ioc_kind}}", value: "{{trigger.ioc}}" }
prompt: |
Confirm IOC and define hunt scope.
Source finding's host: {{trigger.source_host}}
IOC: `{{data.ioc.normalized_value}}` (kind={{data.ioc.kind}})
Catalog metadata: source={{data.ioc.source}} attribution={{data.ioc.attribution}} severity_floor={{data.ioc.severity_floor}}
Attributes from triggering finding: {{trigger.attributes | json}}
Decide hunt scope:
- target_population: one of `same_os`, `all`, `web_tier`, `db_tier`
- explanation: one sentence why
- max_hosts: cap parallel fan-out (default 20)
Emit an orchestration_result finding with attributes:
valid (bool, copy from {{data.ioc.valid}})
normalized_ioc (string, copy from {{data.ioc.normalized_value}})
target_population (string)
explanation (string)
max_hosts (int)
# 2. Fan out to representative hosts. The operator's lab uses
# these names; a real deployment would either inject node
# selectors via the inputs block or read them from a tag query
# (when the engine adds tag selectors).
- id: hunt
when: "{{scope.result.valid == true}}"
agent: threat-hunter
nodes:
- threat-rocky-1
- threat-rocky-2
- threat-fedora-1
- threat-fedora-2
- threat-debian-1
- threat-debian-2
- edr-rocky-1
- edr-fedora-1
- edr-debian-1
- fim-debian-1
- fim-rocky-1
- fim-fedora-1
- sre-debian-1
- sre-rocky-1
- sre-fedora-1
- mixed-east-1
- mixed-west-1
timeout: 8m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
continue_on_error: true
prompt: |
Hunt for IOC `{{scope.result.normalized_ioc}}` ({{trigger.ioc_kind}}) on this host.
For sha256:
- find / -type f -size +1k -exec sha256sum {} + 2>/dev/null | grep -F "{{scope.result.normalized_ioc}}"
- rpm -qa | xargs rpm -ql 2>/dev/null (or dpkg -L for Debian) and verify integrity
- Check process memory of long-running daemons
For ipv4 / domain:
- `ss -tunap | grep {{scope.result.normalized_ioc}}` (live connections)
- `journalctl --since "24 hours ago" | grep {{scope.result.normalized_ioc}}` (logs)
- `ip route get {{scope.result.normalized_ioc}}` if v4
For ssh_pubkey:
- Search ~/.ssh/authorized_keys, /root/.ssh/authorized_keys
- Check /etc/ssh/sshd_config TrustedUserCAKeys
- grep -rF "{{scope.result.normalized_ioc}}" /etc/ssh /root/.ssh 2>/dev/null
Emit an orchestration_result finding with attributes:
host_match (bool)
evidence (array of strings — paths / connections / lines)
confidence (string: low|medium|high)
first_seen (string — RFC3339 from filesystem mtime / log timestamp)
# 3. Heatmap + recommendation. Aggregate the hunt results into a
# spread map and write the on-call brief.
- id: heatmap
when: "{{scope.result.valid == true}}"
agent: incident-responder
timeout: 5m
actions:
- add_finding_tag
- link_run_to_finding
- escalate
prompt: |
Build the fleet heatmap for IOC `{{scope.result.normalized_ioc}}`.
Per-host hunt results:
{{hunt.byNode | json}}
Aggregate:
- matched_hosts (array of names)
- clean_hosts (array of names)
- errored_hosts (array of names)
- earliest_first_seen across matches
- most_common_evidence_type
Recommend a containment plan, scoped by confidence:
- confidence=high → quarantine matched hosts (firewall isolate, snapshot)
- confidence=medium → snapshot + monitor, no isolation yet
- confidence=low → keep watching, ask the operator if they recognise it
Emit an orchestration_result finding with attributes:
matched_hosts (array of string)
matched_count (int)
clean_count (int)
errored_count (int)
recommended_action (string: quarantine|snapshot|monitor|noop)
recommended_severity (string: SEV-1|SEV-2|SEV-3)
plan (string — multi-line markdown, ≤500 words)
actions (array)
Actions to request (always):
add_finding_tag → ioc-hunted (on the source finding)
link_run_to_finding (on the source finding)
If matched_count > 0 AND recommended_severity in (SEV-1, SEV-2):
escalate (reason: short summary, severity matches recommended_severity)
# 4. Operator-gated containment. Approve to actually isolate the
# matched hosts. The action is intentionally explicit — even
# inside an automated orchestration, hard isolation needs a
# human "go".
- id: contain
approval: required
when: "{{heatmap.result.recommended_action == 'quarantine'}}"
agent: incident-responder
timeout: 10m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Containment for IOC `{{scope.result.normalized_ioc}}`. Operator approved.
Targets ({{heatmap.result.matched_count}} hosts): {{heatmap.result.matched_hosts | json}}
For each target:
1. Snapshot key state if a `snapshot.sh` is present at /usr/local/bin
2. Apply network isolation:
- iptables/nftables: DROP egress except to {{trigger.source_host}}'s management plane
- macOS: pf rule via `pfctl -e`
3. Pause auto-update on the host so the next deploy can't sneak in
4. Note the action in /etc/okesu/incident-trail.log
Emit an orchestration_result finding with attributes:
contained_hosts (array of strings)
failed_hosts (array of strings)
actions_per_host (object: hostname → array of actions taken)
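Step 3's heatmap over `{{hunt.byNode}}` is a three-way partition of per-host results. A Python sketch with made-up result shapes (`host_match` and `error` keys mirror the hunt step's emitted attributes):

```python
def build_heatmap(by_node: dict) -> dict:
    """Partition per-host hunt results into matched / clean / errored."""
    matched, clean, errored = [], [], []
    for host, result in sorted(by_node.items()):
        if result.get("error"):
            errored.append(host)        # offline host, stale daimon, timeout
        elif result.get("host_match"):
            matched.append(host)        # IOC found on this host
        else:
            clean.append(host)
    return {
        "matched_hosts": matched,
        "clean_hosts": clean,
        "errored_hosts": errored,
        "matched_count": len(matched),
    }

heatmap = build_heatmap({
    "threat-rocky-1": {"host_match": True, "confidence": "high"},
    "threat-fedora-2": {"host_match": False},
    "fim-debian-1": {"error": "timeout"},
})
```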
---
Config Drift Remediation (T2)
T2 — Detect drift from baseline (file integrity, package versions, sysctl), classify it, and either roll back automatically or open a case for review.
- Branches on the attribution verdict — a sanctioned change (`attribute.result.sanctioned == true`) ends the run; unsanctioned drift triggers a lateral hunt and a gated revert.
- Demonstrates `{{stepN.result.field}}` binding for downstream steps to read structured payloads.
- Shows how a single playbook handles both "auto-correct trivial drift" and "page a human for serious drift" — same spec, different paths.
t2-drift-remediation.yaml
---
name: t2-drift-remediation
description: Tier-2 configuration-drift response. When the FIM (file integrity monitor) flags an unsanctioned change, classifies the change source (sanctioned automation vs unknown), checks for lateral spread to neighbour hosts, and pauses for operator approval before reverting.
trigger:
on: finding
filter: "finding.agent == 'instance-integrity' && finding.severity in ['MEDIUM','HIGH','CRITICAL']"
inputs:
host:
type: string
required: false
default: ""
finding_id:
type: int
required: false
default: 0
defaults:
timeout: 10m
steps:
# 1. Source attribution. A FIM alert needs context: was this an
# ansible apply run, a package update from `unattended-upgrades`,
# a sysadmin SSH session, or something we don't recognise?
- id: attribute
agent: investigator
node: "{{trigger.host}}"
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Attribute the FIM change for finding #{{trigger.finding_id}} on {{trigger.host}}.
Pull from the finding:
- changed_path (file or dir that drifted)
- change_kind (created|modified|deleted|perms|owner)
- sha256_before, sha256_after (if recorded)
- mtime_before, mtime_after
Cross-reference:
1. /var/log/dpkg.log or /var/log/dnf.log for package operations
in the same minute window
2. /var/log/auth.log or journalctl for sshd accept events tied to
the same window
3. /etc/ansible/facts.d or /var/log/ansible.log for last-run id
4. /var/log/cloud-init-output.log for image-bake activity
Decide:
- source: one of `package_manager` | `ansible` | `cloud_init` | `human_ssh` | `unknown`
- sanctioned: bool — true when source is package_manager / ansible / cloud_init
- actor: string — username or automation identifier when known
- confidence: low|medium|high
Emit an orchestration_result finding with attributes:
source, sanctioned, actor, confidence,
change_summary (string, ≤200 chars)
# 2. Lateral check. If the change was unsanctioned, check whether
# the same path drifted on neighbour hosts in the last 24h —
# detects mass-config-poisoning attempts.
- id: lateral
when: "{{attribute.result.sanctioned == false}}"
agent: threat-hunter
nodes:
- threat-rocky-1
- threat-fedora-1
- threat-debian-1
- edr-rocky-1
- edr-fedora-1
- edr-debian-1
- fim-rocky-1
- fim-fedora-1
- sre-rocky-1
- sre-fedora-1
- mixed-east-1
- mixed-west-1
timeout: 6m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
continue_on_error: true
prompt: |
Look for the same drift on this host.
Reference change:
path: {{trigger.attributes.changed_path}}
sha256_after: {{trigger.attributes.sha256_after}}
first observed on: {{trigger.host}}
Check:
- Does the file exist at the same path on this host?
- Does its sha256 match the after-hash from the source host?
- When was it last modified?
- Are there matching entries in this host's local FIM state DB
(/var/lib/okesu/instance-integrity/state)?
Emit an orchestration_result finding with attributes:
present (bool)
sha256_matches (bool)
modified_at (string — RFC3339)
likely_lateral (bool — present + matches + modified within last 24h)
# 3. Build the remediation plan. Operator gate enforced.
- id: plan
when: "{{attribute.result.sanctioned == false}}"
agent: incident-responder
timeout: 4m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Plan remediation for the unsanctioned drift.
Source attribution: {{attribute.result | json}}
Lateral spread: {{lateral.byNode | json}}
Lateral hits ({{lateral.findings | length}}): {{lateral.findings | json}}
Build a plan covering:
1. Revert: restore the original file from the FIM's pre-change state
(FIM keeps the prior bytes when sha256_before is recorded).
2. Account hygiene: rotate any credential plausibly tied to the
unauthorised actor (SSH keys, sudoers, app secrets).
3. Spread cleanup: same revert procedure on each lateral hit.
4. Forensics: what to capture before reverting (stat output,
xattrs, timestamps, parent-process tree if still running).
Emit an orchestration_result finding with attributes:
revert_targets (array of {host: string, path: string})
forensics_capture (array of strings — commands to run pre-revert)
cred_rotation_needed (array of strings — keys/secrets to rotate)
risk_assessment (string — what could go wrong if revert breaks something)
# 4. Operator-gated execution. The plan above is read-only;
# only approval moves us into making changes on the fleet.
- id: revert
approval: required
when: "{{attribute.result.sanctioned == false}}"
agent: incident-responder
timeout: 15m
actions:
- update_finding_status
- add_finding_tag
- link_run_to_finding
prompt: |
Execute the approved revert plan.
Plan: {{plan.result | json}}
Procedure per target:
1. Run forensics_capture commands; save output under
/var/lib/okesu/incidents/{{trigger.finding_id}}/<host>/
2. Restore the file from the FIM pre-change state (use the
sha256_before to verify the restored bytes are correct).
3. Reload any service that reads the file (systemctl reload <svc>).
4. Re-trigger an immediate FIM tick on the host to confirm the
revert closed the finding.
Skip credential rotation in this orchestration — that's a
separate, deliberately operator-driven flow.
Emit an orchestration_result finding with attributes:
reverted_hosts (array of string)
failed_hosts (array of string)
forensics_paths (array of string)
operator_summary (string)
actions (array)
Actions to request on finding #{{trigger.finding_id}}:
- if reverted_hosts contains the source host AND failed_hosts is empty:
update_finding_status → resolved
(reason: "drift reverted; FIM re-tick confirmed clean")
add_finding_tag → auto-reverted
link_run_to_finding
- if any failed_hosts:
add_finding_tag → revert-partial
link_run_to_finding
(do not change status — the operator owns the partial-failure call)
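The attribution step's core move is a timestamp join: did a sanctioned log source (dpkg/dnf, ansible, cloud-init) record activity inside the same minute window as the file change? A Python sketch, with timestamps as Unix seconds and the log parsing elided:

```python
def attribute_change(change_ts: int, source_events: dict, window_s: int = 60):
    """Return (source, sanctioned) for the log source with an event
    closest to the change inside the window, else ('unknown', False)."""
    sanctioned_sources = {"package_manager", "ansible", "cloud_init"}
    best = None
    for source, timestamps in source_events.items():
        for ts in timestamps:
            delta = abs(ts - change_ts)
            if delta <= window_s and (best is None or delta < best[1]):
                best = (source, delta)
    if best is None:
        return ("unknown", False)
    return (best[0], best[0] in sanctioned_sources)

# dpkg logged an install 12s before the file's mtime change → sanctioned drift.
src, ok = attribute_change(
    1_700_000_112,
    {"package_manager": [1_700_000_100], "human_ssh": [1_699_990_000]},
)
```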
---
Cert Expiry Rotation (T2)
T2 — Daily cron sweep of every node's TLS certs: auto-rotate lab/dev certs expiring soon, and gate production rotation behind operator approval.
- Step 1 fans out a read-only expiry audit across the fleet; step 2 partitions hosts into auto-rotate (lab/dev), gated-rotate (production), and skip.
- Step 3 auto-rotates the lab/dev set; step 4 (`approval: required`) blocks until an operator reviews the production brief; step 5 executes against the approved hosts.
- Production-shaped: every step carries an explicit `actions:` allowlist, and nothing production-facing rotates without human consent.
t2-cert-expiry-rotation.yaml
---
name: t2-cert-expiry-rotation
description: Tier-2 cert lifecycle automation. Runs daily, sweeps every node's mTLS + webhook + jobs runtime certs, auto-rotates anything expiring in <30 days on lab/dev hosts, and pauses for operator approval before rotating production certs.
trigger:
on: cron
cron: "0 4 * * *" # 04:00 UTC daily
inputs:
rotate_threshold_days:
type: int
required: false
default: 30
prod_label:
type: string
required: false
default: "production"
defaults:
timeout: 15m
steps:
# 1. Fan-out audit: every node reports its cert expiry windows.
# Read-only; no rotation here. continue_on_error so a single
# unreachable host doesn't kill the sweep.
- id: audit
agent: investigator
nodes:
- threat-rocky-1
- threat-rocky-2
- threat-fedora-1
- threat-fedora-2
- threat-debian-1
- threat-debian-2
- edr-rocky-1
- edr-rocky-2
- edr-fedora-1
- edr-fedora-2
- edr-fedora-3
- edr-debian-1
- edr-debian-2
- edr-debian-3
- edr-debian-4
- fim-debian-1
- fim-rocky-1
- fim-rocky-2
- fim-fedora-1
- fim-fedora-2
- sre-debian-1
- sre-debian-2
- sre-rocky-1
- sre-rocky-2
- sre-fedora-1
- sre-fedora-2
- sre-fedora-3
- mixed-east-1
- mixed-west-1
- mixed-west-2
continue_on_error: true
timeout: 5m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Audit cert expiry on this host.
Inspect (use openssl when possible):
- /etc/okesu/*-mgmt-certs/client.crt — daimon mgmt-plane cert
- /etc/okesu/node-certs/client.crt — jobs / tunnel runtime cert
- /etc/ssl/certs/* and /etc/letsencrypt/live/*/cert.pem — webhook cert if present
For each cert: extract `notAfter`, compute days_until_expiry.
Read /etc/okesu/labels for any `production=true` or `env=prod` line so the
heatmap can flag prod hosts.
Emit an orchestration_result finding with attributes:
is_production (bool)
certs (array of {path: string, days: int, subject: string})
soonest_days (int)
# 2. Build the rotation plan from the audit. Hosts with
# soonest_days < threshold are candidates; production hosts go
# behind the gate, lab hosts auto-rotate.
- id: plan
agent: investigator
timeout: 3m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Build a rotation plan from the audit.
Per-host: {{audit.byNode | json}}
Threshold: {{trigger.rotate_threshold_days}} days
Partition into:
auto_rotate: lab/dev hosts (is_production=false) with soonest_days < threshold
gated_rotate: production hosts with soonest_days < threshold
skip: soonest_days >= threshold
For each rotation entry, list the cert paths needing renewal.
Emit an orchestration_result finding with attributes:
auto_rotate (array of {host: string, paths: array of string, days: int})
gated_rotate (array of {host: string, paths: array of string, days: int})
skip_count (int)
# 3. Auto-rotate the lab/dev set. The CP issues a fresh
# node-cert (same flow the manual install endpoint uses).
- id: rotate_lab
when: "{{plan.result.auto_rotate | length > 0}}"
agent: investigator
timeout: 8m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Rotate certs on the auto-tier hosts.
Targets: {{plan.result.auto_rotate | json}}
For each entry:
1. Call the CP's POST /api/nodes/{id}/issue-cert endpoint
(auth via the orchestration's CP-side runner — same path
the deploy flow uses).
2. Drop the new cert + key into the recorded paths
(atomic mv; preserve permissions).
3. Reload the daimon: `systemctl reload okesu-agent-<name>` (or flavour equiv).
4. Verify mgmt-plane connectivity within 60s of reload.
Emit an orchestration_result finding with attributes:
rotated_hosts (array of string)
failed_hosts (array of string)
elapsed_seconds (int)
# 4. Operator-gated production rotation. Same procedure, just
# waits for the human "go". The brief gives the operator
# everything they need to approve confidently.
- id: gate_prod
approval: required
when: "{{plan.result.gated_rotate | length > 0}}"
agent: incident-responder
timeout: 5m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Production cert rotation needs approval.
Targets: {{plan.result.gated_rotate | json}}
Write a one-page brief covering:
- Window: pick a low-traffic 30m window in the next 24h based on the
nodes' previously-recorded peak hours (read /var/log/okesu/access patterns
if available; otherwise default to 02:00-02:30 host-local).
- Rollback: restore from the .previous backup the rotation script
will leave behind. Estimate restore time per host.
- Blast radius: which services on each host depend on the cert.
Emit an orchestration_result finding with attributes:
proposed_window_utc (string — RFC3339)
per_host_plan (array of {host: string, services: array of string, downtime_estimate_s: int})
rollback_steps (array of string)
# 5. Production rotation, post-approval.
- id: rotate_prod
when: "{{plan.result.gated_rotate | length > 0}}"
agent: investigator
timeout: 30m
actions:
- update_finding_status
- set_finding_severity_override
- add_finding_tag
- link_run_to_finding
prompt: |
Execute the prod rotation plan.
Plan: {{gate_prod.result.per_host_plan | json}}
Window: {{gate_prod.result.proposed_window_utc}}
For each host:
1. Pre-snapshot existing cert dir
2. Reissue + drop new cert (same procedure as rotate_lab)
3. Reload service
4. Verify mTLS connectivity restored
5. If verify fails within 90s: roll back from snapshot, mark host failed
Emit an orchestration_result finding with attributes:
rotated_hosts (array of string)
rolled_back_hosts (array of string)
elapsed_seconds (int)
operator_summary (string — one paragraph for the audit log)
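Step 2's partition reduces to a days-until-expiry threshold plus the production flag. A Python sketch over host records shaped like the audit step's emitted attributes:

```python
def partition_rotations(hosts, threshold_days=30):
    """Split audited hosts into auto_rotate (lab/dev), gated_rotate (prod),
    and skip, based on the soonest-expiring cert per host."""
    plan = {"auto_rotate": [], "gated_rotate": [], "skip_count": 0}
    for h in hosts:
        if h["soonest_days"] >= threshold_days:
            plan["skip_count"] += 1          # nothing expiring soon
        elif h["is_production"]:
            plan["gated_rotate"].append(h["host"])  # behind the approval gate
        else:
            plan["auto_rotate"].append(h["host"])   # rotate without asking
    return plan

plan = partition_rotations([
    {"host": "sre-rocky-1", "is_production": False, "soonest_days": 12},
    {"host": "mixed-east-1", "is_production": True, "soonest_days": 9},
    {"host": "edr-debian-1", "is_production": False, "soonest_days": 180},
])
```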
---
Roll your own
All six of these started life as drafts in the visual editor; the YAML you see here is what the editor serializes on save. To build one yourself, head to Automation → Orchestrations → New in the platform, drop a handful of agents on the canvas, wire them together, and click Save.