How We Built a Real Security Operations Center With Open-Source Tools

A behind-the-scenes look at how we built a production SOC for a mid-sized enterprise using Wazuh, DFIR-IRIS, and a custom Python middleware — what worked, what broke, and the engineering decisions that actually mattered.

If you have ever priced a commercial SIEM or SOC platform, you already know the punch line: licensing alone can dwarf the salaries of the analysts who use it. The promise of open-source SOC tooling is obvious. The reality — making those tools behave like a coherent product — is where most projects stall.

This post is the field report. No SaaS. No cloud lock-in. Everything runs on Docker on the customer’s own servers, and total software cost is zero.


The stack at a glance

| Layer | Tool | What it does |
| --- | --- | --- |
| Detection | Wazuh 4.x | Collects logs, decodes them, fires alerts on suspicious patterns |
| Case management | DFIR-IRIS | Where analysts triage and resolve alerts |
| Integration | SOC Integrator (FastAPI) | The middleware that ties everything together |
| Threat intel | VirusTotal + AbuseIPDB | Enriches alerts with IOC reputation data |
| Paging | PagerDuty | Wakes up the on-call analyst at 3 AM |
| Automation | Shuffle SOAR | Workflow automation for IOCs |
| Log sources | FortiGate, Windows AD, VMware ESXi, Sysmon | The systems we are actually monitoring |

Here is how the pieces talk to each other:

flowchart TD
    A["FortiGate / Windows AD / VMware / Sysmon"] -->|"syslog / agent"| B["Wazuh Manager"]
    B -->|"decoded events + alerts"| C["Wazuh Indexer"]
    C -->|"poll every 5s"| D["SOC Integrator (FastAPI)"]
    D -->|"enrich"| E["VirusTotal + AbuseIPDB"]
    D -->|"create alerts"| F["DFIR-IRIS"]
    D -->|"page on-call"| G["PagerDuty"]
    D -->|"trigger workflow"| H["Shuffle SOAR"]
    F -->|"analysts triage"| I["KPI Dashboard"]

The thing nobody tells you about open-source SOC tooling: the tools themselves are excellent, but they are not designed to work as a single product. The integrator is the layer that makes it feel like one.


Part 1 — Wazuh: from generic detection to your environment

Wazuh ships with about 3,500 rules out of the box. They cover the obvious things: brute force attempts, common malware, known-bad processes. But generic rules catch generic threats. To detect what actually matters in your environment — your privileged accounts, your IP ranges, your unusual patterns — you have to write rules of your own.

How we organize custom rules

We split rules by use-case appendix, mirroring the detection design document the customer signed off on:

wazuh_cluster/rules/
  soc-a1-ioc-rules.xml          # DNS / threat-intel hits
  soc-a2-fortigate-fw-rules.xml # Firewall events
  soc-a3-fortigate-vpn-rules.xml # VPN tunnel events
  soc-a4-windows-ad-rules.xml   # Windows authentication
  soc-b1-vmware-rules.xml       # vCenter / ESXi
  soc-b2-logmon-rules.xml       # Log-loss monitoring
  soc-c1-c3-rules.xml           # Multi-stage correlations
  soc-ioc-cdb-rules.xml         # Threat-intel list lookups

Every rule lives in two flavours: a simulation version (safe to fire in a test environment, IDs in the 100xxx band) and a production version (tuned for real traffic, IDs in the 110xxx band). Both flavours sit in the same file so a single test session validates both.

<!-- Simulation: fires on JSON events from our test simulator -->
<rule id="100341" level="10">
  <if_sid>110350</if_sid>
  <description>A4-01 [SIM] Windows: privileged account auth failure</description>
  <group>soc_sim,a4,windows,auth,</group>
</rule>

<!-- Production: matches real Windows event log fields -->
<rule id="110341" level="10">
  <if_group>windows</if_group>
  <field name="win.eventdata.targetUserName" type="pcre2">(?i)(admin|adm_|svc_|sa_|-adm)</field>
  <field name="win.system.eventID">^4625$</field>
  <description>A4-01 [PROD] Privileged account auth failure</description>
  <group>soc_prod,a4,windows,auth,</group>
  <mitre><id>T1110</id></mitre>
</rule>

A small detail that catches everyone: the location field

When a device sends syslog directly to Wazuh (no agent installed), Wazuh sets agent.name = wazuh.manager and stores the source IP in a field called location. This is barely documented, and we lost a few hours to it before the penny dropped.

For VMware ESXi, which only supports syslog, you have to filter by IP in location:

<rule id="110401" level="12">
  <if_group>vmware</if_group>
  <field name="location" type="pcre2">^172\.16\.0\.(107|108|109|110)$</field>
  <match>Login failure</match>
  <description>B1-01 [PROD] vCenter login failure</description>
  <mitre><id>T1110</id></mitre>
</rule>

Without that filter, any device sending the string "Login failure" would trigger the rule. Scoping it to four ESXi hosts kills the false positives.

Chaining to built-in rules instead of fighting them

ESXi SSH events are parsed by Wazuh’s sshd decoder, not the vmware decoder. So if_group=vmware cannot catch them. The right move is to inherit from the rule that already fires:

<rule id="110404" level="10">
  <if_sid>5715</if_sid>  <!-- built-in SSH success rule -->
  <field name="location" type="pcre2">^172\.16\.0\.(107|108|109|110)$</field>
  <description>B1-04 [PROD] ESXi SSH success — verify authorized</description>
  <mitre><id>T1021.004</id></mitre>
</rule>

Rule 5715 has already done the parsing. We just narrow it to our hosts.

Threat intel via CDB lists

Wazuh’s CDB (Constant Database) lookup checks a field against a compiled flat file with constant-time hash lookups. We maintain three lists — malicious IPs, malicious domains, malware hashes — and the integrator refreshes them every 4 hours from VirusTotal and AbuseIPDB, only including IOCs with confirmed hits in the past 30 days.

<rule id="110600" level="13">
  <if_sid>22101</if_sid>
  <list field="data.srcip" lookup="match_key">etc/lists/malicious-ioc/malicious-ip</list>
  <description>FortiGate source IP matched threat-intel list</description>
  <mitre><id>T1071</id></mitre>
</rule>
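
The list files themselves are plain text, one key:value pair per line, which Wazuh compiles into a CDB on restart or reload. A minimal sketch of the refresh step the integrator performs (the path and the value strings here are illustrative, not the exact production format):

from pathlib import Path

LIST_PATH = Path("/var/ossec/etc/lists/malicious-ioc/malicious-ip")  # illustrative path

def write_ip_list(iocs: dict[str, str]) -> None:
    """Write IOCs as CDB source lines: one 'key:value' entry per line."""
    lines = [f"{ip}:{context}" for ip, context in sorted(iocs.items())]
    LIST_PATH.write_text("\n".join(lines) + "\n")

write_ip_list({
    "198.51.100.7": "vt_malicious_12_vendors",
    "203.0.113.45": "abuseipdb_confidence_100",
})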

The Docker gotcha that cost us hours

Wazuh’s official Docker image stores rules in a named volume mounted at /var/ossec/etc. If you try to update a rule file with docker cp or sed -i, it fails silently or with "Device or resource busy." The named volume’s inode wins.

The only reliable way to edit rules in place — without restarting the container — is to write through Python’s open().write() from inside the container:

docker exec wazuh.manager python3 -c "
with open('/var/ossec/etc/rules/soc-a4-windows-ad-rules.xml', 'w') as f:
    f.write('''<group name=\"soc_mvp,...\">...</group>''')
"
docker exec wazuh.manager /var/ossec/bin/wazuh-control reload

Every CI/CD pipeline that touches Wazuh rules in production should know this.
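
If you automate this from CI, the same write-through-Python trick is easy to wrap in a small helper. A sketch, assuming the container name and rules path shown above (the helper itself is illustrative):

import subprocess
from pathlib import Path

def push_rule_file(local_path: str, container: str = "wazuh.manager") -> None:
    """Write a local rule file into the running container, then reload the manager."""
    content = Path(local_path).read_text()
    target = f"/var/ossec/etc/rules/{Path(local_path).name}"
    # Feed the file over stdin so large rule sets never hit shell-quoting limits
    script = f"import sys; open({target!r}, 'w').write(sys.stdin.read())"
    subprocess.run(["docker", "exec", "-i", container, "python3", "-c", script],
                   input=content, text=True, check=True)
    subprocess.run(["docker", "exec", container, "/var/ossec/bin/wazuh-control", "reload"],
                   check=True)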


Part 2 — The integrator: the layer that makes it a product

This is the piece that makes Wazuh, IRIS, and the other tools feel like one system instead of four. It is a FastAPI service. It does five things.

1. Alert synchronization

Every 5 seconds, the integrator queries Wazuh’s indexer for new events matching production rule IDs (rule.id:[110301 TO 110602]), normalizes them, and creates corresponding alerts in IRIS. Deduplication uses a composite key so the same event never floods the queue twice.
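
How that key might look: a minimal sketch, assuming the normalized event carries the rule ID, the agent name, and the indexer document ID (field names are illustrative):

import hashlib

def dedup_key(event: dict) -> str:
    """Composite key: the same Wazuh event never creates a second IRIS alert."""
    raw = "|".join([
        str(event.get("rule_id", "")),
        str(event.get("agent_name", "")),
        str(event.get("wazuh_doc_id", "")),  # _id of the document in the Wazuh indexer
    ])
    return hashlib.sha256(raw.encode()).hexdigest()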

2. IOC enrichment

For threat-intel rules, the integrator calls VirusTotal and AbuseIPDB before creating the IRIS alert. The analyst opens an alert and immediately sees: confidence score, threat category, last-seen date. No tab-switching, no copy-paste.
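
Hitting both providers concurrently is what keeps the enrichment step cheap. A sketch with httpx and asyncio.gather, using the public VirusTotal v3 and AbuseIPDB v2 endpoints (retry logic and rate limiting omitted):

import asyncio

import httpx

async def enrich_ip(ip: str, vt_key: str, abuse_key: str) -> dict:
    """Query VirusTotal and AbuseIPDB for one IP in parallel."""
    async with httpx.AsyncClient(timeout=10) as client:
        vt, abuse = await asyncio.gather(
            client.get(f"https://www.virustotal.com/api/v3/ip_addresses/{ip}",
                       headers={"x-apikey": vt_key}),
            client.get("https://api.abuseipdb.com/api/v2/check",
                       params={"ipAddress": ip, "maxAgeInDays": 30},
                       headers={"Key": abuse_key, "Accept": "application/json"}),
        )
    return {"virustotal": vt.json(), "abuseipdb": abuse.json()}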

3. Multi-stage correlation

Some attacks only become visible when you look at multiple events together. Wazuh rules are stateless, so we run a correlation engine in PostgreSQL. The classic example is impossible travel — two VPN logins for the same user from cities physically too far apart to reach in the elapsed time:

async def _detect_impossible_travel(self, event: dict) -> dict | None:
    # ... look up the previous login for this user ...
    distance_km = haversine(loc1, loc2)
    min_hours = distance_km / 900  # max commercial flight speed
    actual_hours = (ts2 - ts1).total_seconds() / 3600
    if actual_hours < min_hours:
        return {"confirmed": True, "distance_km": distance_km, ...}

A login from Bangkok at 10:00 followed by one from Frankfurt at 11:30 is physically impossible. The integrator confirms the C1 detection and creates a high-severity IRIS alert with both login events attached.
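
The distance function itself is ordinary spherical geometry. A self-contained sketch of a haversine helper consistent with the snippet above (the 900 km/h ceiling is the same assumption):

from math import asin, cos, radians, sin, sqrt

def haversine(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*loc1, *loc2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# Bangkok -> Frankfurt is roughly 9,000 km; at 900 km/h that needs about 10 hours
print(round(haversine((13.76, 100.50), (50.11, 8.68))))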

4. Group deduplication

A single password-spray attack can generate thousands of Windows 4625 events in an hour. If each one becomes an IRIS alert, the queue is useless. So we group by (rule_id, user, host) with a per-rule cooldown:

_GROUP_DEDUP_RULES: dict[str, int] = {
    "110341": 2,   # privileged account auth fail — 2h
    "110342": 2,   # service account auth fail — 2h
    "110344": 4,   # auth fail from public IP — 4h
    "110347": 4,   # runas / privilege impersonation — 4h
    "110359": 1,   # password spray base — 1h
}

The first event creates one IRIS alert. Subsequent events within the cooldown update last_seen and are logged for forensic detail, but do not duplicate the alert. Crucially, the suppression logic is explicit and auditable — you can see exactly which events were suppressed and why.
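
A sketch of the cooldown check, reusing the _GROUP_DEDUP_RULES mapping above (in production this state lives in PostgreSQL rather than process memory):

from datetime import datetime, timedelta, timezone

_last_alert: dict[tuple[str, str, str], datetime] = {}  # (rule_id, user, host) -> last alert time

def should_create_alert(rule_id: str, user: str, host: str) -> bool:
    """Create at most one IRIS alert per group within the rule's cooldown window."""
    cooldown_h = _GROUP_DEDUP_RULES.get(rule_id)
    if cooldown_h is None:
        return True  # rule is not group-deduplicated
    key = (rule_id, user, host)
    now = datetime.now(timezone.utc)
    last = _last_alert.get(key)
    if last is not None and now - last < timedelta(hours=cooldown_h):
        return False  # inside the cooldown: update last_seen and log, but no new alert
    _last_alert[key] = now
    return True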

5. KPI tracking with SLAs

Every alert carries an SLA timer driven by its severity:

| Severity | SLA target |
| --- | --- |
| Critical | 1 hour |
| High | 4 hours |
| Medium | 8 hours |
| Low | 24 hours |

The IRIS dashboard shows a live progress bar per alert. Green within SLA, amber over 75%, red on breach. Management gets honest numbers; analysts see what is actually urgent.
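
The timer logic is small once the severity mapping exists. A sketch mirroring the thresholds above (function and field names are illustrative):

from datetime import datetime, timedelta, timezone

SLA_HOURS = {"critical": 1, "high": 4, "medium": 8, "low": 24}

def sla_status(created_at: datetime, severity: str) -> tuple[float, str]:
    """Return (fraction of the SLA consumed, progress-bar colour) for one alert."""
    elapsed = datetime.now(timezone.utc) - created_at
    fraction = elapsed / timedelta(hours=SLA_HOURS[severity])
    colour = "green" if fraction <= 0.75 else "amber" if fraction <= 1.0 else "red"
    return fraction, colour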

Each alert also includes a structured triage guideline loaded from a YAML file at startup:

guidelines:
  110346:
    use_case: "A4-06 — Auth success from public IP"
    steps:
      - "Identify user and source IP immediately"
      - "Verify if expected (remote work, travel)"
      - "Check for subsequent privileged actions"
      - "If unauthorized: force logout, reset credentials, block IP"

The analyst opens the alert and sees the playbook. No context switch to a runbook wiki.

Why a separate service instead of Wazuh’s built-in integrations?

Wazuh has an "active response" framework that can run scripts on rule matches. We didn’t use it. Four reasons:

  1. State — correlation needs history. Active-response scripts are stateless.
  2. Async I/O — calling VirusTotal and AbuseIPDB in parallel is trivial in async Python and painful in shell.
  3. Testability — the integrator has a real REST API. We can replay any event with POST /monitor/wazuh/ingest and inspect every decision; see the sketch after this list.
  4. Decoupling — IRIS, PagerDuty, Shuffle are all swappable. Wazuh knows nothing about any of them.
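
That replay endpoint is what makes the whole pipeline testable end to end. A sketch of a replay call, assuming the integrator accepts a raw Wazuh alert document on that route (host, port, and payload fields are illustrative):

import httpx

event = {
    "rule": {"id": "110341", "level": 10, "description": "A4-01 [PROD] Privileged account auth failure"},
    "agent": {"name": "DC01"},
    "data": {"win": {"system": {"eventID": "4625"},
                     "eventdata": {"targetUserName": "adm_backup"}}},
}

# Replay the event through the integrator and inspect every decision it made
resp = httpx.post("http://soc-integrator:8000/monitor/wazuh/ingest", json=event, timeout=10)
print(resp.status_code, resp.json())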

The 27-second dashboard problem

Early in production, the IRIS KPI dashboard took 27 seconds to load. Direct calls to the endpoint were fast (128 ms). Through the proxy chain — slow.

Root cause: every database call used psycopg.connect(), which is a synchronous, blocking call (~34 ms each, mostly TCP setup and auth). Our auto-sync processes about 437 events every 5 seconds, with 2 DB calls each:

437 events × 2 DB calls × 34 ms = ~30 seconds of event-loop starvation

Any HTTP request arriving during that window queued up behind the sync. The fix was to replace per-call connections with a persistent pool:

from contextlib import contextmanager

import psycopg
from psycopg_pool import ConnectionPool

# Before: opens a new TCP connection every time
@contextmanager
def get_conn():
    with psycopg.connect(db_dsn(), ...) as conn:
        yield conn

# After: borrows from a pool — about 1 ms per call
_pool: ConnectionPool | None = None

def init_pool() -> None:
    global _pool
    _pool = ConnectionPool(db_dsn(), min_size=2, max_size=10, ...)

@contextmanager
def get_conn():
    with _pool.connection() as conn:
        yield conn

Dashboard load time: 27 s → 0.2 s. The lesson: in any FastAPI service that does per-request database work, use an async pool from day one.
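
Here is what that async pool might look like with psycopg_pool.AsyncConnectionPool (the DSN, table name, and pool sizes are illustrative):

from psycopg_pool import AsyncConnectionPool

pool = AsyncConnectionPool("postgresql://soc@db/soc", min_size=2, max_size=10, open=False)

async def recent_alert_count() -> int:
    """Borrow a pooled connection without blocking the event loop."""
    async with pool.connection() as conn:
        cur = await conn.execute(
            "SELECT count(*) FROM alerts WHERE created_at > now() - interval '1 hour'")
        row = await cur.fetchone()
        return row[0]

# At startup (e.g. in a FastAPI lifespan handler): await pool.open()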

Closing the feedback loop

When the integrator confirms a multi-stage detection (like impossible travel), it sends a structured syslog message back to Wazuh. Wazuh has a rule that matches the message and fires a level-15 alert in its own dashboard. So analysts working in either Wazuh or IRIS see the same event. Neither tool becomes the "main" view.
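
The feedback message is nothing exotic, just structured syslog over UDP. A sketch of what the emitter might look like (the JSON shape, program name, and port are assumptions; the matching Wazuh rule keys on whatever format you choose):

import json
import socket
from datetime import datetime, timezone

def notify_wazuh(detection: dict, manager: str = "wazuh.manager", port: int = 514) -> None:
    """Emit a structured syslog line that a custom Wazuh rule can match and re-alert on."""
    ts = datetime.now(timezone.utc).strftime("%b %d %H:%M:%S")
    msg = f"<134>{ts} soc-integrator correlator: {json.dumps(detection)}"  # <134> = local0.info
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(msg.encode(), (manager, port))

notify_wazuh({"use_case": "C1", "detection": "impossible_travel", "user": "j.doe", "confirmed": True})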


Part 3 — DFIR-IRIS: customize without forking

DFIR-IRIS is solid out of the box — alerts, cases, IOCs, timelines, reporting. We extended it in two ways without touching the upstream code beyond a single line.

A KPI dashboard as a Flask blueprint

IRIS is a Flask app, so adding a custom page is straightforward: write a blueprint, register it once, and you are done.

@kpi_dashboard_blueprint.route('/kpi-dashboard')
@ac_requires(no_cid_required=True)
def kpi_dashboard(caseid, url_redir):
    return render_template('kpi_dashboard.html', csrf_token=generate_csrf())

@kpi_dashboard_blueprint.route('/kpi-dashboard/api/alerts')
@ac_api_requires(Permissions.alerts_read)
def proxy_list_alerts():
    content, status, _ = _soc_get('/iris/alerts', request.args)
    return Response(content, status=status, content_type='application/json')

The frontend is Alpine.js — small, no build step required, polls the proxy endpoint and renders the live alert table with SLA timers and one-click assignment.

Structured alert notes

Instead of free-form text, every IRIS alert carries a JSON note with a fixed shape:

{
  "rule": {
    "id": "110344",
    "description": "Windows auth failure from public IP",
    "mitre": ["T1110"]
  },
  "asset": {
    "hostname": "FPFTPSRV02",
    "ip": "172.16.10.50",
    "os": "Windows"
  },
  "network": {
    "src_ip": "91.202.x.x",
    "dst_ip": "172.16.10.50",
    "protocol": "TCP"
  },
  "guideline": {
    "use_case": "Auth fail from public IP",
    "steps": ["Identify source IP and geolocation", "..."]
  }
}

The note is human-readable in the UI and machine-readable for filters and reports. Analysts can search by rule.id or asset.hostname; managers can group by MITRE technique for trend reports.

Modifying IRIS without forking it

The temptation is always to fork the project, modify the source, ship. The cost shows up six months later when upstream releases a new version and your fork has drifted too far to merge.

Our approach:

  • All custom blueprints live in a directory bind-mounted into the container
  • The frontend is compiled separately and injected as a static bundle
  • One line in the upstream __init__.py registers our blueprint

That single line is the only thing we re-apply when IRIS releases a new version. Everything else is additive.
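
For the curious, that line is ordinary Flask blueprint registration, something along these lines (the module path is illustrative):

# the only change we re-apply after an IRIS upgrade, inside the upstream app factory
from kpi_dashboard.views import kpi_dashboard_blueprint  # our bind-mounted package
app.register_blueprint(kpi_dashboard_blueprint)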


Part 4 — Operational lessons

Disk usage will surprise you

logall=yes in the Wazuh config archives every decoded event, not just the ones that fire alerts. It is great for forensic deep-dives and terrible for disk space. A routine VMware investigation pushed our disk to 86 % utilization. We now keep it off by default and switch it on only for targeted investigations:

<global>
  <logall>no</logall>
  <logall_json>no</logall_json>
</global>

Coverage versus alert fatigue is a tunable, not a tradeoff

The temptation is to suppress noisy rules. The better answer is the group-deduplication pattern above: every event still gets logged, but only the first per cooldown window creates an alert. You keep coverage, you keep auditability, and the analyst queue stays usable.

Monitor your own SOC

The integrator pings its own dependencies — Wazuh manager, Wazuh indexer, IRIS, Shuffle, PagerDuty — every 2 minutes. If anything is down, it creates an IRIS alert in a system-health category. Analysts see infrastructure failures in the same queue as security alerts. No second monitoring tool needed.
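
The check itself is a small asyncio loop inside the integrator. A sketch, assuming each dependency exposes a reachable HTTP endpoint (URLs are illustrative; the print stands in for the real IRIS call):

import asyncio

import httpx

DEPENDENCIES = {  # illustrative base URLs; point these at whatever health endpoint each service exposes
    "wazuh-manager": "https://wazuh.manager:55000",
    "wazuh-indexer": "https://wazuh.indexer:9200",
    "dfir-iris": "https://iris",
    "shuffle": "https://shuffle:3443",
}

async def health_loop() -> None:
    """Ping every dependency every 2 minutes; failures become system-health alerts."""
    async with httpx.AsyncClient(verify=False, timeout=5) as client:
        while True:
            for name, url in DEPENDENCIES.items():
                try:
                    (await client.get(url)).raise_for_status()
                except httpx.HTTPError:
                    # in production this creates an IRIS alert in the system-health category
                    print(f"dependency down: {name}")
            await asyncio.sleep(120)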

Email is still the ground truth notification

Critical IRIS alerts trigger an email to the SOC team and a backup mailbox with the alert title, severity, asset, and a direct link. Dashboards fail. Inboxes do not.


The numbers

| Metric | Value |
| --- | --- |
| Custom Wazuh rules | 86 across 8 files |
| Production rules firing | 17 of 64 |
| IRIS alerts processed | 91,000+ |
| Auto-sync cycle | every 5 seconds |
| IOC list refresh | every 4 hours |
| KPI dashboard load (before fix) | 27 seconds |
| KPI dashboard load (after fix) | 0.2 seconds |
| DB call overhead (before / after) | 34 ms / under 1 ms |

What we would do differently

Async database from day one. Synchronous psycopg.connect() in a FastAPI app is a latent performance trap. Start with psycopg_pool.AsyncConnectionPool before the first repository file is written.

Wazuh rule testing in CI. Wazuh’s logtest API accepts raw log lines and returns which rule fires. Wrap that in pytest fixtures and you catch rule conflicts before they hit production.
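
A sketch of what that could look like, assuming the Wazuh API's PUT /logtest endpoint (the credentials, the sample log line, and the expected rule ID are placeholders; check your Wazuh version's API reference for exact field names):

import httpx
import pytest

WAZUH_API = "https://wazuh.manager:55000"  # illustrative

@pytest.fixture(scope="session")
def token() -> str:
    # Wazuh 4.x: basic-auth POST returns a bearer token
    resp = httpx.post(f"{WAZUH_API}/security/user/authenticate",
                      auth=("wazuh-wui", "CHANGE_ME"), verify=False)
    resp.raise_for_status()
    return resp.json()["data"]["token"]

def logtest(token: str, event: str, location: str, log_format: str = "syslog") -> dict:
    """Replay one raw log line through PUT /logtest and return the decoded output."""
    resp = httpx.put(f"{WAZUH_API}/logtest",
                     json={"event": event, "log_format": log_format, "location": location},
                     headers={"Authorization": f"Bearer {token}"}, verify=False)
    resp.raise_for_status()
    return resp.json()["data"]["output"]

def test_esxi_login_failure_rule(token):
    # placeholder log line; in CI we would replay lines captured from the real sources
    output = logtest(token, "esx01 Hostd: Rejected password for user root: Login failure", "172.16.0.107")
    assert output["rule"]["id"] == "110401"  # adjust to the rule you expect to fire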

Separate ports for simulation and production logs. We multiplexed both onto syslog 514 and ended up with the dual-profile rule split. A second port would have been simpler.


The takeaway

A production SOC built entirely on open-source software is not only possible, it is genuinely good engineering. But the open-source pieces alone do not constitute a product. The integration layer — the part nobody markets — is where the actual SOC lives.

If you are evaluating commercial SIEM and the licensing math is making you uncomfortable, this stack is a real alternative. The hard part is not the tools. The hard part is the glue.


At Simplico we build production security systems on open-source foundations. If you are designing a SOC, modernizing detection, or trying to escape SIEM licensing costs, get in touch.

