How We Built a Real Security Operations Center With Open-Source Tools
A behind-the-scenes look at how we built a production SOC for a mid-sized enterprise using Wazuh, DFIR-IRIS, and a custom Python middleware — what worked, what broke, and the engineering decisions that actually mattered.
If you have ever priced a commercial SIEM or SOC platform, you already know the punch line: licensing alone can dwarf the salaries of the analysts who use it. The promise of open-source SOC tooling is obvious. The reality — making those tools behave like a coherent product — is where most projects stall.
This post is the field report. No SaaS. No cloud lock-in. Everything runs on Docker on the customer’s own servers, and total software cost is zero.
The stack at a glance
| Layer | Tool | What it does |
|---|---|---|
| Detection | Wazuh 4.x | Collects logs, decodes them, fires alerts on suspicious patterns |
| Case management | DFIR-IRIS | Where analysts triage and resolve alerts |
| Integration | SOC Integrator (FastAPI) | The middleware that ties everything together |
| Threat intel | VirusTotal + AbuseIPDB | Enriches alerts with IOC reputation data |
| Paging | PagerDuty | Wakes up the on-call analyst at 3 AM |
| Automation | Shuffle SOAR | Workflow automation for IOCs |
| Log sources | FortiGate, Windows AD, VMware ESXi, Sysmon | The systems we are actually monitoring |
Here is how the pieces talk to each other:
```mermaid
flowchart TD
    A["FortiGate / Windows AD / VMware / Sysmon"] -->|"syslog / agent"| B["Wazuh Manager"]
    B -->|"decoded events + alerts"| C["Wazuh Indexer"]
    C -->|"poll every 5s"| D["SOC Integrator (FastAPI)"]
    D -->|"enrich"| E["VirusTotal + AbuseIPDB"]
    D -->|"create alerts"| F["DFIR-IRIS"]
    D -->|"page on-call"| G["PagerDuty"]
    D -->|"trigger workflow"| H["Shuffle SOAR"]
    F -->|"analysts triage"| I["KPI Dashboard"]
```
The thing nobody tells you about open-source SOC tooling: the tools themselves are excellent, but they are not designed to work as a single product. The integrator is the layer that makes it feel like one.
Part 1 — Wazuh: from generic detection to your environment
Wazuh ships with about 3,500 rules out of the box. They cover the obvious things: brute force attempts, common malware, known-bad processes. But generic rules catch generic threats. To detect what actually matters in your environment — your privileged accounts, your IP ranges, your unusual patterns — you have to write rules of your own.
How we organize custom rules
We split rules by use-case appendix, mirroring the detection design document the customer signed off on:
wazuh_cluster/rules/
soc-a1-ioc-rules.xml # DNS / threat-intel hits
soc-a2-fortigate-fw-rules.xml # Firewall events
soc-a3-fortigate-vpn-rules.xml # VPN tunnel events
soc-a4-windows-ad-rules.xml # Windows authentication
soc-b1-vmware-rules.xml # vCenter / ESXi
soc-b2-logmon-rules.xml # Log-loss monitoring
soc-c1-c3-rules.xml # Multi-stage correlations
soc-ioc-cdb-rules.xml # Threat-intel list lookups
Every rule lives in two flavours: a simulation version (safe to fire in a test environment, IDs in the 100xxx band) and a production version (tuned for real traffic, IDs in the 110xxx band). Both flavours sit in the same file so a single test session validates both.
<!-- Simulation: fires on JSON events from our test simulator -->
<rule id="100341" level="10">
<if_sid>110350</if_sid>
<description>A4-01 [SIM] Windows: privileged account auth failure</description>
<group>soc_sim,a4,windows,auth,</group>
</rule>
<!-- Production: matches real Windows event log fields -->
<rule id="110341" level="10">
<if_group>windows</if_group>
<field name="win.eventdata.targetUserName" type="pcre2">(?i)(admin|adm_|svc_|sa_|-adm)</field>
<field name="win.system.eventID">^4625$</field>
<description>A4-01 [PROD] Privileged account auth failure</description>
<group>soc_prod,a4,windows,auth,</group>
<mitre><id>T1110</id></mitre>
</rule>
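The two ID bands keep every sim rule paired with its production twin, and the pairing is mechanical. A minimal sketch of the convention (the band offsets come from our numbering scheme above; the helper name is hypothetical):

```python
SIM_BASE = 100_000   # simulation rule IDs live in the 100xxx band
PROD_BASE = 110_000  # production rule IDs live in the 110xxx band

def sim_to_prod(sim_id: int) -> int:
    """Map a simulation rule ID to its production twin (100341 -> 110341)."""
    if not SIM_BASE <= sim_id < SIM_BASE + 10_000:
        raise ValueError(f"{sim_id} is not in the simulation band")
    return sim_id - SIM_BASE + PROD_BASE
```

A helper like this is handy in test tooling: fire the sim rule, then assert the prod twin exists and carries the same use-case tag.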
A small detail that catches everyone: the location field
When a device sends syslog directly to Wazuh (no agent installed), Wazuh sets agent.name = wazuh.manager and stores the source IP in a field called location. This is barely documented, and we lost a few hours to it before the penny dropped.
For VMware ESXi, which only supports syslog, you have to filter by IP in location:
<rule id="110401" level="12">
<if_group>vmware</if_group>
<field name="location" type="pcre2">^172\.16\.0\.(107|108|109|110)$</field>
<match>Login failure</match>
<description>B1-01 [PROD] vCenter login failure</description>
<mitre><id>T1110</id></mitre>
</rule>
Without that filter, any device sending the string "Login failure" would trigger the rule. Scoping it to four ESXi hosts kills the false positives.
Chaining to built-in rules instead of fighting them
ESXi SSH events are parsed by Wazuh’s sshd decoder, not the vmware decoder. So if_group=vmware cannot catch them. The right move is to inherit from the rule that already fires:
<rule id="110404" level="10">
<if_sid>5715</if_sid> <!-- built-in SSH success rule -->
<field name="location" type="pcre2">^172\.16\.0\.(107|108|109|110)$</field>
<description>B1-04 [PROD] ESXi SSH success — verify authorized</description>
<mitre><id>T1021.004</id></mitre>
</rule>
Rule 5715 has already done the parsing. We just narrow it to our hosts.
Threat intel via CDB lists
Wazuh’s CDB (Constant Database) lookup checks a field against a compiled flat-file database in effectively constant time. We maintain three lists — malicious IPs, malicious domains, malware hashes — and the integrator refreshes them every 4 hours from VirusTotal and AbuseIPDB, including only IOCs with confirmed hits in the past 30 days.
<rule id="110600" level="13">
<if_sid>22101</if_sid>
<list field="data.srcip" lookup="match_key">etc/lists/malicious-ioc/malicious-ip</list>
<description>FortiGate source IP matched threat-intel list</description>
<mitre><id>T1071</id></mitre>
</rule>
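The list source files themselves are plain text, one key per line, which Wazuh compiles into a binary .cdb. A sketch of how a refresh job might regenerate one (function name and path are illustrative, not from our codebase):

```python
from pathlib import Path

def write_cdb_source(iocs: list[str], path: str) -> list[str]:
    """Write a Wazuh CDB source file: one 'key:' line per IOC.

    Dedupe and sort so repeated refreshes produce stable diffs;
    the manager compiles this file into a .cdb on reload.
    """
    lines = [f"{ioc}:" for ioc in sorted(set(iocs))]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

Keeping the source files sorted makes the 4-hour refresh auditable: a `git diff` of the list directory shows exactly which IOCs entered or left the feed.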
The Docker gotcha that cost us hours
Wazuh’s official Docker image stores rules in a named volume mounted at /var/ossec/etc. If you try to update a rule file with docker cp or sed -i, it fails silently or with "Device or resource busy." The named volume’s inode wins.
The only reliable way to edit rules in place — without restarting the container — is to write through Python’s open().write() from inside the container:
docker exec wazuh.manager python3 -c "
with open('/var/ossec/etc/rules/soc-a4-windows-ad-rules.xml', 'w') as f:
    f.write('''<group name=\"soc_mvp,...\">...</group>''')
"
docker exec wazuh.manager /var/ossec/bin/wazuh-control reload
Every CI/CD pipeline that touches Wazuh rules in production should know this.
Part 2 — The integrator: the layer that makes it a product
This is the piece that makes Wazuh, IRIS, and the other tools feel like one system instead of four. It is a FastAPI service. It does five things.
1. Alert synchronization
Every 5 seconds, the integrator queries Wazuh’s indexer for new events matching production rule IDs (rule.id:[110301 TO 110602]), normalizes them, and creates corresponding alerts in IRIS. Deduplication uses a composite key so the same event never floods the queue twice.
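The exact fields in our composite key are specific to our schema, but the shape of the idea is simple. A hypothetical sketch (the field choice here is an assumption, not our production key):

```python
import hashlib

def dedup_key(event: dict) -> str:
    """Composite dedup key: the same Wazuh event always hashes to the
    same value, so re-polling the indexer can never create a second
    IRIS alert for an event we have already synced."""
    raw = "|".join([
        str(event["rule"]["id"]),
        event["agent"]["name"],
        event["id"],  # the indexer document's own ID
    ])
    return hashlib.sha256(raw.encode()).hexdigest()
```

The key is stored alongside the created IRIS alert ID; an incoming event whose key already exists is skipped before any API call is made.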
2. IOC enrichment
For threat-intel rules, the integrator calls VirusTotal and AbuseIPDB before creating the IRIS alert. The analyst opens an alert and immediately sees: confidence score, threat category, last-seen date. No tab-switching, no copy-paste.
3. Multi-stage correlation
Some attacks only become visible when you look at multiple events together. Wazuh rules are stateless, so we run a correlation engine in PostgreSQL. The classic example is impossible travel — two VPN logins for the same user from cities physically too far apart to reach in the elapsed time:
async def _detect_impossible_travel(self, event: dict) -> dict | None:
    # ... look up the previous login for this user ...
    distance_km = haversine(loc1, loc2)
    min_hours = distance_km / 900  # max commercial flight speed
    actual_hours = (ts2 - ts1).total_seconds() / 3600
    if actual_hours < min_hours:
        return {"confirmed": True, "distance_km": distance_km, ...}
A login from Bangkok at 10:00 followed by a login from Frankfurt at 11:30 fails that check by a wide margin. The integrator confirms the C1 detection and creates a high-severity IRIS alert with both login events attached.
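The haversine helper is standard great-circle math. A self-contained version (the city coordinates below are approximate, for illustration):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371

def haversine(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*loc1, *loc2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_impossible_travel(loc1, loc2, hours_elapsed: float, max_kmh: float = 900) -> bool:
    """True if the two logins are too far apart for the elapsed time."""
    return hours_elapsed < haversine(loc1, loc2) / max_kmh

BANGKOK = (13.76, 100.50)
FRANKFURT = (50.11, 8.68)
```

Bangkok to Frankfurt is roughly 9,000 km, so at 900 km/h anything under about ten hours gets flagged; 90 minutes is not a close call.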
4. Group deduplication
A single password-spray attack can generate thousands of Windows 4625 events in an hour. If each one becomes an IRIS alert, the queue is useless. So we group by (rule_id, user, host) with a per-rule cooldown:
_GROUP_DEDUP_RULES: dict[str, int] = {
    "110341": 2,  # privileged account auth fail — 2h
    "110342": 2,  # service account auth fail — 2h
    "110344": 4,  # auth fail from public IP — 4h
    "110347": 4,  # runas / privilege impersonation — 4h
    "110359": 1,  # password spray base — 1h
}
The first event creates one IRIS alert. Subsequent events within the cooldown update last_seen and are logged for forensic detail, but do not duplicate the alert. Crucially, the suppression logic is explicit and auditable — you can see exactly which events were suppressed and why.
5. KPI tracking with SLAs
Every alert carries an SLA timer driven by its severity:
| Severity | SLA target |
|---|---|
| Critical | 1 hour |
| High | 4 hours |
| Medium | 8 hours |
| Low | 24 hours |
The IRIS dashboard shows a live progress bar per alert. Green within SLA, amber over 75%, red on breach. Management gets honest numbers; analysts see what is actually urgent.
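The traffic-light logic is small enough to show in full. A sketch with the thresholds from the table above (the function name is ours, not an IRIS API):

```python
SLA_HOURS = {"critical": 1, "high": 4, "medium": 8, "low": 24}

def sla_status(severity: str, hours_open: float) -> str:
    """Traffic-light state for the dashboard progress bar."""
    used = hours_open / SLA_HOURS[severity]
    if used >= 1.0:
        return "red"    # SLA breached
    if used > 0.75:
        return "amber"  # over 75% of the SLA budget
    return "green"      # within SLA
```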
Each alert also includes a structured triage guideline loaded from a YAML file at startup:
guidelines:
  110346:
    use_case: "A4-06 — Auth success from public IP"
    steps:
      - "Identify user and source IP immediately"
      - "Verify if expected (remote work, travel)"
      - "Check for subsequent privileged actions"
      - "If unauthorized: force logout, reset credentials, block IP"
The analyst opens the alert and sees the playbook. No context switch to a runbook wiki.
Why a separate service instead of Wazuh’s built-in integrations?
Wazuh has an "active response" framework that can run scripts on rule matches. We didn’t use it. Four reasons:
- State — correlation needs history. Active-response scripts are stateless.
- Async I/O — calling VirusTotal and AbuseIPDB in parallel is trivial in async Python and painful in shell.
- Testability — the integrator has a real REST API. We can replay any event with POST /monitor/wazuh/ingest and inspect every decision.
- Decoupling — IRIS, PagerDuty, Shuffle are all swappable. Wazuh knows nothing about any of them.
The 27-second dashboard problem
Early in production, the IRIS KPI dashboard took 27 seconds to load. Direct calls to the endpoint were fast (128 ms). Through the proxy chain — slow.
Root cause: every database call used psycopg.connect(), which is a synchronous, blocking call (~34 ms each, mostly TCP setup and auth). Our auto-sync processes about 437 events every 5 seconds, with 2 DB calls each:
437 events × 2 DB calls × 34 ms = ~30 seconds of event-loop starvation
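That back-of-the-envelope figure checks out:

```python
events_per_cycle = 437     # events processed per 5-second sync
db_calls_per_event = 2
connect_latency_s = 0.034  # measured: TCP setup + auth per psycopg.connect()

blocked_s = events_per_cycle * db_calls_per_event * connect_latency_s
# ~29.7 seconds of synchronous blocking packed into every 5-second cycle,
# so the event loop can never catch up.
```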
Any HTTP request arriving during that window queued up behind the sync. The fix was to replace per-call connections with a persistent pool:
# Before: opens a new TCP connection every time
@contextmanager
def get_conn():
    with psycopg.connect(db_dsn(), ...) as conn:
        yield conn

# After: borrows from a pool — about 1 ms per call
_pool: ConnectionPool | None = None

def init_pool() -> None:
    global _pool
    _pool = ConnectionPool(db_dsn(), min_size=2, max_size=10, ...)

@contextmanager
def get_conn():
    with _pool.connection() as conn:
        yield conn
Dashboard load time: 27 s → 0.2 s. The lesson: in any FastAPI service that does per-request database work, use a connection pool (ideally an async one) from day one.
Closing the feedback loop
When the integrator confirms a multi-stage detection (like impossible travel), it sends a structured syslog message back to Wazuh. Wazuh has a rule that matches the message and fires a level-15 alert in its own dashboard. So analysts working in either Wazuh or IRIS see the same event. Neither tool becomes the "main" view.
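Formatting that feedback message is ordinary syslog. A sketch (the soc-integrator program tag and the facility/severity choice are ours, matched on the Wazuh side by a custom decoder):

```python
import json
import socket

def format_feedback(detection: dict) -> bytes:
    """Build an RFC 3164-style syslog line carrying a JSON payload.
    PRI 134 = facility local0 (16) * 8 + severity informational (6)."""
    return f"<134>soc-integrator: {json.dumps(detection)}".encode()

def send_feedback(detection: dict, host: str = "wazuh.manager", port: int = 514) -> None:
    """Fire the message at the Wazuh manager's syslog listener over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_feedback(detection), (host, port))
```

Keeping the payload as JSON means the Wazuh decoder can pull out individual fields (user, distance, rule) instead of regex-matching free text.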
Part 3 — DFIR-IRIS: customize without forking
DFIR-IRIS is solid out of the box — alerts, cases, IOCs, timelines, reporting. We extended it in two ways without touching the upstream code beyond a single line.
A KPI dashboard as a Flask blueprint
IRIS is a Flask app, so adding a custom page is straightforward: write a blueprint, register it once, and you are done.
@kpi_dashboard_blueprint.route('/kpi-dashboard')
@ac_requires(no_cid_required=True)
def kpi_dashboard(caseid, url_redir):
    return render_template('kpi_dashboard.html', csrf_token=generate_csrf())

@kpi_dashboard_blueprint.route('/kpi-dashboard/api/alerts')
@ac_api_requires(Permissions.alerts_read)
def proxy_list_alerts():
    content, status, _ = _soc_get('/iris/alerts', request.args)
    return Response(content, status=status, content_type='application/json')
The frontend is Alpine.js — small, no build step required, polls the proxy endpoint and renders the live alert table with SLA timers and one-click assignment.
Structured alert notes
Instead of free-form text, every IRIS alert carries a JSON note with a fixed shape:
{
  "rule": {
    "id": "110344",
    "description": "Windows auth failure from public IP",
    "mitre": ["T1110"]
  },
  "asset": {
    "hostname": "FPFTPSRV02",
    "ip": "172.16.10.50",
    "os": "Windows"
  },
  "network": {
    "src_ip": "91.202.x.x",
    "dst_ip": "172.16.10.50",
    "protocol": "TCP"
  },
  "guideline": {
    "use_case": "Auth fail from public IP",
    "steps": ["Identify source IP and geolocation", "..."]
  }
}
The note is human-readable in the UI and machine-readable for filters and reports. Analysts can search by rule.id or asset.hostname; managers can group by MITRE technique for trend reports.
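Because the shape is fixed, building the note is a one-liner plus a sanity check. A minimal sketch (the helper name is ours, not part of IRIS):

```python
import json

NOTE_SECTIONS = ("rule", "asset", "network", "guideline")

def build_alert_note(rule: dict, asset: dict, network: dict, guideline: dict) -> str:
    """Serialize the structured note attached to every IRIS alert."""
    note = {"rule": rule, "asset": asset, "network": network, "guideline": guideline}
    missing = [s for s in NOTE_SECTIONS if not note[s]]
    if missing:
        raise ValueError(f"note is missing sections: {missing}")
    return json.dumps(note, indent=2)
```

Failing loudly on a missing section at creation time is what keeps the downstream filters and reports trustworthy.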
Modifying IRIS without forking it
The temptation is always to fork the project, modify the source, ship. The cost shows up six months later when upstream releases a new version and your fork has drifted too far to merge.
Our approach:
- All custom blueprints live in a directory bind-mounted into the container
- The frontend is compiled separately and injected as a static bundle
- One line in the upstream __init__.py registers our blueprint
That single line is the only thing we re-apply when IRIS releases a new version. Everything else is additive.
Part 4 — Operational lessons
Disk usage will surprise you
logall=yes in the Wazuh config archives every decoded event, not just the ones that fire alerts. It is great for forensic deep-dives and terrible for disk space. A routine VMware investigation pushed our disk to 86% utilization. We now keep it off by default and switch it on only for targeted investigations:
<global>
<logall>no</logall>
<logall_json>no</logall_json>
</global>
Coverage versus alert fatigue is a tunable, not a tradeoff
The temptation is to suppress noisy rules. The better answer is the group-deduplication pattern above: every event still gets logged, but only the first per cooldown window creates an alert. You keep coverage, you keep auditability, and the analyst queue stays usable.
Monitor your own SOC
The integrator pings its own dependencies — Wazuh manager, Wazuh indexer, IRIS, Shuffle, PagerDuty — every 2 minutes. If anything is down, it creates an IRIS alert in a system-health category. Analysts see infrastructure failures in the same queue as security alerts. No second monitoring tool needed.
Email is still the ground truth notification
Critical IRIS alerts trigger an email to the SOC team and a backup mailbox with the alert title, severity, asset, and a direct link. Dashboards fail. Inboxes do not.
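The notification itself is plain stdlib email. A sketch (addresses are placeholders, not our real mailboxes):

```python
from email.message import EmailMessage

def build_critical_alert_email(title: str, severity: str, asset: str, link: str) -> EmailMessage:
    """Build the notification sent for critical IRIS alerts."""
    msg = EmailMessage()
    msg["Subject"] = f"[SOC {severity.upper()}] {title}"
    msg["From"] = "soc-integrator@example.com"                   # placeholder sender
    msg["To"] = "soc-team@example.com, soc-backup@example.com"   # team + backup mailbox
    msg.set_content(f"Severity: {severity}\nAsset: {asset}\nAlert: {link}\n")
    return msg
```

In production the message goes out via a standard smtplib.SMTP send_message call; the point is that the whole pipeline has no dependency beyond a reachable mail relay.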
The numbers
| Metric | Value |
|---|---|
| Custom Wazuh rules | 86 across 8 files |
| Production rules firing | 17 of 64 |
| IRIS alerts processed | 91,000+ |
| Auto-sync cycle | every 5 seconds |
| IOC list refresh | every 4 hours |
| KPI dashboard load (before fix) | 27 seconds |
| KPI dashboard load (after fix) | 0.2 seconds |
| DB call overhead (before / after) | 34 ms / under 1 ms |
What we would do differently
Async database from day one. Synchronous psycopg.connect() in a FastAPI app is a latent performance trap. Start with psycopg_pool.AsyncConnectionPool before the first repository file is written.
Wazuh rule testing in CI. The rule_test API accepts raw log lines and returns which rule fires. Wrap that in pytest fixtures and you catch rule conflicts before they hit production.
Separate ports for simulation and production logs. We multiplexed both onto syslog port 514 and ended up with the dual-profile rule split. A second listener port would have been simpler.
The takeaway
A production SOC built entirely on open-source software is not only possible, it is genuinely good engineering. But the open-source pieces alone do not constitute a product. The integration layer — the part nobody markets — is where the actual SOC lives.
If you are evaluating commercial SIEM and the licensing math is making you uncomfortable, this stack is a real alternative. The hard part is not the tools. The hard part is the glue.
At Simplico we build production security systems on open-source foundations. If you are designing a SOC, modernizing detection, or trying to escape SIEM licensing costs, get in touch.