April 18, 2026 · 6 min read

Anatomy of a failed signal: what AI taught us from 4 botched WTI calls in a row

methodology, transparency, ai, backtesting, oil, signals

A product that predicts markets and doesn't show its misses is lying by omission. This article does the opposite: we take our five most recent missed signals (four on WTI, one on XAU), open them up, and report what the automated autopsy revealed, including the concrete fixes we shipped right after.

The context: Iran crisis, mid-April 2026

Between 14 and 17 April, four independent rules from the GeoPulse engine fired WTI long signals over the same 48-hour window:

Rule	Predicted magnitude	Confidence
`mideast_conflict_oil`	+7.5% over 2d	75%
`oil_chokepoint` (Strait of Hormuz)	+10% over 2d	80%
`nlp_signal_wti` (Mistral analysis)	+6% over 2d	85%
`sanctions_announced`	+2.5% over 2d	65%

Each rule's logic was defensible. Major Middle East conflict → supply shock → oil up. Intense Hormuz news → anticipated blockade → extra shock. Mistral's NLP scoring agreed. Everything converged.

Reality: WTI dropped 7.5% to 8.4% over the next 48h. Four predictions, four failures. Not partial misses, not directional moves of insufficient size. The market went exactly the other way.

Rather than re-tuning in silence

Many signal systems handle this kind of run behind the scenes: you nudge a weight, you move on, the customer never hears about the errors. Our approach is the opposite — every missed signal gets a public post-mortem in the scoreboard.

Until recently, that post-mortem was a generic phrase ("the asymmetric +5%/-10% mechanism reduced the rule's weight"). Honest about the mechanism, useless as a diagnosis. Why did the signal miss? Which factor dominated? We didn't say, because we didn't really know.

As of this week, every signal triggers a call to Mistral Large when it resolves. The model receives as input: the rule, the prediction (asset, direction, magnitude, confidence, horizon), the actual move observed, the entry and exit prices, and the triggering event. It produces two things:

A primary cause from a closed enum: correct, already_priced, magnitude_too_ambitious, wrong_direction, regime_mismatch, peripheral_event, noise, unknown
A 2–3 sentence analysis justifying the verdict with concrete numbers

The result is stored in the database, never recomputed, and shown in the scoreboard in place of the old generic phrase.
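To make the shape of that verdict concrete, here is a minimal sketch of how the closed enum and the stored record could be modelled. The enum values come from the list above; the class names, the field names, the `parse_autopsy` helper and the choice to fall back to `unknown` on an unexpected value are illustrative assumptions, not a description of the production code.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(str, Enum):
    """Closed set of primary causes, as listed in the article."""
    CORRECT = "correct"
    ALREADY_PRICED = "already_priced"
    MAGNITUDE_TOO_AMBITIOUS = "magnitude_too_ambitious"
    WRONG_DIRECTION = "wrong_direction"
    REGIME_MISMATCH = "regime_mismatch"
    PERIPHERAL_EVENT = "peripheral_event"
    NOISE = "noise"
    UNKNOWN = "unknown"


@dataclass(frozen=True)  # frozen: stored once, never recomputed
class Autopsy:
    rule: str
    cause: Cause
    analysis: str


def parse_autopsy(rule: str, raw: dict) -> Autopsy:
    """Validate a model reply (hypothetical dict with 'cause' and 'analysis').

    ASSUMPTION: any value outside the enum is mapped to UNKNOWN
    instead of being rejected.
    """
    try:
        cause = Cause(raw.get("cause", ""))
    except ValueError:
        cause = Cause.UNKNOWN
    analysis = str(raw.get("analysis", "")).strip()
    return Autopsy(rule=rule, cause=cause, analysis=analysis)
```

Keeping the cause in a closed enum is what makes the verdicts countable later: free-text diagnoses could not be grouped by rule or by asset.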

What the autopsies said about our 5 misses

Here are the verdicts for the five WTI/XAU signals missed in the Iran window:

Signal	AI cause	Reading
`mideast_conflict_oil` (WTI ↑7.5%)	Wrong direction	Market went -8.4% instead of the +7.5% predicted
`oil_chokepoint` (WTI ↑10%)	Wrong direction	Same, -8.4% vs +10%
`nlp_signal_wti` (WTI ↑6%)	Wrong direction	-8.2% vs +6%
`sanctions_announced` (WTI ↑2.5%)	Wrong direction	-7.5% vs +2.5%
`nuclear_tensions` (XAU ↑4%)	Wrong direction	XAU at -1% instead of the +4% predicted

The same cause five times across independent rules is no longer noise; it's a pattern. And the reading is uncomfortable: the market had already priced in the oil risk premium well before our signals fired. The news items that triggered our rules were themselves reports of "resolution" or de-escalation, not of escalation.

Another recurring verdict in the autopsies of the other assets in the window: magnitude_too_ambitious. On five signals where direction was correct (XAU ↑, wheat ↑, copper ↑), our predictions were systematically between +4% and +7.5%, while real moves were +0.8% to +2.2%. Right in sign, wrong in amplitude.

Three concrete fixes shipped

Autopsy is only useful if it ships code. Here's what we pushed to production the same day:

1. Volatility regime filter

Our geopolitical signals were evaluated independently of the market regime. But a "Middle East → oil up" rule has a very different hit rate when VIX sits at 12 (complacency) vs 30 (generalised stress). Now, every signal sees its confidence and magnitude adjusted by a VIX-linked multiplier:

VIX < 15 → confidence ×0.85, magnitude ×0.90 (news ignored by a market in a trance)
VIX 25–35 → confidence ×0.95, magnitude ×1.05 (more receptive regime)
VIX > 35 → confidence ×0.90, magnitude ×1.15 (amplified but erratic moves)
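The three bands above can be sketched as a small lookup. The multipliers are the ones listed; the function names are hypothetical, and two points are assumptions because the article does not specify them: the 15–25 band is treated as neutral (×1.0), and a VIX of exactly 35 is placed in the 25–35 band.

```python
def vix_multipliers(vix: float) -> tuple[float, float]:
    """Return (confidence_multiplier, magnitude_multiplier) for a VIX level."""
    if vix < 15:
        return 0.85, 0.90   # complacent market
    if vix > 35:
        return 0.90, 1.15   # amplified but erratic moves
    if vix >= 25:
        return 0.95, 1.05   # more receptive regime
    return 1.0, 1.0         # ASSUMPTION: 15-25 is not specified, left unchanged


def adjust_signal(confidence: float, magnitude_pct: float, vix: float) -> tuple[float, float]:
    """Apply the regime multipliers to a signal's confidence and magnitude."""
    conf_mult, mag_mult = vix_multipliers(vix)
    return confidence * conf_mult, magnitude_pct * mag_mult
```

For example, a signal at 80% confidence and +10% magnitude evaluated with VIX at 12 would come out at roughly 68% confidence and +9% magnitude.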

2. "Already priced" detection

If four independent rules all point at the same asset in the same direction over 48h, it's probably because the theme is publicly resolving, not escalating. Now, the 2nd signal on the same asset within a 48h window has its magnitude scaled down to 85%. The 3rd to 70%. The 4th to 55%.

This isn't censoring the signal — it's recognising that its informational margin decays with each repetition.
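As an illustration, the decay schedule could look like the sketch below. The 85% / 70% / 55% factors and the 48h window are from the text above; the function name, the use of fire timestamps, and the decision to keep a 5th or later signal at 55% are assumptions.

```python
from datetime import datetime, timedelta

# Factor applied to the 1st, 2nd, 3rd and 4th signal on the same asset.
DECAY = [1.0, 0.85, 0.70, 0.55]
WINDOW = timedelta(hours=48)


def repetition_factor(prior_fires: list[datetime], now: datetime) -> float:
    """Magnitude factor for a new signal, given earlier fires on the same asset.

    ASSUMPTION: signals beyond the 4th stay at the last factor (0.55).
    """
    recent = sum(1 for t in prior_fires if timedelta(0) <= now - t <= WINDOW)
    return DECAY[min(recent, len(DECAY) - 1)]
```

Applied to the April window, the `oil_chokepoint` call at +10% would have been scaled to +8.5% if it fired second, and to +5.5% if it fired fourth.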

3. Historical volatility cap

A "WTI +10% over 2 days" prediction lives in the tail of WTI's actual 2-day return distribution; it probably comes true about 1 time in 100. Now, every predicted magnitude is capped at 1.5σ × √horizon, where σ is the daily volatility measured on 14 days of real prices and the horizon is in days. For WTI on 2d in April 2026, that puts the cap around 4.5%. Anything above gets pulled down.

Less ambitious as marketing copy. Much more defensible as statistics.
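A minimal sketch of that cap, under stated assumptions: σ is taken as the sample standard deviation of simple daily returns computed from a list of closing prices, and the function names are hypothetical. The article does not say whether returns are simple or logarithmic, nor exactly how the 14-day window is counted.

```python
import math
import statistics


def magnitude_cap_pct(closes: list[float], horizon_days: int, k: float = 1.5) -> float:
    """Cap in percent: k * sigma_daily * sqrt(horizon_days).

    ASSUMPTION: sigma_daily is the sample stdev of simple daily returns.
    """
    returns = [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]
    sigma_daily = statistics.stdev(returns)
    return k * sigma_daily * math.sqrt(horizon_days) * 100.0


def apply_cap(predicted_pct: float, cap_pct: float) -> float:
    """Pull a prediction down to the cap while keeping its sign."""
    return math.copysign(min(abs(predicted_pct), cap_pct), predicted_pct)
```

Working backwards from the figure quoted above, a 2-day cap near 4.5% corresponds to a daily σ of about 2.1%, since 1.5 × 2.1% × √2 ≈ 4.5%. With that cap, `apply_cap(10.0, 4.5)` returns 4.5 and a +2.5% prediction passes through unchanged.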

What we learned

Three insights we wouldn't have had without the automated autopsy system:

Rules of the form "geopolitical event → price move" are weakened by the speed of market pricing. In the era of always-on news and algo trading, the lag between an event and its price integration has gone from days to minutes. Our rules, designed on a "reaction within a few days" horizon, are structurally late.

Concordance of multiple rules isn't confirmation, it's an alarm signal. When four different rules go the same way in 48h, it's not that conviction is stronger; it's that the market is exhausting the news.

Radical transparency has an image cost but a methodological benefit. Publishing that we missed four WTI calls in a row isn't comfortable. But without that transparency, we wouldn't have spotted the pattern, wouldn't have reconfigured the rules, and would repeat the same mistake two months from now in the next crisis.

See the autopsies live

All Mistral-generated autopsies are visible in the public scoreboard, "Error autopsy" section. Each card is expandable and shows the AI verdict with its colored badge (wrong direction, magnitude too ambitious, already priced, regime mismatch, etc.) and the contextual analysis.

It's an educational tool for you, and a discipline tool for us. When you read "magnitude too ambitious" five times across ten consecutive signals, you no longer have the luxury of ignoring the diagnosis.

The next iteration will be quantitative: as the autopsy base grows, we'll cross-tab cause × rule × regime × asset and identify exactly which rules deserve to be deactivated, recalibrated or refounded. But it starts by publishing errors as they happen — not after they've been forgotten.
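As a rough idea of what that cross-tabulation could look like, here is a sketch that counts autopsies by any combination of fields. The `cross_tab` helper and the record layout are hypothetical; the five rows are the missed signals from the table earlier in this article, and the regime dimension is left out because it is not given for them.

```python
from collections import Counter

# The five misses from the Iran window, as listed in the autopsy table.
AUTOPSIES = [
    {"rule": "mideast_conflict_oil", "asset": "WTI", "cause": "wrong_direction"},
    {"rule": "oil_chokepoint", "asset": "WTI", "cause": "wrong_direction"},
    {"rule": "nlp_signal_wti", "asset": "WTI", "cause": "wrong_direction"},
    {"rule": "sanctions_announced", "asset": "WTI", "cause": "wrong_direction"},
    {"rule": "nuclear_tensions", "asset": "XAU", "cause": "wrong_direction"},
]


def cross_tab(autopsies: list[dict], keys: tuple[str, ...]) -> Counter:
    """Count autopsies per combination of the given fields."""
    return Counter(tuple(a[k] for k in keys) for a in autopsies)


counts = cross_tab(AUTOPSIES, ("cause", "asset"))
```

Here `counts[("wrong_direction", "WTI")]` is 4 and `counts[("wrong_direction", "XAU")]` is 1; adding `"rule"` or a regime field to `keys` gives the finer breakdown.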
