The question I keep hearing isn't 'what can it do?' It's 'how can I trust it?'
David Yenicelik
Founder
Last week, Matt Shumer pointed out something that’s clearly accelerating in AI. Models are starting to help build the next generation of models. Tasks that used to take hours now take minutes. The improvement curve looks steep enough to make you wonder where the ceiling is.
But speed is not structure. Capability doesn’t decide where a system belongs in the stack. If anything, the faster the system, the more important its placement and controls become.
Matt lists finance among the domains he expects to change quickly. As we’re building Stingray, we’ve been having a lot of conversations with stakeholders in the finance industry. Risk managers, investors, traders, treasurers. Individuals and small teams trying to automate pieces of their workflow without creating new failure modes.
And the question I keep hearing is not “what can it do?” but rather: “how can I trust it?”
When speed meets markets
Markets have already stress-tested fast autonomous systems.
Knight Capital didn’t fail because someone built an evil algorithm. A deployment revived an old code path. The system started trading. It kept trading. Forty-five minutes later, 460 million dollars were gone. Nothing novel. No adversary. Just a machine with permission to act, no hard boundary that stopped it, and no way of judging whether it was out of its depth.
The Flash Crash was similar. Independent algorithms, each doing what they were designed to do, feeding into each other in a tight loop. Liquidity disappeared. Prices collapsed. Then snapped back. Again, no single rogue intelligence. Just unbounded speed in a tightly coupled system.
The response was predictable. Circuit breakers. Pre-trade risk caps. Kill switches. Controls that sit between the intent and the execution.
Finance makes this concrete. In high-stakes environments, the goal isn’t survival in the abstract. It’s risk-adjusted returns. An agent that generates attractive returns while carrying a small probability of a monumental blow-up is not sophisticated. It’s mispriced risk.
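To put rough numbers on that, here is a small sketch. The figures are hypothetical and chosen only for illustration: a strategy that earns 2% in a normal month and has a 1% chance each month of losing everything. It assumes the months are independent.

```python
def expected_period_return(gain: float, p_blowup: float, loss_on_blowup: float) -> float:
    """Probability-weighted return for a single period."""
    return (1.0 - p_blowup) * gain + p_blowup * loss_on_blowup


def survival_probability(p_blowup: float, periods: int) -> float:
    """Chance of getting through `periods` independent periods with no blow-up."""
    return (1.0 - p_blowup) ** periods


# Hypothetical inputs, not data from any real strategy.
monthly_expectation = expected_period_return(gain=0.02, p_blowup=0.01, loss_on_blowup=-1.0)
five_year_survival = survival_probability(p_blowup=0.01, periods=60)

print(f"expected monthly return: {monthly_expectation:.2%}")
print(f"chance of surviving five years: {five_year_survival:.1%}")
```

Under these assumptions the average month still looks positive, at just under 1%, while the chance of reaching year five intact is only a little better than a coin flip. The average hides exactly the part that matters.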
Tail risk matters more than averages
In infrastructure engineering, you don’t optimize for the average case. You look at P95 and P99, the 95th and 99th percentiles. In markets, you look at tail risk. The question isn’t “how often is it right?” It’s “what happens in the worst case?”
There’s a reason for that. You can prove a system is unsafe with a single counterexample. One catastrophic failure invalidates a thousand successful runs. You can’t prove a system is safe by showing it worked yesterday, or even a thousand times in a row. Safety is falsifiable. Robustness isn’t provable by repetition alone.
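A quick way to see how little a clean track record proves is to ask what failure rate is still consistent with it. The sketch below is my own illustration, and it assumes every run is independent with the same failure probability, which a regime change would break. The function name is made up for this example.

```python
def failure_rate_upper_bound(clean_runs: int, confidence: float = 0.95) -> float:
    """Largest per-run failure probability still consistent with
    `clean_runs` consecutive successes at the given confidence.

    Solves (1 - p) ** clean_runs = 1 - confidence for p.
    Assumes independent runs with an identical failure probability.
    """
    if clean_runs <= 0:
        raise ValueError("clean_runs must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)


bound = failure_rate_upper_bound(1000)
print(f"after 1000 clean runs, failure rate could still be up to {bound:.2%}")
```

A thousand clean runs still leaves room for a failure roughly once in every 330 runs. If that one failure is catastrophic, the track record has told you very little.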
Bounded systems
Aviation figured this out decades ago. Autopilot flies most of the time. But it flies inside an envelope. It can be overridden instantly. When sensors fail, it disengages. The machine handles the stable regime. The human handles the regime break.
That boundary is built into the architecture. With LLM-based systems it has to be, because the edge of the envelope is not something the system can reliably predict for itself. These systems are probabilistic by design. Small changes in phrasing can change outputs. They hallucinate. They drift. They improve quickly.
Impressive. Not a risk framework.
If every output requires full manual verification, the agent hasn’t reduced risk. It’s only moved it. Automation that requires constant auditing isn’t autonomy. It’s cognitive overhead.
So a useful agent is trustworthy, which in turn means bounded. In finance, it shouldn’t increase exposure beyond a hard limit. It shouldn’t keep trading past a predefined loss threshold. It shouldn’t silently change strategy because the prompt evolved.
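As a minimal sketch of what “bounded” can mean in code, here is a guard that sits between an agent’s proposal and execution. All names and numbers are hypothetical. The limits live in a frozen object outside the prompt, so a change in prompting cannot move them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hard limits set by a human. Frozen so they cannot be mutated at runtime."""
    max_exposure: float    # cap on absolute position value
    max_daily_loss: float  # positive number; trading halts once losses reach it


@dataclass
class Guard:
    limits: Limits
    exposure: float = 0.0      # signed value of the current position
    realised_pnl: float = 0.0
    halted: bool = False

    def record_pnl(self, pnl: float) -> None:
        self.realised_pnl += pnl
        if self.realised_pnl <= -self.limits.max_daily_loss:
            # Latches: nothing in this class clears it. A human has to reset.
            self.halted = True

    def check(self, order_value: float) -> bool:
        """Return True only if the proposed order may be sent."""
        if self.halted:
            return False
        return abs(self.exposure + order_value) <= self.limits.max_exposure

    def apply_fill(self, order_value: float) -> None:
        self.exposure += order_value


guard = Guard(Limits(max_exposure=1_000.0, max_daily_loss=100.0))
print(guard.check(800.0))   # inside the cap
guard.apply_fill(800.0)
print(guard.check(300.0))   # would breach the cap
guard.record_pnl(-120.0)    # loss threshold crossed, guard halts
print(guard.check(-300.0))  # rejected: halted, even though it reduces exposure
```

One design choice worth flagging: once halted, this sketch rejects everything, including orders that would reduce exposure. That assumes a human takes over the unwind. A real system might allow risk-reducing orders through, and that decision belongs to the risk owner, not the agent.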
Then comes legibility. You should be able to reconstruct why it proposed a trade, audit what it saw, and explain it to an LP or a risk committee a week later without storytelling.
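One way to make that reconstruction possible is to write a decision record at the moment of each proposal. The sketch below is an assumption about what such a record might hold, not a description of any particular product. Every field name is hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    timestamp: str      # supplied by the caller, e.g. an ISO 8601 string
    model_version: str  # which model produced the proposal
    prompt_sha256: str  # fingerprint of the exact prompt used
    inputs: dict        # the data the agent actually saw
    proposal: dict      # what it proposed
    rationale: str      # the stated reasoning, captured at decision time


def fingerprint(text: str) -> str:
    """SHA-256 hex digest, so a later prompt change is detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def to_log_line(record: DecisionRecord) -> str:
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


record = DecisionRecord(
    timestamp="2025-01-01T09:30:00Z",
    model_version="example-model-v1",
    prompt_sha256=fingerprint("example prompt text"),
    inputs={"symbol": "XYZ", "last_price": 101.5},
    proposal={"symbol": "XYZ", "side": "buy", "value": 500.0},
    rationale="Example rationale text.",
)
print(to_log_line(record))
```

The point of capturing the rationale and inputs up front is that the explanation a week later comes from the log, not from memory.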
And then stability under stress. When data quality drops, it escalates. When uncertainty increases, it does less, not more. When the regime shifts, it pauses and hands control back cleanly.
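Those three behaviours can be written down as a small decision function. This is a sketch under stated assumptions: the data quality score, the uncertainty score, and the regime-shift flag are all hypothetical inputs that some separate monitor would have to produce, and producing them well is the hard part.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"  # hand the decision to a human
    PAUSE = "pause"        # stop and return control


def decide(
    base_size: float,
    data_quality: float,  # assumed score in [0, 1], higher is better
    uncertainty: float,   # assumed score in [0, 1], higher is less certain
    regime_shift: bool,   # assumed flag from an external monitor
    min_quality: float = 0.8,
) -> Tuple[Action, float]:
    """Return the action to take and the position size allowed."""
    if regime_shift:
        return Action.PAUSE, 0.0
    if data_quality < min_quality:
        return Action.ESCALATE, 0.0
    clamped = min(max(uncertainty, 0.0), 1.0)
    # More uncertainty means a smaller size, never a larger one.
    return Action.PROCEED, base_size * (1.0 - clamped)


print(decide(100.0, data_quality=0.95, uncertainty=0.25, regime_shift=False))
print(decide(100.0, data_quality=0.50, uncertainty=0.00, regime_shift=False))
print(decide(100.0, data_quality=1.00, uncertainty=0.00, regime_shift=True))
```

The ordering is deliberate. The regime check comes first and the sizing comes last, so no amount of confidence in a signal can override a pause.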
Detecting the edge of the envelope
The harder question isn’t whether the agent can act. It’s whether you can detect when it’s left the envelope.
Do you know when your agent is out of its depth?
Agents are great at watching, summarizing, simulating, and proposing. They can help you keep up with an always-on market without living inside ten dashboards.
Execution authority is inherently different. Acceleration makes agents more useful. It doesn’t automatically grant them judgment. Capability can scale. Permission should not.
The real test is simple. When the next regime break hits, does your agent panic, or does it pause?