Why We Wrote Down Our Viewpoint
After six months of building ThruWire, we kept running into the same set of structural problems, over and over, across AI systems. These weren't edge cases or one-off bugs; they were fundamental issues that show up as soon as you try to move beyond a demo and actually rely on AI to do real work. That repetition pushed us to make our point of view explicit.
What surprised us wasn’t that these problems existed. It was how consistently they appeared, regardless of model, stack, or approach. You get something working in a chat, you guide it carefully, you remind it what context to use and what to ignore, and eventually it produces something useful. But that result is fragile. The path that led to it isn’t preserved in any meaningful way. The context that made it work isn’t clearly defined. And the next time you try to do something similar, you have the same raw materials, and maybe some memories, but you still have to recreate the thread that got you there.
It feels like trying to catch lightning in a bottle, over and over again.
That experience led us to step back and ask a different question. Instead of asking how to get better outputs from models, we started asking why the systems around those models feel so unreliable, so hard to reuse, and so difficult to improve incrementally. The answer we kept coming back to was that most AI systems today aren’t really systems in the traditional sense. They’re sequences of calls, loosely connected by prompts and context, with very few hard guarantees about how they behave.
So we decided to write down our viewpoint.
This isn’t a marketing exercise; it’s a way to be explicit about what we believe is actually going wrong, and what has to change if AI is going to move from something you “use” in a chat to something you can depend on as part of a system.
A big part of that point of view is that most of what people experience as “non-determinism” in AI isn’t really coming from the model. It’s coming from the system. Context gets assembled slightly differently, dependencies resolve to different intermediate results, execution order shifts, and intermediate work isn’t preserved. When all of that is left implicit, two runs that look similar aren’t actually the same, and there’s no way to tell whether a difference in output came from the model or from everything around it. That makes debugging and trust fundamentally hard.
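One way to make that distinction concrete is to fingerprint everything the model sees before each call. Here is a minimal sketch in Python (the function name and inputs are hypothetical, not ThruWire’s API): if two runs have different fingerprints, the difference came from the system around the model, not from the model itself.

```python
import hashlib
import json

def run_fingerprint(prompt, context_parts, dep_results):
    # Canonicalize everything the model will see *before* the call:
    # the prompt, the assembled context, and resolved dependency results.
    payload = json.dumps(
        {"prompt": prompt, "context": context_parts, "deps": dep_results},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Two "similar" runs: same prompt, but a dependency resolved differently.
a = run_fingerprint("summarize", ["doc v1"], {"fetch": "rows=120"})
b = run_fingerprint("summarize", ["doc v1"], {"fetch": "rows=118"})
assert a != b  # the drift is in the system's inputs, not the model
```

With fingerprints like this, a differing output between two runs with equal fingerprints points at the model; differing fingerprints point at context assembly or dependency resolution.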
Another piece is how we think about reuse. Everyone tries to cache AI outputs at some point. It works in simple cases and then quietly breaks as systems become more complex. The root issue isn’t that caching is hard or that models are probabilistic. It’s that most systems don’t have a real definition of what makes two executions “the same.” If identity is just “same prompt,” then as soon as you introduce dependencies, dynamic context, and structure, that definition collapses. Without a precise notion of execution identity, reuse is always either too aggressive or not aggressive enough, and you can’t safely build on prior work.
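A sketch of what an execution-identity-based cache might look like (class and method names are illustrative, not ThruWire’s implementation): identity covers the prompt, the context, and the dependency results, so two calls with the same prompt but different dependencies are correctly treated as different executions.

```python
import hashlib
import json

class ExecutionCache:
    """Reuse a result only when two executions are provably 'the same'."""

    def __init__(self):
        self._store = {}

    def identity(self, prompt, context, dep_results):
        # Identity is derived from everything that shaped the execution,
        # not just the prompt text.
        payload = json.dumps(
            {"prompt": prompt, "context": context, "deps": dep_results},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_run(self, prompt, context, dep_results, run):
        key = self.identity(prompt, context, dep_results)
        if key not in self._store:
            self._store[key] = run()
        return self._store[key]

cache = ExecutionCache()
hit = cache.get_or_run("summarize", "q3", {"fetch": "abc"}, lambda: "summary-1")
# Same prompt, different dependency result: not "the same" execution.
miss = cache.get_or_run("summarize", "q3", {"fetch": "def"}, lambda: "summary-2")
assert hit == "summary-1" and miss == "summary-2"
```

A prompt-only key would have returned `summary-1` for both calls, which is exactly the “too aggressive” failure mode described above.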
We also found ourselves questioning how context is handled. There’s a constant tension between starting fresh to avoid drift and carrying forward long histories to preserve continuity. In practice, both approaches degrade. Fresh runs lose useful work. Long-running contexts become messy, slow, and harder to reason about. What this exposed for us is that context is being treated as something to manage manually, instead of something the system should produce. If the system actually executes and materializes its dependencies, then context doesn’t need to be continuously appended. It emerges from the work that has already been done.
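The idea that context emerges from materialized work, rather than from an ever-growing transcript, can be sketched in a few lines (all names here are hypothetical illustrations): each step’s context is simply the outputs of its dependencies, computed on demand and cached.

```python
class Step:
    """A unit of work whose context is derived from its dependencies."""

    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, deps

def materialize(step, results):
    if step.name in results:
        return results[step.name]
    # The context for this step *is* its dependencies' materialized outputs;
    # nothing is appended to a running history.
    context = {d.name: materialize(d, results) for d in step.deps}
    results[step.name] = step.fn(context)
    return results[step.name]

fetch = Step("fetch", lambda ctx: "rows=120")
clean = Step("clean", lambda ctx: ctx["fetch"] + " (validated)", deps=(fetch,))
report = Step("report", lambda ctx: f"report over {ctx['clean']}", deps=(clean,))

results = {}
out = materialize(report, results)
# out == "report over rows=120 (validated)"
```

Starting “fresh” here costs nothing, because the context isn’t history to be preserved; it is recomputed (or reused) from work that already exists.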
The same pattern showed up when we looked at reasoning. There’s a lot of emphasis on chain-of-thought as a way to understand what models are doing. But when we tried to rely on it, it became clear that what models stream isn’t their actual internal reasoning. It’s not the hidden state or the real scratch work. It’s a reconstruction, generated after the fact, optimized for readability. It leaves out the discarded paths and the real decision process. That makes it interesting to read, but not something you can build a system on. If reasoning is going to matter, it has to exist as structured, persistent state that can be inspected, reused, and refined, not just as text in a stream.
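As a contrast to streamed text, here is a minimal sketch of reasoning as structured, persistent state (the dataclasses and field names are assumptions for illustration): discarded paths are recorded alongside the kept ones, so the decision process can be inspected after the fact.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    claim: str
    kept: bool                      # discarded branches are preserved too
    evidence: list = field(default_factory=list)

@dataclass
class ReasoningTrace:
    steps: list = field(default_factory=list)

    def record(self, claim, kept, evidence=()):
        self.steps.append(ReasoningStep(claim, kept, list(evidence)))

    def kept_path(self):
        # The path actually taken, recoverable and reusable later.
        return [s.claim for s in self.steps if s.kept]

trace = ReasoningTrace()
trace.record("try join on user_id", kept=False, evidence=["nulls in column"])
trace.record("join on email instead", kept=True)
```

Unlike a readable-but-reconstructed chain-of-thought stream, a trace like this keeps the rejected branch and its evidence, which is what makes inspection and refinement possible.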
We saw something similar with the boundary between models and tools. The common pattern is that models think and tools act. But in practice, many “tools” require reasoning: transforming data, synthesizing outputs, coordinating multiple steps. At that point, either everything collapses back into a single opaque model call, or you allow tools themselves to incorporate reasoning. We think the second path is the only one that scales. It lets you build systems as compositions of smaller reasoning units instead of one large, monolithic one.
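The “tools that reason” pattern can be sketched as follows (a hypothetical illustration; `reason` stands in for a scoped model call and is not a real API): each tool carries its own scoped reasoning step, and the pipeline composes these small units instead of collapsing into one opaque call.

```python
def reason(task, inputs):
    # Stand-in for a scoped model call with its own narrow context.
    return f"<decision for {task} given {sorted(inputs)}>"

def transform_tool(records):
    # A "tool" that requires judgment: it reasons about how to transform,
    # then acts. The reasoning stays scoped to this unit.
    plan = reason("choose-transform", {"records": len(records)})
    return {"plan": plan, "output": [r.upper() for r in records]}

def pipeline(records):
    # Composition of small reasoning units, not one monolithic call.
    step1 = transform_tool(records)
    return reason("summarize", {"n": len(step1["output"])})
```

Because each unit’s reasoning is scoped and its output is structured, the pieces can be tested, cached, and recombined independently.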
And finally, all of this comes back to the interface. Chat has become the default way to interact with AI, and it’s incredibly powerful for exploration. But it’s also inherently ephemeral. It exposes a stream of messages, not a structured system. Streaming tokens makes things feel faster, but it doesn’t solve the underlying problem that the system’s state isn’t clearly defined or preserved. Once you try to build something more complex, you feel that limitation immediately.
Our point of view is that there’s a different way to work with AI, one that is less about conversations and more about execution: work is structured, state is explicit, identity is defined, and results can be reused and improved over time.
That’s what we’re building toward with ThruWire.
Writing this down is our way of being clear about where we stand and what we think needs to change. We expect parts of it to be wrong or incomplete. But these are the constraints that have held up for us so far, and they’re the ones shaping the system we’re building.
If any of this resonates, you’ve probably run into the same problems we have.