The format was immaculate: every field properly typed, every attribute carefully specified, the suggestion clear and actionable.
The current narrative around structured outputs is compelling. OpenAI, LlamaIndex, and others promote them as the solution to AI reliability. The pitch is seductive: define a schema, get perfectly formatted responses, never worry about parsing again.
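In code, the pitch looks something like this. This is a minimal sketch using the OpenAI Node SDK's Zod helper; the Suggestion schema and the prompt are hypothetical stand-ins, not the schema from the example above:

```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Hypothetical schema for a refactoring suggestion.
const Suggestion = z.object({
  file: z.string(),
  line: z.number(),
  change: z.string(),
  rationale: z.string(),
});

const client = new OpenAI();

const completion = await client.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [
    { role: "user", content: "Suggest an improvement to this function: ..." },
  ],
  response_format: zodResponseFormat(Suggestion, "suggestion"),
});

// Guaranteed to match the schema. Not guaranteed to be correct.
const suggestion = completion.choices[0].message.parsed;
console.log(suggestion?.change);
```

And to be fair, this part works: you really do stop worrying about parsing.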
The response was perfectly structured and every field correctly typed; the suggestion seemed entirely logical. There was just one critical problem: getUser() is idempotent and caches its results, while validateToken() has a side effect that refreshes the token. The suggested reordering would break our token refresh mechanism.
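To make the failure concrete, here is a sketch of the behavior the model couldn't see. The function bodies are hypothetical reconstructions for illustration, not our actual code:

```typescript
type User = { id: string; name: string };
type Token = { value: string; expiresAt: number };

const userCache = new Map<string, User>();

// Idempotent: repeated calls hit the cache, so call order is irrelevant.
async function getUser(id: string): Promise<User> {
  const cached = userCache.get(id);
  if (cached) return cached;
  const user: User = { id, name: `user-${id}` }; // stand-in for a real lookup
  userCache.set(id, user);
  return user;
}

// NOT idempotent: a successful validation also slides the token's expiry.
async function validateToken(token: Token): Promise<boolean> {
  const valid = token.expiresAt > Date.now();
  if (valid) {
    token.expiresAt = Date.now() + 60_000; // side effect: token refresh
  }
  return valid;
}
```

At the type level the two calls look interchangeable, and a type signature is essentially all a response schema can capture. The ordering constraint lives in runtime side effects, so swapping the calls silently changes when the refresh happens.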
This isn’t a parsing error or a hallucination in the traditional sense. The AI followed its output schema perfectly. It just made incorrect assumptions about our codebase’s behavior.
I’ve found that the real challenge lies in what happens before and after those perfectly formatted responses. Building AI tooling has taught me that we’re optimizing for the wrong thing. While everyone focuses on making outputs more structured, the hard problems sit elsewhere: