Are we truly ready for large model application development, or are we still stuck in the mindset of “as long as it works”?
Over the past few decades, software engineering has focused on reducing system risk and uncertainty. We have built methodology after methodology to support rapid business growth: TDD, BDD, DDD, each guiding us to minimize project uncertainty and keep systems stable. The advent of large language models (LLMs), however, challenges that stability and introduces a new reality: not everything will be stable anymore.
The inherent complexity of LLMs means their behavior is neither as stable nor as predictable as code-based logic. In traditional software engineering, well-designed code gives us a deterministic output for a given input. With LLMs, we cannot be sure that the same input will always yield the same output. This opens a crack in the traditional approach: even when the logic of your system is perfectly clear, its output can still be random, and that randomness is a new source of instability.
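To make the contrast concrete, here is a minimal sketch. It assumes the OpenAI Python client and the model name `gpt-4o-mini` purely as placeholders; any LLM SDK shows the same effect. A pure function returns the same result on every call, while two calls to a model with an identical prompt may come back worded differently.

```python
from openai import OpenAI  # placeholder client; any LLM SDK exhibits the same behavior

client = OpenAI()

def add_tax(price: float) -> float:
    """Deterministic, code-based logic: same input, same output, every time."""
    return round(price * 1.08, 2)

def call_llm(prompt: str) -> str:
    """Non-deterministic: the same prompt may yield a different completion each call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, used here only for illustration
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(add_tax(100.0) == add_tax(100.0))      # always True

prompt = "Summarize our return policy in one sentence."
print(call_llm(prompt) == call_llm(prompt))  # often False: wording, length, and detail can drift
```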
This randomness is why development in the LLM application era differs from what we are used to: we must now treat model evaluation as a first-class activity. Evaluation, once a peripheral concern left to algorithm specialists, is now crucial for every LLM application developer. To put it bluntly, evaluation is the business.
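What does treating evaluation as first-class look like in practice? Below is a minimal sketch that reuses the hypothetical `call_llm` helper from the previous example: a tiny evaluation set of prompts paired with facts the answer must contain, a crude pass/fail check, and an aggregate score. Real evaluation suites go far beyond keyword checks (rubrics, model-graded scoring, regression tracking), but even this skeleton turns "it seems to work" into a number you can watch over time.

```python
# A toy evaluation set: each case pairs a prompt with facts the answer must mention.
EVAL_CASES = [
    {"prompt": "What is the capital of France?", "must_contain": ["Paris"]},
    {"prompt": "How many days are in a leap year?", "must_contain": ["366"]},
]

def passes(answer: str, must_contain: list[str]) -> bool:
    """Crude check: the answer passes if it mentions every required fact."""
    answer = answer.lower()
    return all(fact.lower() in answer for fact in must_contain)

def run_eval() -> float:
    """Run every case through the model and return the pass rate."""
    results = []
    for case in EVAL_CASES:
        answer = call_llm(case["prompt"])  # call_llm is the helper sketched above
        ok = passes(answer, case["must_contain"])
        results.append(ok)
        print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']}")
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```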