I've often found myself uneasy when LLMs (Large Language Models) are asked to quote the Bible. While they can provide insightful discussions about faith, their tendency to hallucinate responses raises concerns when dealing with scripture, which we regard as the inspired Word of God.
To explore these concerns, I created a benchmark to evaluate how accurately LLMs can recall scripture word for word. Here's a breakdown of my methodology and the test results.
To ensure a consistent and fair evaluation, I tested each model against six scenarios designed to measure its ability to recall scripture accurately. For readers interested in the technical details, the source code for the tests is available here. All tests were conducted with a temperature setting of 0, and I gave the models some slack by making the pass check case- and whitespace-insensitive.
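As a rough illustration, a pass check along these lines might normalise case and whitespace before comparing the model's output with the reference text. The function name and exact normalisation rules below are my own sketch rather than the benchmark's actual code:

```python
import re

def passes_check(model_output: str, reference_verse: str) -> bool:
    """Case- and whitespace-insensitive comparison of a model's quotation
    against the reference scripture text."""
    def normalise(text: str) -> str:
        # Lowercase and collapse every run of whitespace into a single space.
        return re.sub(r"\s+", " ", text.strip().lower())
    return normalise(model_output) == normalise(reference_verse)
```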
A temperature of 0 makes the models pick the most statistically probable token at each step, minimising creativity and variability and prioritising accuracy. This is particularly important when evaluating fixed reference material like the Bible, where precise wording matters.
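For concreteness, a single test call might look something like the sketch below, assuming an OpenAI-style chat client; the model name and prompt wording are placeholders rather than the exact ones used in the benchmark:

```python
from openai import OpenAI

client = OpenAI()

def quote_verse(reference: str, model: str = "gpt-4o") -> str:
    """Ask the model to quote a verse verbatim, with temperature 0 so the
    most probable token is chosen at every step."""
    response = client.chat.completions.create(
        model=model,          # placeholder model name
        temperature=0,        # deterministic, greedy decoding
        messages=[
            {"role": "user",
             "content": f"Quote {reference} word for word."},
        ],
    )
    return response.choices[0].message.content
```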