Following LLM Manufacturer’s Instructions


As uses of LLMs coalesce around specific applications, it remains tricky to know where to start given the range of LLM providers, base models and the guidance each one offers. Model benchmarks and leaderboards do not account for differences in how providers recommend deploying their models as part of a system. Model providers often publish worked examples for different use cases, and these can be the first stop for guidance when comparing and building an AI system. To establish a baseline for RAG system performance and security based on model providers’ guidance, we compared five tutorials covering the RAG use case, published by Meta, Anthropic, Cohere, Mistral AI, and OpenAI, and added the guardrails each provider recommends. Following the providers’ guidance, we built and tested five systems that answer questions about a particular document, based on Llama 3.1, Claude Haiku, Cohere Command-R, Mistral Large and GPT-4o. We then scanned each system’s question-answering ability and security posture using common evaluation techniques. The goal was not to claim any model is better than the others but to observe differences in the systems exemplified by the model providers. Question-answering performance showed relatively small differences, while the security profiles differed substantially. Overall, systems that included a dedicated LLM call for moderation fared better than those that relied on a “safety” prompt.
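
To make the last point concrete, the sketch below contrasts the two guardrail styles in a generic RAG loop: folding a safety instruction into the main prompt versus making a separate moderation call. It is a minimal illustration under assumed names; `call_llm`, `retrieve_chunks`, and the prompt wording are hypothetical placeholders, not any provider’s actual API or the exact prompts used in the tutorials.

```python
# Hypothetical sketch of the two guardrail styles discussed above.
# call_llm and retrieve_chunks are placeholders for a provider's chat
# completion call and a vector-store retriever; neither is a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to whichever provider is used."""
    raise NotImplementedError

def retrieve_chunks(question: str, k: int = 4) -> list[str]:
    """Stand-in for retrieval over the indexed document."""
    raise NotImplementedError

SAFETY_PROMPT = (
    "Answer only from the provided context. "
    "Refuse requests that are harmful or unrelated to the document."
)

def answer_with_safety_prompt(question: str) -> str:
    # Style 1: the guardrail is an instruction folded into the main prompt.
    context = "\n".join(retrieve_chunks(question))
    prompt = f"{SAFETY_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

def answer_with_moderation_call(question: str) -> str:
    # Style 2: a dedicated LLM call screens the request before (and could
    # also screen the answer after) the main RAG call.
    verdict = call_llm(f"Classify this request as SAFE or UNSAFE:\n{question}")
    if "UNSAFE" in verdict.upper():
        return "I can't help with that request."
    context = "\n".join(retrieve_chunks(question))
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")
```

The second style costs an extra model call per request, which is the trade-off behind the security differences observed above.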

There are many leaderboards and benchmarks of LLMs, but most consist of running a standard battery of tests on the base model itself. In production, however, it’s the system performance that matters, and the model is only one component of that system. Many other design choices can influence how well the system performs at its task.
