Reflection 70B is (currently) the world's top open-source LLM, trained with a new technique called Reflection-Tuning that teaches a LLM to detect mistakes in its reasoning and correct course.
All benchmarks tested have been checked for contamination by running LMSys's LLM Decontaminator. When benchmarking, we isolate the <output> and benchmark on solely that section.
Trained from Llama 3.1 70B Instruct, you can sample from Reflection 70B using the same code, pipelines, etc. as any other Llama model. It even uses the stock Llama 3.1 chat template format (though, we've trained in a few new special tokens to aid in reasoning and reflection).
During sampling, the model will start by outputting reasoning inside <thinking> and </thinking> tags, and then once it is satisfied with its reasoning, it will output the final answer inside <output> and </output> tags. Each of these tags are special tokens, trained into the model.
This enables the model to separate its internal thoughts and reasoning from its final answer, improving the experience for the user.