Submitted by Style Pass, 2024-05-07 22:00:07

Each pair of models plays two matches, with the positions of the two responses swapped in the evaluation prompt between matches. A model is considered the winner only if it wins both matches.
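The two-match rule can be sketched as follows. This is a minimal illustration, not the actual evaluation harness: `judge_pair` and the `judge` callable are hypothetical names, and the judge is assumed to return 0 when the first-listed response wins and 1 when the second-listed response wins.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (first_response, second_response) -> 0 or 1,
# where 0 means the first-listed response wins and 1 means the second wins.
Judge = Callable[[str, str], int]


def judge_pair(judge: Judge, response_a: str, response_b: str) -> Optional[str]:
    """Run two matches with swapped positions; return 'a', 'b', or None."""
    first = judge(response_a, response_b)   # A is listed first
    second = judge(response_b, response_a)  # B is listed first

    a_won_first = first == 0
    a_won_second = second == 1

    if a_won_first and a_won_second:
        return "a"
    if not a_won_first and not a_won_second:
        return "b"
    # The verdict flipped with the ordering, so neither model wins both matches.
    return None
```

A judge that always prefers whichever response is listed first yields `None` for every pair, which is how the swap filters out purely positional preferences.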

For each prompt, we then compute Bradley-Terry scores for the respective models using the same method as that used in the LMSYS Chatbot Arena Leaderboard. Finally, we normalize all scores to a scale from 0 to 1 for interoperability with other weighted ranking systems.
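As a rough sketch of this scoring step, the code below fits Bradley-Terry strengths from a list of (winner, loser) outcomes and rescales them to the range 0 to 1. Several things here are assumptions rather than details from the text: the fit uses the classic minorization-maximization iteration, which may differ from the leaderboard's own fitting procedure; matches with no winner are simply left out of the input; and the normalization is min-max over raw strengths. The function names are illustrative.

```python
from collections import defaultdict


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the minorization-maximization update
        p_i <- W_i / sum_j n_ij / (p_i + p_j)
    where W_i is the win count of i and n_ij the number of games between i and j.
    """
    models = sorted({m for pair in wins for m in pair})
    win_count = defaultdict(float)
    games = defaultdict(float)
    for winner, loser in wins:
        win_count[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games[(i, j)] / (p[i] + p[j])
                for j in models
                if j != i and games[(i, j)] > 0 and p[i] + p[j] > 0
            )
            new_p[i] = win_count[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}  # keep strengths summing to 1
    return p


def min_max(scores):
    """Rescale scores to [0, 1]; an assumed normalization, not confirmed by the text."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {m: 0.5 for m in scores}
    return {m: (v - lo) / (hi - lo) for m, v in scores.items()}
```

With two models where the first wins three of four decided matches, the fitted strengths are 0.75 and 0.25, matching the closed-form maximum-likelihood solution for that case.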

The embedding model was generated by first fine-tuning BAAI/bge-base-en-v1.5 with the intent categories from the dataset above, using contrastive learning with cosine similarity loss, and subsequently merging the resultant model with the base model at a 3:2 ratio.
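One common way to merge two checkpoints at a fixed ratio is linear interpolation of their parameters, sketched below. This is an assumption about the merge method, and it is also an assumption that the 3 in 3:2 applies to the fine-tuned model and the 2 to the base model. `merge_weights` is a hypothetical helper; it works on any name-to-value mapping whose values support `*` and `+`, such as PyTorch state dicts with matching keys.

```python
def merge_weights(fine_tuned, base, ratio=(3, 2)):
    """Linearly interpolate two parameter mappings with matching keys.

    ratio=(3, 2) gives the fine-tuned parameters weight 3/5 and the base
    parameters weight 2/5 (assumed orientation of the 3:2 ratio).
    """
    if fine_tuned.keys() != base.keys():
        raise ValueError("both models must have the same parameter names")
    w = ratio[0] / (ratio[0] + ratio[1])
    return {
        name: w * fine_tuned[name] + (1 - w) * base[name]
        for name in fine_tuned
    }
```

For example, a parameter that is 1.0 in the fine-tuned model and 0.0 in the base model comes out as 0.6 after merging.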
