A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?


Figure 3: Our evaluation pipeline covers different evaluation (a) aspects, each containing various tasks. We collect multiple (b) datasets for each task and combine them with various (c) prompt strategies to evaluate the latest (d) language models. We leverage a comprehensive set of (e) evaluations to present a holistic view of model progress in the medical domain.
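In code, this pipeline amounts to a nested sweep over those components. Below is a minimal sketch, assuming hypothetical aspect, task, dataset, and prompt-strategy names; everything here is illustrative and not the paper's actual configuration.

# Minimal sketch of the evaluation sweep: aspects -> tasks -> datasets ->
# prompt strategies -> models -> metrics. All names are placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalCase:
    aspect: str           # e.g. "understanding" or "reasoning" (assumed labels)
    task: str             # e.g. "question answering"
    dataset: str          # e.g. a hypothetical "med_qa_subset"
    prompt_strategy: str  # e.g. "direct", "few-shot", "chain-of-thought"

def run_evaluation(cases: List[EvalCase],
                   models: Dict[str, Callable[[str], str]],
                   metrics: Dict[str, Callable[[str, str], float]],
                   gold: Dict[str, str]):
    """Score every (case, model, metric) combination and return a flat result table."""
    rows = []
    for case in cases:
        prompt = f"[{case.prompt_strategy}] question drawn from {case.dataset}"
        for model_name, generate in models.items():
            prediction = generate(prompt)
            for metric_name, metric in metrics.items():
                rows.append({
                    "aspect": case.aspect, "task": case.task,
                    "model": model_name, "metric": metric_name,
                    "score": metric(prediction, gold[case.dataset]),
                })
    return rows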

Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model performances marked with * are taken from Wu et al. (2024) for reference. We also present the average score (Average) of each metric in the table.
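For the closed-ended tasks in Table 1, accuracy and F1 can be computed with standard tooling. A small sketch using scikit-learn, with made-up gold labels and predictions; the averaging scheme for F1 is an assumption here, not necessarily the one used in the paper.

# Illustrative accuracy/F1 computation with scikit-learn; labels are made up.
from sklearn.metrics import accuracy_score, f1_score

gold        = ["A", "B", "C", "A", "D"]   # hypothetical gold answers
predictions = ["A", "B", "A", "A", "D"]   # hypothetical model answers

acc = accuracy_score(gold, predictions)
f1  = f1_score(gold, predictions, average="macro")  # macro averaging is assumed
print(f"Acc. = {acc:.3f}, macro F1 = {f1:.3f}")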

Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. o1 results are highlighted with a gray background. We also present the average score (Average) of each metric.
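BLEU-1 and ROUGE-1 are unigram-overlap metrics: BLEU-1 is roughly the unigram precision of the generated answer against the reference, while ROUGE-1 recall is unigram recall. The toy sketch below only illustrates the idea; real evaluations typically use packages such as sacrebleu or rouge-score, which add smoothing, a brevity penalty, and often report ROUGE-1 as an F-measure.

# Toy unigram-overlap metrics (illustrative only, no smoothing or brevity penalty).
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision of candidate tokens against the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    return overlap / max(sum(cand.values()), 1)

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that appear in the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(bleu1("the patient has pneumonia", "the patient likely has pneumonia"))          # 1.0
print(rouge1_recall("the patient has pneumonia", "the patient likely has pneumonia"))  # 0.8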

Figure 4: Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process than GPT-4.
