Two papers show that large language models, including ChatGPT, can pass the USMLE
Two artificial intelligence (AI) programs — including ChatGPT — have passed the U.S. Medical Licensing Examination (USMLE), according to two recent papers.
The papers highlighted different approaches to using large language models to take the USMLE, which consists of three exams: Step 1, Step 2 CK, and Step 3.
ChatGPT is an AI chatbot that generates long-form text in response to prompts from human users. It was developed by OpenAI and became popular after several social media posts showed potential uses for the tool in clinical practice, often with mixed results.
The first paper, published on medRxiv in December, investigated ChatGPT’s performance on the USMLE without any special training or reinforcement prior to the exams. According to Victor Tseng, MD, of Ansible Health in Mountain View, California, and colleagues, the results showed “new and surprising evidence” that this AI tool was up to the challenge.
Tseng and team noted that ChatGPT performed at greater than 50% accuracy across all of the exams, and even achieved 60% in most of their analyses. While the USMLE passing threshold varies from year to year, the authors said that it is approximately 60% in most years.
“ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement,” they wrote, noting that the tool was able to demonstrate “a high level of concordance and insight in its explanations.”
“These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making,” they concluded.
The second paper, published on arXiv, also in December, evaluated the performance of another large language model, Flan-PaLM, on the USMLE. The key difference between the two models was that Flan-PaLM was heavily modified to prepare for the exams, using a collection of medical question-answering databases called MultiMedQA, explained Vivek Natarajan, an AI researcher, and colleagues.
Flan-PaLM achieved 67.6% accuracy in answering the USMLE questions, which was about 17 percentage points higher than the previous best performance, achieved by PubMed GPT.
Natarajan and team concluded that large language models “present a significant opportunity to rethink the development of medical AI and make it easier, safer and more equitable to use.”
ChatGPT, along with other AI programs, has been showing up as the subject (and sometimes as the co-author) of new research papers focused on testing the technology’s usefulness in medicine.
Of course, healthcare professionals have also expressed concerns over these developments, especially when ChatGPT is being listed as an author on research papers. A recent article from Nature highlighted the uneasiness from would-be colleagues and co-authors of the emerging technology.
One objection to the use of AI programs in research questioned whether they are truly capable of making meaningful scholarly contributions to a paper, while another emphasized that AI tools can’t consent to be a co-author in the first place.
The editor of one of the papers that listed ChatGPT as an author said it was an error that would be corrected, according to the Nature article. Still, researchers have now published several papers touting these AI programs as useful tools in medical education, research, and even clinical decision-making.
Natarajan and colleagues concluded in their paper that large language models could become a beneficial tool in medicine, but their first hope was that their findings would “spark further conversations and collaborations between patients, consumers, AI researchers, clinicians, social scientists, ethicists, policymakers and other interested people in order to responsibly translate these early research findings to improve healthcare.”
Courtesy: MedPageToday