Artificial intelligence (AI) startup Mendel and the University of Massachusetts Amherst (UMass Amherst) have jointly published a study detecting hallucinations in AI-generated medical summaries.

The study evaluated medical summaries generated by two large language models (LLMs), GPT-4o and Llama-3. It classifies the hallucinations into five categories based on where they occur in the structure of medical notes – patient information, patient history, symptoms / diagnosis / surgical procedures, medicine-related instructions, and follow-up.
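For readers who want to work with this taxonomy programmatically, the sketch below encodes the five categories as a simple data structure alongside a record type for an annotated hallucination. The class names and fields are illustrative assumptions, not the study's own schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationCategory(Enum):
    """The five note sections in which the study locates hallucinations."""
    PATIENT_INFORMATION = "patient information"
    PATIENT_HISTORY = "patient history"
    SYMPTOMS_DIAGNOSIS_PROCEDURES = "symptoms / diagnosis / surgical procedures"
    MEDICINE_INSTRUCTIONS = "medicine-related instructions"
    FOLLOW_UP = "follow-up"


@dataclass
class HallucinationLabel:
    """A single annotated hallucination in an AI-generated summary (assumed fields)."""
    summary_id: str
    category: HallucinationCategory
    summary_span: str              # the offending text in the summary
    source_evidence: Optional[str]  # supporting text in the clinical note, if any
    is_incorrect: bool             # True = contradicts the note; False = too general
```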

The study found that summaries created by AI models can “generate content that is incorrect or too general according to information in the source clinical notes”, a problem known as faithfulness hallucination. AI hallucinations are a well-documented phenomenon. Google’s use of AI in its search engine has produced some absurd responses, such as recommending “eating one small rock per day” and “adding non-toxic glue to pizza to stop it from sticking”. However, in the case of medical summaries, these hallucinations can undermine the reliability and accuracy of medical records.

The pilot study prompted GPT-4o and Llama-3 to create 500-word summaries of 50 detailed medical notes. The researchers found that GPT-4o produced 21 summaries with incorrect information and 50 summaries with generalised information, while Llama-3 produced 19 and 47, respectively. The researchers noted that Llama-3 tended to report details “as is” in its summaries, whilst GPT-4o made “bold, two-step reasoning statements” that can lead to hallucinations.
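As a rough illustration of this summarisation setup, the sketch below requests a summary of roughly 500 words from GPT-4o via the OpenAI Python client. The prompt wording and the system message are assumptions for illustration; the study’s actual prompts are not reproduced in this article.

```python
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarise_note(clinical_note: str, word_limit: int = 500) -> str:
    """Ask GPT-4o for a summary of at most `word_limit` words.

    The instruction text here is illustrative, not the study's prompt.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a clinical documentation assistant."},
            {"role": "user",
             "content": (
                 f"Summarise the following clinical note in at most "
                 f"{word_limit} words. Use only information stated in the "
                 f"note; do not add or infer details.\n\n{clinical_note}"
             )},
        ],
    )
    return response.choices[0].message.content
```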

The use of AI in healthcare has been increasing in recent years, and GlobalData expects global revenue for AI platforms across healthcare to reach an estimated $18.8bn by 2027. There have also been calls to integrate AI with electronic health records to support clinical decision-making.

GlobalData is the parent company of Clinical Trials Arena.

The UMass Amherst and Mendel study establishes the need for a hallucination detection system to boost the reliability and accuracy of AI-generated summaries. The research found that it took a well-trained clinician an average of 92 minutes to label a single AI-generated summary, which can be expensive. To overcome this, the research team employed Mendel’s Hypercube system to detect hallucinations.

The study also found that, while Hypercube tended to overestimate the number of hallucinations, it detected hallucinations that were otherwise missed by human experts. The research team proposed using the Hypercube system as “an initial hallucination detection step, which can then be integrated with human expert review to enhance overall detection accuracy”.
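That proposed workflow can be pictured as a simple triage loop: an automated detector flags candidate hallucinations first, and only flagged summaries are queued for clinician confirmation. The sketch below assumes a hypothetical detector interface, since Hypercube’s actual API is not public.

```python
from typing import Callable, Iterable


def triage_summaries(
    summaries: Iterable[tuple[str, str]],             # (clinical_note, summary) pairs
    automated_detector: Callable[[str, str], list[str]],  # stand-in for a system like Hypercube
) -> list[dict]:
    """First-pass hallucination screening before human expert review.

    Mirrors the workflow the researchers propose: the automated system runs
    first (and may over-flag), and clinicians confirm or dismiss each flag.
    """
    review_queue = []
    for note, summary in summaries:
        flagged_spans = automated_detector(note, summary)
        if flagged_spans:
            review_queue.append({
                "note": note,
                "summary": summary,
                "flagged_spans": flagged_spans,  # each span awaits clinician review
            })
    return review_queue
```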