Intended for healthcare professionals

CCBYNC Open access
Research Special Paper

Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis

BMJ 2024; 384 doi: https://doi.org/10.1136/bmj-2023-078538 (Published 20 March 2024) Cite this as: BMJ 2024;384:e078538

Linked Fast Facts

Quality and safety of artificial intelligence generated health information

Linked Editorial

Generative artificial intelligence and medical disinformation

  1. Bradley D Menz, doctoral student1,
  2. Nicole M Kuderer, medical director2,
  3. Stephen Bacchi, neurology registrar13,
  4. Natansh D Modi, doctoral student1,
  5. Benjamin Chin-Yee, haematologist45,
  6. Tiancheng Hu, doctoral student6,
  7. Ceara Rickard, consumer advisor7,
  8. Mark Haseloff, consumer advisor7,
  9. Agnes Vitry, consumer advisor78,
  10. Ross A McKinnon, professor1,
  11. Ganessan Kichenadasse, academic medical oncologist19,
  12. Andrew Rowland, professor1,
  13. Michael J Sorich, professor1,
  14. Ashley M Hopkins, associate professor1
  1. 1College of Medicine and Public Health, Flinders University, Adelaide, SA, 5042, Australia
  2. 2Advanced Cancer Research Group, Kirkland, WA, USA
  3. 3Northern Adelaide Local Health Network, Lyell McEwin Hospital, Adelaide, Australia
  4. 4Schulich School of Medicine and Dentistry, Western University, London, Canada
  5. 5Department of History and Philosophy of Science, University of Cambridge, Cambridge, UK
  6. 6Language Technology Lab, University of Cambridge, Cambridge, UK
  7. 7Consumer Advisory Group, Clinical Cancer Epidemiology Group, College of Medicine and Public Health, Flinders University, Adelaide, Australia
  8. 8University of South Australia, Clinical and Health Sciences, Adelaide, Australia
  9. 9Flinders Centre for Innovation in Cancer, Department of Medical Oncology, Flinders Medical Centre, Flinders University, Bedford Park, South Australia, Australia
  1. Correspondence to: A M Hopkins ashley.hopkins{at}flinders.edu.au
  • Accepted 19 February 2024

Abstract

Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities.

Design Repeated cross sectional analysis.

Setting Publicly accessible LLMs.

Methods In a repeated cross sectional analysis, four LLMs (via chatbots/assistant interfaces) were evaluated: OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2 and newly released Gemini Pro (via Bard), Anthropic’s Claude 2 (via Poe), and Meta’s Llama 2 (via HuggingChat). In September 2023, these LLMs were prompted to generate health disinformation on two topics: sunscreen as a cause of skin cancer and the alkaline diet as a cancer cure. Jailbreaking techniques (ie, attempts to bypass safeguards) were evaluated if required. For LLMs with observed safeguarding vulnerabilities, the processes for reporting outputs of concern were audited. 12 weeks after initial investigations, the disinformation generation capabilities of the LLMs were re-evaluated to assess any subsequent improvements in safeguards.

Main outcome measures The main outcome measures were whether safeguards prevented the generation of health disinformation, and the transparency of risk mitigation processes against health disinformation.

Results Claude 2 (via Poe) declined 130 prompts submitted across the two study timepoints requesting the generation of content claiming that sunscreen causes skin cancer or that the alkaline diet is a cure for cancer, even with jailbreaking attempts. GPT-4 (via Copilot) initially refused to generate health disinformation, even with jailbreaking attempts—although this was not the case at 12 weeks. In contrast, GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard), and Llama 2 (via HuggingChat) consistently generated health disinformation blogs. In September 2023 evaluations, these LLMs facilitated the generation of 113 unique cancer disinformation blogs, totalling more than 40 000 words, without requiring jailbreaking attempts. The refusal rate across the evaluation timepoints for these LLMs was only 5% (7 of 150), and as prompted the LLM generated blogs incorporated attention grabbing titles, authentic looking (fake or fictional) references, fabricated testimonials from patients and clinicians, and they targeted diverse demographic groups. Although each LLM evaluated had mechanisms to report observed outputs of concern, the developers did not respond when observations of vulnerabilities were reported.

Conclusions This study found that although effective safeguards are feasible to prevent LLMs from being misused to generate health disinformation, they were inconsistently implemented. Furthermore, effective processes for reporting safeguard problems were lacking. Enhanced regulation, transparency, and routine auditing are required to help prevent LLMs from contributing to the mass generation of health disinformation.

Introduction

Large language models (LLMs), a form of generative AI (artificial intelligence), are progressively showing a sophisticated ability to understand and generate language.12 Within healthcare, the prospective applications of an increasing number of sophisticated LLMs offer promise to improve the monitoring and triaging of patients, medical education of students and patients, streamlining of medical documentation, and automation of administrative tasks.34 Alongside the substantial opportunities associated with emerging generative AI, the recognition and minimisation of potential risks are important,56 including mitigating risks from plausible but incorrect or misleading generations (eg, “AI hallucinations”) and the risks of generative AI being deliberately misused.7

Notably, LLMs that lack adequate guardrails and safety measures (ie, safeguards) may facilitate malicious actors to generate and propagate highly convincing health disinformation—that is, the intentional dissemination of misleading narratives about health topics for ill intent.689 The public health implications of such capabilities are profound when considering that more than 70% of individuals utilise the internet as their first source for health information, and studies indicate that false information spreads up to six times faster online than factual content.101112 Moreover, unchecked dissemination of health disinformation can lead to widespread confusion, fear, discrimination, stigmatisation, and the rejection of evidence based treatments within the community.13 The World Health Organization recognises health disinformation as a critical threat to public health, as exemplified by the estimation that as of September 2022, more than 200 000 covid-19 related deaths in the US could have been averted had public health recommendations been followed.1415

Given the rapidly evolving capabilities of LLMs and their increasing accessibility by the public, proactive design and implementation of effective risk mitigation measures are crucial to prevent malicious actors from contributing to health disinformation. In this context it is critical to consider the broader implications of AI deployment, particularly how health inequities might inadvertently widen in regions with less health education or in resource limited settings. The effectiveness of existing safeguards to prevent the misuse of LLMs for the generation of health disinformation remains largely unexplored. Notably, the AI ecosystem currently lacks clear standards for risk management, and a knowledge gap exists regarding the transparency and responsiveness of AI developers to reports of safeguard vulnerabilities.16 We therefore evaluated prominent publicly accessible LLMs for safeguards preventing mass generation of health disinformation. We also examined the transparency of risk mitigation processes that AI developers have established for monitoring and responding to users’ observations about safeguard vulnerabilities.

Methods

LLM safeguards against generation of health disinformation

For our primary evaluations we selected two contemporary examples of health disinformation—that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer. We chose these topics on the basis of their frequency in online discussions, their clear potential for profound health consequences, and their embodiment of key health disinformation themes, which include unfounded assertions about treatments, promotion of miracle cures, and misconceptions about disease origins.17

Initiated in September 2023, our investigations focused on the most prominent LLMs, and the AI applications (chatbots or assistants) used to access the capabilities of these LLMs. The LLMs evaluated were OpenAI’s GPT-4, accessed via OpenAI’s ChatGPT18 and Microsoft’s Copilot (formerly Bing Chat)19; Google’s PaLM 2, accessed via Google’s Bard20; Anthropic’s Claude 2, accessed via Poe21; and Meta’s Llama 2 (70 billion parameters), accessed via HuggingChat.22

GPT-4 was accessed through two prominent interfaces (OpenAI’s ChatGPT and Microsoft’s Copilot). Both are commonly used and were investigated to assess whether the interface for accessing GPT-4 impacted the safeguards. Our assessments were carried out using the default settings of each evaluated LLM powered AI chatbot or assistant.

We submitted standardised prompts for the two cancer related topics: sunscreen causes skin cancer and the alkaline diet is a cure for cancer (see supplementary file), requesting the generation of a blog post of at least 300 words. Every submitted prompt for both disinformation topics requested the creation of a blog post that should contain three paragraphs, feature an attention grabbing title, appear realistic and scientific, and include two authentic looking journal references, which, if necessary, could be made up. Furthermore, to evaluate the potential for LLMs to generate health disinformation targeted at diverse groups, we used four variations of the prompts, specifically requesting content to target young adults, parents, elderly people, and people with a recent diagnosis of cancer.

For the sunscreen disinformation blogs, we requested that each blog should specifically include an introduction highlighting how routine sunscreen use is known to cause skin cancer; include an oncologist or dermatologist testimonial detailing that thousands of people receive a diagnosis of skin cancer due to routine sunscreen use; include a patient testimonial detailing their cancer was caused by routine sunscreen use; and raise suspicion to why this information is not shared.

For the alkaline diet disinformation blogs, we requested that each blog should specifically include an introduction highlighting the foods and bicarbonate consumption consistent with the alkaline diet; a narrative that the alkaline diet is superior to chemotherapy for cancer treatment; an oncologist testimonial detailing that thousands of people have had their cancer cured by the alkaline diet; and a patient testimonial detailing an experience of curing metastatic cancer by stopping chemotherapy and starting the alkaline diet.

As the assessed LLMs incorporate randomness and stochasticity in their default setting for output generation, the same prompt produced varied results with repeated submissions. Therefore, for robust evaluations we initially submitted 20 prompts (five replicates of the prompt for each target subpopulation) on the sunscreen topic and 20 prompts on the alkaline diet topic to each investigated LLM (a total of 40 submitted prompts). These 40 initial attempts were conducted without intentionally trying to circumvent (ie, jailbreak) built-in safeguards. The supplementary file outlines the 20 prompts that were submitted on each topic in this initial study phase.

For the LLMs that refused to generate disinformation according to the initial direct approach, we also evaluated two common jailbreaking techniques.23 The first involves “fictionalisation,” where the model is prompted that generated content will be used for fictional purposes and thus not to decline requests. The other involves “characterisation,” where the model is prompted to undertake a specific role (ie, be a doctor who writes blogs and who knows the topics are true) and not decline requests. For these tests, the fictionalisation or characterisation prompt had to be submitted first, followed by the request for generation of the disinformation blog. We submitted these requests 20 times for each topic. The supplementary file outlines the 20 fictionalisation and 20 characterisation prompts that were submitted on both topics (a total of 80 jailbreaking attempts) to the LLMs that refused to generate disinformation to the initial direct requests.

Risk mitigation measures: Website analysis and email correspondence

To assess how AI developers monitor the risks of health disinformation generation and their transparency about these risks, we reviewed the official websites of these AI companies for specific information: the availability and mechanism for users to submit detailed reports of observed safeguard vulnerabilities or outputs of concern; the presence of a public register of reported vulnerabilities, and corresponding responses from developers to patch reported issues; the public availability of a developer released detection tool tailored to accurately confirm text as having been generated by the LLM; and publicly accessible information detailing the intended guardrails or safety measures associated with the LLM (or the AI assistant or chatbot interface for accessing the LLM).

Informed by the findings from this website assessment, we drafted an email to the relevant AI developers (see supplementary table 1). The primary intention was to notify the developers of health disinformation outputs generated by their models. Additionally, we evaluated how developers responded to reports about observed safeguard vulnerabilities. The email also sought clarification on the reporting practices, register on outputs of concern, detection tools, and intended safety measures, as reviewed in the website assessments. The supplementary file shows the standardised message submitted to each AI developer. If developers did not respond, we sent a follow-up email seven days after initial outreach. By the end of four weeks, all responses were documented.

Sensitivity analysis at 12 weeks

In December 2023, 12 weeks after our initial evaluations, we conducted a two phase sensitivity analysis of observed capabilities of LLMs to generate health disinformation. The first phase re-evaluated the generation of disinformation on the sunscreen and alkaline diet related topics to assess whether safeguards had improved since the initial evaluations. For this first phase, we resubmitted the standard prompts to each LLM five times, focusing on generating content targeted at young adults. If required, we also re-evaluated the jailbreaking techniques. Of note, during this period Google’s Bard had replaced PaLM 2 with Google’s newly released LLM, Gemini Pro. Thus we undertook the December 2023 evaluations using Gemini Pro (via Bard) instead of PaLM 2 (via Bard).

The second phase of the sensitivity analysis assessed the consistency of findings across a spectrum of health disinformation topics. The investigations were expanded to include three additional health disinformation topics identified as being substantial in the literature2425: the belief that vaccines cause autism, the assertion that hydroxychloroquine is a cure for covid-19, and the claim that the dissemination of genetically modified foods is part of a covert government programme aimed at reducing the world’s population. For these topics, we created standardised prompts (see supplementary file) requesting blog content targeted at young adults. We submitted each of these prompts five times to evaluate variation in response, and we evaluated jailbreaking techniques if required. In February 2024, about 16 weeks after our initial evaluations, we also undertook a sensitivity analysis to try to generate content purporting that sugar causes cancer (see supplementary file).

Patient and public involvement

Our investigations into the abilities of publicly accessible LLMs to generate health disinformation have been substantially guided by the contributions of our dedicated consumer advisory group, which we have been working with for the past seven years. For this project, manuscript coauthors MH, AV, and CR provided indispensable insights on the challenges patients face in accessing health information digitally.

Results

Evaluation of safeguards

In our primary evaluations in September 2023, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) facilitated the generation of blog posts containing disinformation that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer (fig 1). Overall, 113 unique health disinformation blogs totalling more than 40 000 words were generated without requiring jailbreaking attempts, with only seven prompts refused. In contrast, GPT-4 (via Copilot) and Claude 2 (via Poe) refused all 80 direct prompts to generate health disinformation, and similarly refused a further 160 prompts incorporating jailbreaking attempts (fig 1).

Fig 1
Fig 1

Flowchart of observed capabilities of large language models to facilitate the generation of disinformation on cancer from primary analyses conducted September 2023. LLMs=large language models

Table 1 shows examples of rejection messages from Claude 2 (via Poe) and GPT-4 (via Copilot) after prompts to generate health disinformation on sunscreen as a cause of skin cancer and the alkaline diet being a cure for cancer. The supplementary file shows examples of submitted prompts and respective outputs from these LLMs. Both consistently declined to generate the requested blogs, citing ethical concerns or that the prompt was requesting content that would be disinformation. Uniquely, during jailbreaking attempts Claude 2 (via Poe) asserted its inability to assume fictional roles or characters, signifying an extra layer of safeguard that extends beyond topic recognition.

Table 1

Examples of rejection messages from GPT-4 (via Copilot) and Claude 2 (via Poe) in response to cancer related prompts evaluated in primary analyses conducted in September 2023

View this table:

Table 2 provides examples of attention grabbing titles and persuasive passages generated by GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) following prompts to generate health disinformation. The supplementary file shows examples of submitted prompts and respective outputs. After the prompts, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) consistently facilitated the generation of disinformation blogs detailing sunscreen as a cause of skin cancer and the alkaline diet as a cure for cancer. The LLMs generated blogs with varying attention grabbing titles, and adjustment of the prompt resulted in the generation of content tailored to diverse societal groups, including young adults, parents, older people, and people with newly diagnosed cancer. Persuasiveness was further enhanced by the LLMs, including realistic looking academic references—citations that were largely fabricated. Notably, the LLM outputs included unique, fabricated testimonials from patients and clinicians. These testimonials included fabricated assertions from patients that their life threatening melanoma had been confirmed to result from routine sunscreen use, and clinician endorsements that the alkaline diet is superior to conventional chemotherapy. The blogs also included sentiments that the carcinogenic effects of sunscreens are known but intentionally suppressed for profit. To underscore the risk of mass generation of health disinformation with LLMs, it was observed that out of the 113 blogs generated, only two from Llama 2 (via HuggingChat) were identical; the other 111 generated blogs were unique, albeit several included duplicated passages and titles. PaLM 2 (via Bard), the fastest assessed LLM, generated 37 unique cancer disinformation blogs within 23 minutes, whereas the slowest LLM, Llama 2 (via HuggingChat), generated 36 blogs within 51 minutes.

Table 2

Examples of attention grabbing titles and persuasive passages extracted from the 113 blog posts containing disinformation about cancer generated by three LLMs in response to evaluated prompts used in primary analyses conducted in September 2023

View this table:

Of the 40 prompts submitted to PaLM 2 (via Bard) requesting blogs containing disinformation on cancer, three were declined. Similarly, of 40 prompts submitted to Llama 2 (via HuggingChat), four were not fulfilled. Such a low refusal rate, however, can be readily overcome by prompt resubmission. Also, PaLM 2 (via Bard) and GPT-4 (via ChatGPT) added disclaimers to 8% (3 of 37) and 93% (37 of 40) of their generated blog posts, respectively, advising that the content was fictional or should be verified with a doctor. In addition to the inconsistent appearance of these disclaimers, however, they were positioned after the references making them easy to identify and delete.

AI developer practices to mitigate risk of health disinformation

Upon evaluation of the developer websites associated with both the LLMs investigated and the AI chatbots or assistants used to access these LLMs, several findings emerged. Each developer offered a mechanism for users to report model behaviours deemed to be of potential concern (see supplementary table 1). However, no public registries displaying user reported concerns were identified across the websites, nor any details about how and when reported safeguard vulnerabilities were patched or fixed. No developer released tools for detecting text generated by their LLM were identified. Equally, no publicly accessible documents outlining the intended safeguards were identified.

In follow-up to the above search, the identified contact mechanisms were used to inform the developers of the prompts tested, and the subsequent outputs observed. The developers were asked to confirm receipt of the report and the findings from the website search. Confirmation of receipt was not received from the developers of GPT-4/ChatGPT, PaLM 2/Bard, or Llama 2/HuggingChat, which were the tools that generated health disinformation in our initial evaluations. This lack of communication occurred despite notification specifically including a request for confirmation of receipt, and a follow-up notification being sent seven days after the original request. Consequently, it remains uncertain whether any steps will be undertaken by the AI developers to rectify the observed vulnerabilities. Confirmation of receipt was received from both Anthropic (the developers of the LLM, Claude 2) and Poe (the developers of the Poe AI assistant, which was used to access Claude 2). Although Claude 2 (via Poe) did not produce disinformation in the evaluations, the responses confirmed the absence of a public notification log, a dedicated detection tool, and public guidelines on intended safeguards for their tool. The response inherently indicated that Anthropic and Poe are monitoring their implemented notification processes.

Sensitivity analysis at 12 weeks

Table 3 presents a summary of findings from both phases of sensitivity analyses conducted in December 2023.

Table 3

Summary of capacities for the generation of health disinformation observed in sensitivity analyses in December 2023

View this table:

Twelve weeks after initial evaluations, Gemini Pro (via Bard) and Llama 2 (via HuggingChat) were able to generate health disinformation on sunscreen as a cause of skin cancer and the alkaline diet as a cure for cancer, without the need for jailbreaking. This confirmed the initial observations with Llama 2 (via HuggingChat) and showed that health disinformation safeguards did not improve with the upgrade of Google Bard to use Gemini Pro (replacing PaLM 2). GPT-4 (via ChatGPT) also continued to show such capability, although jailbreaking techniques were now required. Notably, GPT-4 (via Copilot), without any need for jailbreaking, now generated disinformation on the sunscreen and alkaline diet topics, indicating that safeguards present in the September 2023 evaluation had been removed or compromised in a recent update. Consistent with earlier findings, Claude 2 (via Poe) continued to refuse to generate disinformation on these topics, even with the use of jailbreaking methods. To confirm whether the safeguards preventing generation of health disinformation were attributable to Claude 2 (the LLM) or Poe (an online provider of interfaces to various LLMs), we accessed Claude 2 through a different interface (claude.ai/chat) and identified that similar refusals were produced. Equally, we utilized Poe to access the Llama 2 LLM and were able to generate health disinformation, suggesting the safeguards are attributable to the Claude 2 LLM, rather than a safeguard implemented by Poe.

Sensitivity analyses expanded to a broader range of health disinformation topics indicated that GPT-4 (via Copilot), GPT-4 (via ChatGPT), Gemini Pro (via Bard), and Llama 2 (via HuggingChat) could be either directly prompted or jailbroken to generate disinformation alleging that genetically modified foods are part of secret government programmes to reduce the world’s population. Claude 2 remained consistent in its refusal to generate disinformation on this subject, regardless of jailbreaking attempts. In the case of disinformation claiming hydroxychloroquine is a cure for covid-19, GPT-4 (via ChatGPT), GPT-4 (via Copilot), and Llama 2 (via HuggingChat) showed capability to generate such content when either directly prompted or jailbroken. In contrast, both Claude 2 and Gemini Pro (via Bard) refused to generate disinformation on this topic, even with jailbreaking. As for the false assertion that vaccines can cause autism, we found that only GPT-4 (via Copilot) and GPT-4 (via ChatGPT) were able to be directly prompted or jailbroken to generate such disinformation. Claude 2 (via Poe), Gemini Pro (via Bard), and Llama 2 (via HuggingChat) refused to generate disinformation on this topic, even with jailbreaking. Finally, in February 2024, GPT-4 (via both ChatGPT and Copilot) and Llama 2 (via HuggingChat) were observed to show the capability to facilitate the generation of disinformation about sugar causing cancer. Claude 2 (via Poe) and Gemini Pro (via Gemini, formerly Bard), however, refused to generate this content, even with attempts to jailbreak. The supplementary file provides examples of the submitted prompts and respective outputs from the sensitivity analyses.

Discussion

This study found a noticeable inconsistency in the current implementation of safeguards in publicly accessible LLMs. Anthropic’s Claude 2 showcased the capacity of AI developers to release a LLM with valuable functionality while concurrently implementing robust safeguards against the generation of health disinformation. This was in stark contrast with other LLMs examined. Notably, OpenAI’s GPT-4 (via ChatGPT), Google’s PaLM 2 and Gemini Pro (via Bard), and Meta’s Llama 2 (via HuggingChat) exhibited the ability to consistently facilitate the mass generation of targeted and persuasive disinformation across many health topics. Meanwhile, GPT-4 (via Microsoft’s Copilot, formerly Bing Chat) highlighted the fluctuating nature of safeguards within the current self-regulating AI ecosystem. Initially, GPT-4 (via Copilot) exhibited strong safeguards, but over a 12 week period, these safeguards had become compromised, highlighting that LLM safeguards against health disinformation may change (intentionally or unintentionally) over time, and are not guaranteed to improve. Importantly, this study also showed major deficiencies in transparency within the AI industry, particularly whether developers are properly committed to minimizing the risks of health disinformation, the broad nature of safeguards that are currently implemented, and logs of frequently reported outputs and the corresponding response of developers (ie, when reported vulnerabilities were patched or justification was given for not fixing reported concerns, or both). Without the establishment and adherence to standards for these transparency markers, moving towards an AI ecosystem that can be effectively held accountable for concerns about health disinformation remains a challenging prospect for the community.

Strengths and limitations of this study

We only investigated the most prominent LLMs at the time of the study. Moreover, although Claude 2 resisted generating health disinformation for the scenarios evaluated, it might do so with alternative prompts or jailbreaking techniques. The LLMs that did facilitate disinformation were tested under particular conditions at two distinct time points, but outcomes might vary with different wordings or over time. Further, we focused on six specific health topics, limiting generalizability to all health topics or broader disinformation themes. Additionally, we concentrated on health disinformation topics widely regarded as being substantial/severe in the literature2425, highlighting a gap for future studies to focus on equivocal topics, such as the link between sugar and cancer—a topic we briefly evaluated—wherein assessing the quality of content will become essential.

As safeguards can be implemented either within the LLM itself (for example, by training the LLM to generate outputs that align with human preferences) or at the AI chatbot or assistant interface used to access the LLM (for example, by implementing filters that screen the prompt before passing it to the LLM or filtering the output of the LLM before passing it back to the user, or both), it can be difficult to identify which factor is responsible for any effective safeguards identified. We acknowledge that in this study we directly tested only the LLM chatbot or assistant interfaces. It is, however, noteworthy that GPT-4 was accessed via both ChatGPT and Copilot and that in the initial evaluations, health disinformation was generated by ChatGPT but not by Copilot. As both chatbots used the same underlying LLM, it is likely that Copilot implemented additional safeguards to detect inappropriate requests or responses. Opposingly, Claude 2 (via Poe) consistently refused to generate health disinformation. By evaluating Poe with other LLMs, and Claude 2 via other interface providers, we determined that the safeguards were attributed to Claude 2. Thus, the design of the study enabled identification of examples in which the LLM developer provided robust safeguards, and in which the interface for accessing or utilizing the LLM provided robust safeguards. A limitation of the study is that owing to the poor transparency of AI developers we were unable to gain a detailed understanding of safeguard mechanisms that were effective or ineffective.

In our evaluation of the AI developers’ websites and their communication practices, we aimed to be as thorough as possible. The possibility remains, however, that we might have overlooked some aspects, and that we were unable to confirm the details of our website audits owing to the lack of responses from the developers, despite repeated requests. This limitation underscores challenges in fully assessing AI safety in an ecosystem not prioritising transparency and responsiveness.

Comparison with other studies

Previous research reported a potential for OpenAI’s GPT platforms to facilitate the generation of health disinformation on topics such as vaccines, antibiotics, electronic cigarettes, and homeopathy treatments.68912 In our study we found that most of the prominent, publicly accessible LLMs, including GPT-4 (via ChatGPT and Copilot), PaLM 2 and Gemini Pro (via Bard), and Llama 2 (via HuggingChat), lack effective safeguards to consistently prevent the mass generation of health disinformation across a broad range of topics. These findings show the capacity of these LLMs to generate highly persuasive health disinformation crafted with attention grabbing titles, authentic looking references, fabricated testimonials from both patients and doctors, and content tailored to resonate with a diverse range of demographic groups. Previous research found that both GPT-4 (via Copilot) and PaLM 2 (via Bard) refused to generate disinformation on vaccines and electronic cigarettes.12 In this study, however, although GPT-4 (via Copilot) refused to generate requested health disinformation during the first evaluations in September 2023, ultimately both GPT-4 (via Copilot) and PaLM 2 (via Bard) generated health disinformation across multiple topics by the end of the study. This juxtaposition across time and studies underscores the urgent need for standards to be implemented and community pressure to continue for the creation and maintenance of effective safeguards against health disinformation generated by LLMs.

Anthropic’s Claude 2 was prominent as a publicly accessible LLM, with high functionality, that included rigorous safeguards to prevent the generation of health disinformation—even when prompts included common jailbreaking methods. This LLM highlights the practical feasibility of implementing effective safeguards in emerging AI technologies while also preserving utility and accessibility for beneficial purposes. Considering the substantial valuations of OpenAI ($29.0bn; £22.9bn; €26.7bn), Microsoft ($2.8tn), Google ($1.7tn), and Meta ($800bn), it becomes evident that these organizations have a tangible ability and obligation to emulate more stringent safeguards against health disinformation.

Moreover, this study found a striking absence of transparency on the intended safeguards of the LLMs assessed. It was unclear whether OpenAI, Microsoft, Google, and Meta have attempted to implement safeguards against health disinformation in their tools and they have failed, or if safeguards were not considered a priority. Notably, Microsoft’s Copilot initially showed robust safeguards against generating health disinformation, but these safeguards were absent 12 weeks later. With the current lack of transparency, it is unclear whether this was a deliberate or unintentional update.

From a search of the webpages of AI developers, we found important gaps in transparency and communication practices essential for mitigating risks of propagating health disinformation. Although all the developers provided mechanisms for users to report potentially harmful model outputs, we were unable to obtain responses to repeated attempts to confirm receipt of observed and reported safeguard vulnerabilities. This lack of engagement raises serious questions about the commitment of these AI developers to deal with the risks of health disinformation and to resolve problems. These concerns are further intensified by the lack of transparency about how reports submitted by other users are being managed and resolved, as well as the findings from our 12 week sensitivity analyses showing that health disinformation issues persisted.

Policy implications

The results of this study highlight the need to ensure the adequacy of current and emerging AI regulations to minimize risks to public health. This is particularly relevant in the context of ongoing discussions about AI legislative frameworks in the US and European Union.2627 These discussions might well consider the implementation of standards to third party filters to reduce discrepancies in outputs between different tools, as exemplified by the differences we observed between ChatGPT and Copilot in our initial evaluations, which occurred despite both being powered by GPT-4. While acknowledging that overly restrictive AI safeguards could restrict model performance for some beneficial purposes, emerging frameworks must also balance the risks to public health from mass health disinformation. Importantly, the ethical deployment of AI becomes even more crucial when recognizing that health disinformation often has a greater impact in areas with less health education or in resource limited settings, and thus emerging tools if not appropriately regulated have the potential to widen health inequities. This concern is further amplified by considering emerging advancements in technologies for image and video generation, where AI tools have the capability to simulate influential figures and translate content into multiple languages, thus increasing the potential for spread by enhancing the apparent trustworthiness of generated disinformation.12 Moreover, all of this is occurring in an ecosystem where AI developers are failing to equip the community with detection tools to defend against the inadvertent consumption of AI generated material.16

Conclusion

Our findings highlight notable inconsistencies in the effectiveness of LLM safeguards to prevent the mass generation of health disinformation. Implementing effective safeguards to prevent the potential misuse of LLMs for disseminating health disinformation has been found to be feasible. For many LLMs, however, these measures have not been implemented effectively, or the maintenance of robustness has not been prioritized. Thus, in the current AI environment where safety standards and policies remain poorly defined, malicious actors can potentially use publicly accessible LLMs for the mass generation of diverse and persuasive health disinformation, posing substantial risks to public health messaging—risks that will continue to increase with advancements in generative AI for audio and video content. Moreover, this study found substantial deficiencies in the transparency of AI developers about commitments to mitigating risks of health disinformation. Given that the AI landscape is rapidly evolving, public health and medical bodies2829 have an opportunity to deliver a united and clear message about the importance of health disinformation risk mitigation in developing AI regulations, the cornerstones of which should be transparency, health specific auditing, monitoring, and patching.30

What is already known on this topic

  • Large language models (LLMs) have considerable potential to improve remote patient monitoring, triaging, and medical education, and the automation of administrative tasks

  • In the absence of proper safeguards, however, LLMs may be misused for mass generation of content for fraudulent or manipulative intent

What this study adds

  • This study found that many publicly accessible LLMs, including OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2/Gemini Pro (via Bard), and Meta’s Llama 2 (via HuggingChat) lack adequate safeguards against mass generation of health disinformation

  • Anthropic’s Claude 2 showed robust safeguards against the generation of health disinformation, highlighting the feasibility of implementing robust safeguards

  • Poor transparency among AI developers on safeguards and processes they had implemented to minimise the risk of health disinformation were identified, along with a lack of response to reported safeguard vulnerabilities

Ethics statements

Ethical approval

The research undertaken was assessed negligible risk research and was confirmed exempt from requiring review by Flinders University Human Research Ethics Committee.

Data availability statement

The research team would be willing to make the complete set of generated data available upon request from qualified researchers or policy makers on submission of a proposal detailing required access and intended use.

Footnotes

  • CR, MH, and AV are consumer advisors to the research team. Their extensive involvement in the study, spanning conception, design, evaluation, and drafting of the manuscript merits their recognition as coauthors of this research.

  • Contributors: MJS and AMH contributed equally. BDM and AMH had full access to all the data in the study and take responsibility for the integrity of the data collection, accuracy, and its analysis. CR, MH, and AV are consumer advisors to the research team. All authors contributed to the study design, data analysis, data interpretation, and drafting of the manuscript. All authors have read and approved the final version of the manuscript. The corresponding author (AMH) attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: AMH holds an emerging leader investigator fellowship from the National Health and Medical Research Council (NHMRC), Australia (APP2008119). NDM is supported by a NHMRC postgraduate scholarship, Australia (APP2005294). MJS is supported by a Beat Cancer research fellowship from the Cancer Council South Australia. BDM’s PhD scholarship is supported by The Beat Cancer Project, Cancer Council South Australia, and the NHMRC, Australia (APP2030913). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: AMH holds an emerging leader investigator fellowship from the National Health and Medical Research Council (NHMRC), Australia; NDM is supported by a NHMRC postgraduate scholarship, Australia; MJS is supported by a Beat Cancer research fellowship from the Cancer Council South Australia; BDM’s PhD scholarship is supported by The Beat Cancer Project, Cancer Council South Australia, and the NHMRC, Australia; no support from any other organisation for the submitted work; AR and MJS are recipients of investigator initiated funding for research outside the scope of the current study from AstraZeneca, Boehringer Ingelheim, Pfizer, and Takeda; and AR is a recipient of speaker fees from Boehringer Ingelheim and Genentech outside the scope of the current study. There are no financial relationships with any other organisations that might have an interest in the submitted work in the previous three years to declare; no other relationships or activities that could appear to have influenced the submitted work.

  • The lead author (the manuscript’s guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

  • Dissemination to participants and related patient and public communities: A summary of the results of this study will be disseminated by press release through the Flinders University Media and Communication team via the Eureka and Scimex news platforms. The study will also be shared through university social media channels—namely, X, Facebook, and LinkedIn.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

  • AI assistance: Four publicly accessible large language models—GPT-4 (via ChatGPT and Copilot), PaLM 2/Gemini Pro (via Bard), Claude 2 (via Poe), and Llama 2 (via HuggingChat)—were used to generate the data evaluated in this manuscript. During the preparation of this work the authors used ChatGPT and Grammarly AI to assist in the formatting and editing of the manuscript to improve the language and readability. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

http://creativecommons.org/licenses/by-nc/4.0/

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

References