Article Open Access

The application and challenges of ChatGPT in laboratory medicine

  • Zhili Niu, Xiandong Kuang, Juanjuan Chen, Xin Cai and Pingan Zhang
Published/Copyright: August 15, 2025

Abstract

In recent years, with the rapid development of artificial intelligence technology, chatbots have demonstrated significant potential in the medical field, particularly in medical laboratories. This study systematically analyzes the advantages and challenges of chatbots in this field and examines their potential applications in disease diagnosis. However, the reliability and scientific rigor of chatbot outputs are influenced by various factors, including data quality, model bias, privacy protection, and user feedback requirements. To ensure the accuracy and reliability of generated content, it is essential not only to rely on legal frameworks such as the EU AI Act for necessary protection but also to employ two assessment tools, METRICS and CLEAR. These tools are designed to comprehensively evaluate the quality of AI-generated health information, thereby providing a solid theoretical foundation and support for clinical practice.

Introduction

Artificial intelligence (AI) was first conceptualized in 1950, when Alan Turing’s seminal question, “Can machines think?”, laid the theoretical groundwork for the field [1]. The term “artificial intelligence” was officially coined in 1956 by John McCarthy and colleagues at the Dartmouth Conference, marking the birth of AI as an independent research discipline [1]. AI spans a broad spectrum, from artificial general intelligence (AGI), which seeks to replicate human intelligence, to artificial narrow intelligence (ANI), which focuses on specific tasks. Initially, AI relied on rule-based algorithms, such as “if-then” rules, and these non-adaptive systems are still in use in many clinical settings today. However, technological advancements have shifted AI applications toward more sophisticated methods, including machine learning (ML) and deep learning (DL) algorithms, significantly enhancing data analysis, image recognition, and disease prediction in medical diagnostics.

With clinical laboratories gradually achieving automation, the massive and complex test results generated daily exceed the processing capacity of a single human brain, making the application of AI increasingly important. AI can deeply analyze high-quality structured data in clinical laboratories, driving the transformation of diagnostic methods [2]. Generative AI foundational models can simplify workflows, improve efficiency, and reduce healthcare costs; enhance personalized medicine by improving the precision of disease risk and outcome predictions; and improve public health literacy by empowering patients with easily understandable health information for better self-management [3].

In recent years, the development of large language models (LLMs), such as generative pre-trained transformers (GPT), BERT, and the Pathways Language Model (PaLM), has had a profound impact on various fields, including text generation, machine translation, and creative content design. LLMs have revolutionized patient-physician interactions by simulating human-like conversations, significantly altering traditional communication patterns in healthcare [4]. Although AI chatbots have the potential to enhance medical education, promptly address routine questions about laboratory tests, and assist in interpreting test results, their applications remain in their early stages. The associated risks and challenges demand further exploration [5], [6]. Patients often seek information on the internet when timely explanations of laboratory test results are not provided, potentially leading to misinformation and health risks [7], [8], [9]. Furthermore, the performance of different chatbot models varies significantly, influenced by factors such as model configurations, prompt diversity, and evaluation methodologies [10], [11], [12]. AI chatbots may generate seemingly credible but actually inaccurate information, necessitating a comprehensive understanding of their strengths and limitations by clinicians and laboratory personnel to effectively utilize and manage them [13], [14].

The purpose of this review is to summarize the advantages and limitations of chatbots in medical laboratories and to explore the challenges and future research directions in clinical laboratory settings. It aims to provide valuable references for clinical laboratory staff applying chatbots, thereby promoting the further application of artificial intelligence technology in the medical field.

Demands and current status of artificial intelligence

Laboratory medicine reports play a crucial role in clinical decision-making, significantly influencing diagnosis and treatment choices while also playing a vital role in patient health management. However, the complexity of these reports often confuses non-experts, leading many patients to seek explanations from chatbots when facing health issues. Research indicates that 78 % of ChatGPT users tend to utilize it for self-diagnosis, underscoring the significant demand for reliable sources of health information [15]. According to studies by Giardina et al., 46 % of patients resort to web searches to understand their test results, yet this behavior also reflects potential challenges they face in assessing the severity of these results [5]. The absence of timely and clear diagnostic information can exacerbate patient anxiety and misunderstanding. Furthermore, research by Kopanitsa demonstrates that patients receiving automatically generated explanations are significantly more likely to follow up (71 %) compared to those who only receive test results (49 %). Patients generally appreciate the timeliness of these explanations, highlighting how innovative communication methods can greatly enhance patient experiences and health outcomes [16].

Significant hierarchical characteristics are observed in the perception and acceptance of AI among clinical staff [17], [18], [19]. A majority of respondents support AI as a supplementary tool, particularly valuing its role in data analysis. Young practitioners (<45 years old) exhibit relatively lower self-assessed AI knowledge levels compared to their peers, yet they demonstrate a significantly higher inclination for continuous learning. Conversely, educational background plays a pivotal role in anxiety levels, with postgraduates expressing reduced concerns about AI-related errors due to stronger technical comprehension capabilities. Notably, frontline laboratory staff exhibit a profound worry about job displacement (32 %), higher than the 25 % reported by those in managerial roles. Multimodal AI models can mitigate the limitations of AI literacy among laboratory staff by integrating diverse data types (e.g., cell morphology images, cellular biomarkers, and biochemical parameters) into a unified analytical framework [20]. A multimodal system could combine convolutional neural networks (CNNs) to extract spatial features from blood smears, graph-based models to analyze cellular interaction networks, and regression algorithms to interpret biochemical trends [21], [22], as sketched in the example below. This integration reduces reliance on manual data synthesis, automating complex tasks such as identifying dysplastic cells in leukemia while correlating morphological anomalies with biochemical imbalances (e.g., elevated lactate dehydrogenase). To enhance accessibility, these models can be embedded into user-friendly platforms with intuitive interfaces – such as drag-and-drop image uploads and automated report generation – that abstract technical complexities. Additionally, incorporating explainable AI (XAI) components, like heatmaps highlighting critical cell features or natural language summaries of biochemical correlations, empowers laboratory personnel to validate outputs without requiring deep technical expertise. By streamlining workflows and providing contextualized insights, multimodal models bridge the gap between advanced AI capabilities and practical laboratory operations, fostering trust and adoption even among users with limited AI training.
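The following minimal sketch illustrates the fusion idea described above under stated assumptions: a small PyTorch model with illustrative layer sizes and synthetic stand-in data that concatenates a CNN embedding of a blood-smear patch with a vector of biochemical parameters before classification. It is not drawn from the cited studies and omits the graph-based component for brevity.

import torch
import torch.nn as nn

class MultimodalLabModel(nn.Module):
    """Toy fusion model: CNN image branch plus tabular biochemical branch."""
    def __init__(self, n_biochem_features: int = 8, n_classes: int = 2):
        super().__init__()
        # CNN branch for single-channel 64x64 smear patches (assumed input size).
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
        )
        # MLP branch for biochemical parameters (e.g., LDH, bilirubin, ALT).
        self.tabular_branch = nn.Sequential(nn.Linear(n_biochem_features, 32), nn.ReLU())
        # Fusion head: concatenate both embeddings and classify.
        self.classifier = nn.Linear(64 + 32, n_classes)

    def forward(self, image: torch.Tensor, biochem: torch.Tensor) -> torch.Tensor:
        return self.classifier(
            torch.cat([self.image_branch(image), self.tabular_branch(biochem)], dim=1)
        )

# Stand-in data: 4 smear patches and 8 biochemical values per sample.
model = MultimodalLabModel()
logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 2])

In practice, the image branch would be a pretrained morphology network and the class labels would come from annotated laboratory cases; the point here is only that heterogeneous inputs can share one decision head.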

Currently, AI knowledge reserves within laboratory teams are alarmingly inadequate, with only 10.8 % of teams demonstrating a robust command of AI, and a notably lower proportion of data scientists in tertiary hospitals compared to private medical institutions. This stark contrast is further highlighted by the 89.7 % of respondents who underscore an urgent demand for AI training. However, 47.2 % of laboratories face significant challenges in effectively implementing AI technologies, primarily due to inadequate IT support teams [23].

Discrepancies among stakeholders regarding the allocation of liability for AI-related medical errors are significant [24], [25], [26]. Survey findings reveal that 66.7 % of respondents advocate for shared responsibility between users and manufacturers, while some argue that manufacturers should bear sole accountability for products exhibiting design flaws. Clinicians lean towards assigning joint liability to manufacturers and healthcare institutions, asserting that manufacturers should assume primary responsibility if AI tools have undergone rigorous validation and comply with applicable standards [26], [27].

Integration of HL7-FHIR/LIS interfaces and privacy-security considerations

The practical deployment of AI chatbots within laboratory settings requires robust integration with existing infrastructure, particularly through standardized interfaces such as Health Level 7 Fast Healthcare Interoperability Resources (HL7-FHIR) and Laboratory Information Systems (LIS) [28], [29]. HL7-FHIR enables interoperable data exchange between AI models and laboratory databases, ensuring real-time access to structured test results, patient histories, and diagnostic criteria [28], [29], [30], [31]. AI-driven interpretation of biochemical profiles could be enhanced by bidirectional communication with LIS, allowing automated flagging of critical values or contextual analysis of longitudinal data [31]. However, such integration demands robust cybersecurity protocols to safeguard sensitive health data. Basic privacy and security requirements include end-to-end encryption of data transmissions, role-based access controls compliant with regulations like the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), and regular audits to detect vulnerabilities.
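As a concrete illustration of such an interface, the sketch below queries a hypothetical HL7-FHIR endpoint for a patient's serum potassium observations (LOINC 2823-3) and flags values outside an assumed critical range. The base URL, patient identifier, and thresholds are placeholders, and a real deployment would add SMART-on-FHIR/OAuth2 authorization, TLS, and audit logging to satisfy the GDPR/HIPAA requirements noted above.

import requests

FHIR_BASE = "https://fhir.example-hospital.org"   # placeholder endpoint
CRITICAL_LOW, CRITICAL_HIGH = 2.5, 6.0            # assumed mmol/L thresholds

def flag_critical_potassium(patient_id: str) -> list[dict]:
    # Standard FHIR Observation search: filter by patient and LOINC code.
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": "http://loinc.org|2823-3"},
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    resp.raise_for_status()
    flags = []
    for entry in resp.json().get("entry", []):
        obs = entry["resource"]
        qty = obs.get("valueQuantity", {})
        value = qty.get("value")
        if value is not None and not (CRITICAL_LOW <= value <= CRITICAL_HIGH):
            flags.append({
                "observation_id": obs.get("id"),
                "value": value,
                "unit": qty.get("unit"),
                "issued": obs.get("issued"),
            })
    return flags

# Example call with a placeholder patient identifier:
# print(flag_critical_potassium("example-patient-123"))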

The application of artificial intelligence chatbots in laboratory medicine

Currently, AI has achieved extensive application in areas such as radiology for image recognition, yet its utilization in laboratory medicine remains in its early stages. This is primarily because interpreting laboratory reports involves a vast number of complex quantitative and qualitative variables, such as symptoms, medical history, and test results [32]. These factors impose higher demands on the complexity and accuracy of AI models. Since the release of ChatGPT in 2022, it has garnered significant attention in the medical field. Relevant studies have investigated its performance in medical licensure exams, practicality in addressing patient inquiries, and capability in assisting doctors with clinical problem-solving, thereby showcasing the application potential of AI in laboratory medicine [33], [34], [35].

Capabilities of ChatGPT in addressing clinical laboratory inquiries

Despite not undergoing specialized training on medical data, ChatGPT has demonstrated initial practicality in the medical field. A study by Munoz-Zuluaga et al. involved submitting 65 questions covering multiple topics to ChatGPT to assess its capabilities and accuracy in answering laboratory medicine-related questions [36]. The results showed that ChatGPT correctly answered 50.7 % of the questions, provided incomplete or partially correct responses for 23.1 %, delivered incorrect or misleading information for 16.9 %, and produced irrelevant answers for 9.3 %. Notably, correct answers were more concentrated in questions related to foundational medical or technical knowledge (59.1 %), while errors were more common in questions pertaining to laboratory procedures or regulations (31 %). Although GPT-4 shows significant improvements in accuracy, limitations in certain domains remain evident. A study by Girton et al. evaluated ChatGPT’s ability to address 49 real patient clinical laboratory questions on social media platforms such as Reddit and Quora, and compared these responses to those provided by medical students [37]. Reviewers tended to prefer ChatGPT’s answers over those from medical professionals. This highlights ChatGPT’s progress in handling complex clinical questions, while simultaneously revealing gaps in medical professional education in this area. Approximately half of the medical students’ answers were rated as “poor,” compared to less than 10 % of ChatGPT’s responses receiving similar ratings [37]. The limited emphasis on clinical laboratory knowledge in medical education may contribute to even experienced medical professionals struggling to provide high-quality answers to real-world questions.

Another study compared responses from three chatbots (ChatGPT, Gemini, and Le Chat) with those from certified physicians in online consultations [38]. Overall, chatbots outperformed online medical professionals in interpreting laboratory results. While online doctors ranked first in 60 % of cases, Gemini received lower scores in only 39 % of cases. ChatGPT demonstrated comparable quality and accuracy to human practitioners. However, chatbots showed higher tendencies for overestimating clinical severity (22–33 % incidence) compared to physicians’ 1.0 % overestimation rate. This highlights chatbots’ challenges in processing complex contextual information and interpreting laboratory data, with the lack of unified reference standards potentially leading to inconsistent interpretations of patient test data [38].

Implementation of chatbots in medical laboratory diagnostics

Recent advancements in artificial intelligence have garnered significant research interest regarding biochemical data interpretation. Kaftan et al. demonstrated varying accuracy levels among AI models (ChatGPT-3.5, Copilot, and Gemini) in biochemical data analysis [39]. Copilot exhibited superior performance across all evaluation metrics, achieving statistically significant advantages in biochemical parameter interpretation compared to the other models. This suggests that Copilot’s enhanced data processing capabilities may stem from its sophisticated algorithmic architecture. While the other models demonstrated suboptimal performance relative to Copilot, their domain-specific competencies retain practical relevance for targeted clinical applications [39].

The microbiology domain has emerged as a strategic focus for artificial intelligence (AI) integration due to its demanding requirements for diagnostic precision and rapid data processing. AI demonstrates transformative potential in enhancing clinical decision-making efficiency and optimizing infectious disease management, thereby improving patient outcomes. However, AI systems require further refinement in handling complex clinical scenarios and generating contextually appropriate responses to ensure clinical relevance [40]. Notably, ChatGPT-4.0 exhibits significant limitations when addressing antimicrobial susceptibility testing (AST)-related inquiries [41], [42]. The model inappropriately recommended clindamycin for Enterococcus infections and failed to incorporate standard methodologies for polymyxin susceptibility assessment. Such erroneous guidance could misguide clinical practice, resulting in ineffective therapeutic regimens and potential patient harm.

Li et al. conducted a systematic evaluation of two LLMs, ChatGPT and Google Gemini, in addressing HBV-related clinical inquiries [43]. ChatGPT-4.0 demonstrated superior overall performance, achieving 80.8 % accuracy on evidence-based questions compared with Google Gemini (73.1 %). Domain-specific analysis revealed ChatGPT-4.0’s enhanced diagnostic utility in HBV serological interpretation, while Google Gemini provided more comprehensive descriptions of clinical manifestations. Both models exhibited critical limitations in conveying HBV prevention strategies, particularly regarding the currency of vaccine information. Notably, only Google Gemini referenced the current consensus recommending neonatal hepatitis B vaccination within 12 h of birth. Furthermore, despite accuracy variations, both ChatGPT and Gemini responses exceeded the recommended 8th-grade readability level (Flesch-Kincaid grade 10.2–11.4), potentially compromising patient comprehension. These findings underscore the necessity for clinician verification of LLM outputs to ensure both interpretability and clinical validity. This comparative study provides crucial insights into model-specific strengths and limitations for evidence-based clinical implementation.
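For readers unfamiliar with the readability metric cited above, the sketch below applies the standard Flesch-Kincaid grade-level formula to a sample answer. The syllable counter is a crude vowel-group heuristic assumed here for brevity, so scores are approximate; dedicated readability libraries are preferable in practice.

import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables; at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

answer = ("Hepatitis B surface antigen positivity indicates current infection. "
          "Please consult your physician for confirmatory testing.")
print(round(flesch_kincaid_grade(answer), 1))  # well above an 8th-grade level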

Multidimensional challenges of healthcare chatbots in clinical practice

Although chatbots are designed and positioned as conversational tools rather than medical advisors or decision-support systems, they have demonstrated impressive capabilities in detecting and interpreting laboratory anomalies. This ability is particularly evident in their efficient processing and analysis of vast datasets, especially with advanced large language models like ChatGPT. However, these potential capabilities are accompanied by significant limitations, partly because these models have not been specifically trained or optimized for laboratory medicine reports.

Operational challenges of ChatGPT in clinical laboratory interpretation

The European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group evaluation reveals significant limitations in ChatGPT’s clinical utility despite its ability to generate “broadly correct and safety-relevant” laboratory result analyses [44]. Notably, ChatGPT fails to identify subclinical disease markers within normal reference ranges – for instance, it overlooks that elevated GGT may not conclusively indicate hepatic injury, and normal leukocyte subpopulation distributions don’t guarantee immune system integrity. This parameter-isolated analysis risks missing early pathology indicators. Critical deficiencies emerge in pre-analytical factor recognition. While ChatGPT appropriately flags diabetes risk with elevated glucose and HbA1c, it neglects crucial pre-collection variables (e.g., fasting status) when interpreting discordant results (elevated glucose with normal HbA1c). Moreover, ChatGPT demonstrates inadequate synthesis of multi-analyte profiles – interpretations of liver function markers (ALT, AST, bilirubin, GGT) lack essential clinical context integration required for comprehensive diagnostic reasoning [44]. Furthermore, ChatGPT fails to thoroughly analyze the health risks associated with test results. For example, in cases of severe anemia or lipid profile abnormalities, it merely advises patients to consult a doctor but does not clearly inform them of the potential clinical severity of these conditions, which could pose significant threats to patients’ health. Lastly, ChatGPT is unable to effectively distinguish between outliers and critical clinical threshold values. This limitation could potentially lead to medical errors in critical clinical scenarios. The inability to differentiate between abnormal values and emergency-level clinical values is particularly concerning, as such distinctions are crucial for timely and appropriate medical interventions.
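By way of contrast with such parameter-isolated interpretation, a conventional rule-based check can make pre-analytical context explicit. The minimal sketch below uses assumed cut-offs and hypothetical wording rather than any validated decision rule: it refuses to interpret a discordant elevated glucose with a normal HbA1c until fasting status is known.

from typing import Optional

def interpret_glucose_hba1c(glucose_mmol_l: float, hba1c_percent: float,
                            fasting: Optional[bool]) -> str:
    # Assumed cut-offs for illustration only (not clinical guidance).
    GLUCOSE_FASTING_LIMIT = 7.0   # fasting plasma glucose, mmol/L
    HBA1C_LIMIT = 6.5             # HbA1c, %
    if glucose_mmol_l >= GLUCOSE_FASTING_LIMIT and hba1c_percent >= HBA1C_LIMIT:
        return "Pattern consistent with diabetes; clinical correlation required."
    if glucose_mmol_l >= GLUCOSE_FASTING_LIMIT and hba1c_percent < HBA1C_LIMIT:
        if fasting is None:
            return "Discordant result: confirm fasting status before interpretation."
        if not fasting:
            return "Likely post-prandial sample; repeat as a fasting glucose."
        return "Possible recent-onset hyperglycemia; suggest confirmatory testing."
    return "No hyperglycemia pattern detected by this rule set."

print(interpret_glucose_hba1c(8.2, 5.4, fasting=None))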

Diversity and variability of chatbots

In the current environment of diverse chatbot ecosystems, the variability among models and the challenges they pose in cross-study comparisons have become particularly significant. Generative AI models vary greatly in architecture, settings, and objectives, resulting in substantial differences in their performance and quality when generating content. Their architectural design and capabilities directly influence the accuracy and effectiveness of their outputs. Notably, there are significant differences in performance across different generative AI models for specific tasks. For instance, Bing has demonstrated the highest accuracy and specificity in predicting drug interactions, significantly outperforming Bard and ChatGPT-4, highlighting that certain models may have inherent advantages in specific application scenarios [45].

Furthermore, variability within the same model cannot be overlooked. The same model may produce inconsistent results under different configurations, input data, or generation strategies, especially in the public health domain, where such variability could have significant implications for decision-making [45], [46]. Differences in model performance not only affect the effectiveness of information dissemination but also directly influence user experience and satisfaction. For example, Bard has been found to provide the most easily understandable information regarding rhinoplasty, followed by ChatGPT and Bing [47]. Therefore, researchers and users should exercise careful selection when applying generative AI models to ensure their alignment with specific tasks and user needs.

Risks associated with updates to chatbot versions

While continuous updates to chatbots aim to enhance their capabilities and performance, they also pose challenges in terms of data consistency, data quality, and reliability. Model updates may introduce new knowledge while risking reference to outdated data, particularly in rapidly evolving fields such as medicine. For example, the data used to train ChatGPT did not include the latest blood lead reference value of 3.5 μg/dL from the CDC for children. In contrast, CopyAI was able to provide accurate blood lead reference values, indicating that differences in training data updates across chatbots may lead to inconsistent judgment results [48]. Furthermore, updates do not always result in improved performance, as accuracy in certain tasks may decline. A study conducted by Stanford University and the University of California, Berkeley, found that while GPT-4 can provide more comprehensive information in certain areas, its accuracy in tasks such as mathematics, coding, and visual reasoning declined over time [49]. This phenomenon underscores the importance of continuously evaluating the performance of LLMs, particularly in fields like laboratory medicine where accuracy is critical.

Challenges of consistency in content generation by generative AI models

The consistency of chatbot output in content generation deserves close attention, particularly in specialized fields such as medicine. Research indicates that minor differences in the wording of prompts or background information can lead to significant variations in generated content, potentially affecting reliability in practical applications [11], [50], [51]. A study by Kochanek and colleagues demonstrated that the consistency of responses from GPT-4.0 when answering the same questions over four days was 85–88 %, highlighting the uncertainty inherent in content generation [52]. In the medical field, where high consistency in output is critical, variability in answers could pose notable risks. Further evaluation of the reproducibility and accuracy of ChatGPT in laboratory medicine is essential to ensure its reliability as a reference in medical applications [37].
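One simple way to quantify this kind of repeat-run variability is to re-ask the same question set and compute the fraction of identically answered items, as in the sketch below. The answer strings are hypothetical and merely illustrate the calculation, not the cited study's protocol.

from itertools import combinations

# Hypothetical answers to the same 10 multiple-choice questions on four days.
runs = {
    "day1": "ABCADBCADB",
    "day2": "ABCADBCADC",
    "day3": "ABCABBCADB",
    "day4": "ABCADBCADB",
}

def pairwise_consistency(a: str, b: str) -> float:
    # Fraction of items answered identically in two runs.
    return sum(x == y for x, y in zip(a, b)) / len(a)

scores = [pairwise_consistency(runs[i], runs[j])
          for i, j in combinations(runs, 2)]
print(f"mean pairwise consistency: {sum(scores) / len(scores):.0%}")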

Moreover, generative AI models can be influenced by cultural and socio-cultural biases inherent in their training data, leading to inconsistencies in generated content across different cultural or linguistic contexts. For instance, Wang and colleagues found that ChatGPT performed significantly better in English than in Chinese on the Taiwanese pharmacist licensing examination [53]. Alfertshofer and colleagues further observed that ChatGPT exhibited substantial differences in performance across six countries’ medical licensing examinations, highlighting the impact of national and linguistic factors [54]. These findings underscore the need for deeper investigation and optimization of generative AI models in multilingual and cross-cultural settings to enhance their adaptability and accuracy in diverse scenarios.

Challenges in the quality of health information dissemination

Chatbots are playing an increasingly important role in the dissemination of health information, but the clarity, conciseness, and relevance of their responses require careful attention [37]. Clarity demands that the generated content is easy to understand, avoiding the use of complex medical jargon so that the general public can grasp the key points effortlessly. Simultaneously, conciseness requires straightforward communication without unnecessary length, enhancing the appropriateness of health information and aiding the public in improving their health literacy, thereby enabling them to make informed health decisions. Additionally, the relevance of AI-generated content is crucial. Accurate and relevant information helps prevent misunderstandings, particularly in scenarios where patients cannot access face-to-face explanations from doctors, avoiding unnecessary anxiety or health risks caused by misinformation [55]. Prioritizing the relevance of information prevents information overload, ensuring the clear delivery of critical health messages. Information unrelated to health inquiries may disrupt people’s understanding of their health conditions, increasing their confusion when dealing with personal health issues.

However, the dissemination of health information via AI must balance the differing needs of patients and medical professionals. Patients typically prefer short, direct answers to quickly comprehend key information, while medical professionals tend to favor detailed and comprehensive responses. This divergence of needs creates a double-edged sword: while abundant information aids professionals in thorough analysis, it may overwhelm ordinary users, leading to comprehension difficulties and anxiety. Regardless of the presentation method, ensuring the completeness of information is paramount. The absence of necessary details may lead to misdiagnosis or inaccurate self-diagnosis, exacerbating health risks. For instance, patients with limited health literacy may struggle to interpret laboratory test results accurately, such as identifying high cholesterol levels in lipid panels, potentially leading to severe health issues like heart disease or stroke [44]. Therefore, enhancing patients’ ability to understand laboratory test results is of utmost importance.

Risks of chatbots providing “hallucination” information

The application of chatbots in the dissemination of medical information is becoming increasingly widespread, but it has also raised significant concerns regarding the risk of misinformation. While generative AI tools like GPT offer certain advantages in providing information, their inability to explain decision-making processes makes it difficult to identify and correct biases or errors within the model. In the medical field, AI-generated misinformation can lead to severe consequences, including incorrect self-diagnoses, delayed medical care, the spread of potentially harmful diseases, and a loss of public trust in healthcare professionals and institutions. For instance, GPT has provided inaccurate descriptions of FDA-approved high-sensitivity troponin point-of-care testing devices. Such “hallucination” responses are not isolated incidents, highlighting the limitations of generative AI models in ensuring the accuracy of information [56], [57]. Therefore, ensuring the accuracy, reliability, and trustworthiness of AI-generated medical information is of paramount importance. Developers must prioritize this objective during the design and training of AI models, as existing studies demonstrate that the generation of inaccurate information is a widespread phenomenon in such tools.

Legal and ethical issues in chatbot training data

Legal and ethical issues surrounding chatbot training data are critical concerns in the development of generative artificial intelligence (AI) models [58], [59], [60], [61]. These issues not only pertain to the origin of the data but also involve its usage and acceptability, directly impacting the transparency and credibility of the model. Data transparency is of particular importance, ensuring that users and researchers can understand the methods of data collection and utilization, thereby validating the reliability and scientific integrity of the model. Additionally, ethical concerns must not be overlooked, especially when using copyrighted materials and clinical data, as it is imperative to ensure legal and compliant data acquisition and usage while safeguarding the privacy rights of data providers.

Generative AI models like ChatGPT are trained on internet-based data, inevitably inheriting biases present in the training data and reflecting them in their outputs [62]. Studies have shown that LLMs exhibit biases related to gender, race, and other social factors, which not only compromise the fairness of model outputs but also exacerbate social inequalities. These biases are closely linked to issues such as data gaps, insufficient sample sizes, and inherent biases in foundational datasets [63]. If the training data is outdated, the potential biases and errors may be amplified in subsequent model outputs [64]. This problem is particularly evident in existing evaluation studies, especially where topic selection is not randomized, further highlighting the potential for bias in how test queries are chosen.

EU AI Act and FDA clinical decision support (CDS) guidance

The European Union’s Artificial Intelligence Act (AI Act), published in the Official Journal of the EU on July 12, 2024, establishes the world’s first comprehensive regulatory framework for AI, prioritizing ethical development, fundamental rights protection, and human-centric innovation [65]. Adopting a risk-based approach, the Act categorizes AI systems into four tiers: unacceptable risk (banned), high risk (e.g., medical diagnostics, subject to strict safety, transparency, and accountability requirements), general-purpose AI with systemic risk (transparency obligations), and low/no risk (minimal regulation) [66]. In healthcare, stringent standards for high-risk medical AI emphasize rigorous testing, bias mitigation, and post-market monitoring, though gaps persist in addressing patient rights and accountability for low-risk applications. Challenges include uneven implementation across member states, ambiguous liability frameworks, and evolving limitations in detecting AI-generated content [67]. The Act also addresses generative AI’s copyright concerns, mandating transparency in training data and safeguarding intellectual property, while exempting pre-market research [66]. Extraterritorial impacts akin to those of the GDPR are expected, reshaping global AI practices and necessitating AI ethics education. Balancing regulation with innovation, the AI Act aims to foster trustworthy AI but requires clearer guidelines, stakeholder collaboration, and tailored healthcare standards to ensure equitable access and patient safety.

Currently, no LLM is authorized by the FDA as a CDS device. Research has evaluated whether LLMs can adhere to regulations in the context of clinical emergencies. The findings revealed that while LLMs provided appropriate preventive care recommendations, 100 % of responses from GPT-4 and 52 % from Llama-3 were noncompliant in time-critical scenarios, yielding device-like support by suggesting specific diagnoses and treatments [68]. Despite prompts based on FDA guidance, compliance was not consistently achieved. These results highlight the urgent need for new regulatory frameworks to address the unique challenges posed by generative AI in healthcare, as current guidelines are insufficient to prevent unauthorized device-like outputs from LLMs.

Establishing a comprehensive evaluation framework to address the challenges of AI chatbots in healthcare

In response to the risks posed by AI chatbots in the medical field, it is imperative to develop a comprehensive evaluation framework. The first step in ensuring the quality of AI outputs involves establishing strict evaluation criteria that encompass the accuracy, reliability, and relevance of information. Additionally, regular performance testing helps identify system flaws and ensures that AI models remain aligned with updates in medical knowledge. Feedback mechanisms, which gather input from users and professionals, provide valuable data for continuous improvement. The inclusion of professional review mechanisms is equally crucial. The involvement of clinical experts further enhances the reliability and scientific integrity of AI systems, ensuring that the generated information aligns with medical ethics and practice standards.

To achieve this goal, various tools and guidelines have been developed to assist in evaluating and improving the quality of health information [69], [70], [71]. However, existing health information evaluation tools are not specifically tailored to assess the quality of health-related content generated by AI models. Ensuring the reliability and accuracy of AI in medical laboratory environments necessitates the use of standardized evaluation tools. Currently, the “METRICS” and “CLEAR” tools are widely recognized as effective means of assessing AI-generated health information [72], [73].

The METRICS checklist provides a framework for designing and reporting standardized AI research, covering nine key themes: Model, Evaluation, Timing, Transparency, Range, Randomization, Individual factors, Count, and Specificity of prompts and language. These themes aim to comprehensively analyze the design and operational status of AI models, ensuring the scientific rigor and robustness of their output. In contrast, the CLEAR tool focuses on systematically evaluating multiple dimensions of AI-generated content, based on five key criteria: Completeness of content, Lack of false information in the content, Evidence supporting the content, Appropriateness of the content, and Relevance. The specific meanings of each criterion are detailed in Table 1 and Table 2, providing evaluators with a clear guide to ensure that AI-generated content meets predefined quality standards.

Table 1:

Entries for METRICS.

Tool type: METRICS, a checklist of nine items: (1) Model; (2) Evaluation; (3) Timing; (4) Transparency; (5) Range; (6) Randomization; (7) Individual factors; (8) Count; (9) Specificity of prompts and language.

Item content:

  1. Model: the generative AI-based tool(s) used in the included record, with the exact settings of each tool explicitly reported.

  2. Evaluation: the approach to assessing content quality, balancing objectivity (unbiased findings) with subjectivity.

  3. Timing: the exact timing and duration of the generative AI model testing.

  4. Transparency: transparency of data sources, including permissions for copyrighted content.

  5. Range: the scope of tested topics (single/multiple related/unrelated topics; breadth of intra- and inter-topic queries).

  6. Randomization: degree of randomization in topic selection to mitigate bias.

  7. Individual factors: the subjective role in content evaluation and interrater reliability (agreement/disagreement).

  8. Count: the number of queries executed per model (sample size).

  9. Specificity of prompts and language: prompts (exact phrasing, feedback mechanisms, and learning loops) and the language(s) used in testing, including cultural considerations (e.g., linguistic and cultural appropriateness).

Limitations:

  1. Limited literature scope: the literature was retrieved using only a single keyword, which may result in omissions.

  2. Database and language bias: restricting the search to English literature from databases such as Scopus, PubMed, and Google Scholar may introduce selection bias.

  3. Insufficient topics: owing to the single-disciplinary background of the author team, relevant cross-disciplinary topics may have been overlooked.

  4. Subjectivity risk: the selection of topics and the authors’ evaluations (METRICS scoring) may involve subjective judgments.

  5. Equal weighting in scoring: all METRICS items were weighted equally, disregarding differences in importance among the indicators.

  6. Coverage limitations of models: the focus was solely on ChatGPT, Bing, and Bard, without considering other generative models (such as Claude).

Scale rating: each METRICS item is scored on a 5-point Likert scale: 5=excellent, 4=very good, 3=good, 2=satisfactory, and 1=suboptimal.
Table 2:

Entries for CLEAR.

Tool type: CLEAR, a tool of five items: (1) Completeness of content; (2) Lack of false information in the content; (3) Evidence supporting the content; (4) Appropriateness of the content; (5) Relevance.

Item content:

  1. Completeness of content: is the content sufficient? Completeness means that the information is generated in an optimal manner, neither excessive nor deficient.

  2. Lack of false information in the content: is the content accurate? The generation of incorrect health information by AI tools can lead to serious negative consequences.

  3. Evidence supporting the content: is the content evidence-based? The health information provided by the AI model should be based on the latest scientific advancements and avoid bias, misinformation, or false information.

  4. Appropriateness of the content: is the content clear, concise, and easy to understand? The content should provide a single, clear explanation and be well organized in a logical sequence to facilitate understanding.

  5. Relevance: is the content free from irrelevant information? Relevance refers to the necessity of providing accurate and pertinent health content.

Limitations:

  1. Sample limitations: the assessment of the CLEAR tool relied on a small number of homogeneous healthcare professionals, which introduces bias; the test data consisted of artificially generated statements, lacking contextual validation.

  2. Insufficient tool validation: the CLEAR tool has not been compared with other information assessment tools, such as DISCERN, to clarify its strengths and weaknesses; its effectiveness should be further confirmed through testing on broader, especially contentious, health topics.

  3. Impact of AI model dynamics: AI models are continually updated and iterated (e.g., changes from GPT-4 to GPT-5 may yield different results for the same input over time); model performance is sensitive to prompt design, and variations in prompts can lead to discrepancies in results.

  4. Limited external applicability: the current conclusions are based on assessments from a limited disciplinary background, indicating a need for validation involving multi-disciplinary experts (such as AI developers and patient representatives).

Scale rating: each CLEAR item is scored on a 5-point Likert scale: 5=excellent, 4=very good, 3=good, 2=satisfactory, and 1=suboptimal.

In a study by Sallam et al., three generative AI models (ChatGPT, Microsoft Bing, and Google Bard) were evaluated using the METRICS tool [72]. The overall mean METRICS score across these models was 3.0 (SD 0.58). Analyzing individual criteria, the “Model” criterion received the highest average score, followed by “Specificity”. Conversely, the lowest scores were observed for the “Randomization” criterion (classified as suboptimal) and the “Individual factors” criterion (classified as satisfactory). Separately, AI models were assessed using the CLEAR tool across five distinct topics [73]. Microsoft Bing achieved the highest average CLEAR score (mean: 24.4 ± 0.42), followed by ChatGPT-4 (mean: 23.6 ± 0.96), Google Bard (mean: 21.2 ± 1.79), and finally ChatGPT-3.5 (mean: 20.6 ± 5.20).
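As an illustration of how such summary scores are assembled, the sketch below aggregates hypothetical CLEAR ratings (five items, each on the 5-point Likert scale, giving totals of 5–25 per topic) into a mean and standard deviation per model. The numbers are invented and do not reproduce the published data.

from statistics import mean, stdev

# ratings[model][topic] = [completeness, accuracy, evidence, appropriateness, relevance]
ratings = {
    "Model A": {"topic1": [5, 5, 4, 5, 5], "topic2": [5, 4, 5, 5, 5]},
    "Model B": {"topic1": [4, 4, 4, 5, 4], "topic2": [3, 4, 4, 4, 4]},
}

for model, topics in ratings.items():
    totals = [sum(items) for items in topics.values()]   # CLEAR total per topic
    print(f"{model}: mean {mean(totals):.1f} +/- {stdev(totals):.2f} (range 5-25)")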

While the METRICS and CLEAR tools have contributed to the evaluation of AI-generated content, they also have certain limitations (as detailed in Table 1 and Table 2). For example, these tools may fail to cover all variables in specific situations or exhibit limited applicability in certain specialized domains. Therefore, these limitations require validation through further research to confirm their effectiveness and suitability in practical applications. By gaining a deeper understanding of and addressing these limitations, we can enhance the credibility of AI-generated health information, providing a solid foundation for medical decision-making.

Conclusions and outlook

Although AI chatbots face limitations such as the need for advanced knowledge, insufficient reasoning abilities, and limited capacity for in-depth analysis of patient information, their potential in disease diagnosis and laboratory medicine should not be overlooked, as illustrated in Figure 1, which provides a concise summary of the opportunities and challenges associated with chatbots. In the field of laboratory medicine, AI chatbots like GPT are not only capable of providing accurate information but may also reach or exceed the response level of medical professionals in certain key areas [37]. Medical education is a time-intensive process with slow updates, making it challenging for students to grasp all critical points. Therefore, AI chatbots have the potential to serve as important auxiliary tools in medical education, helping to bridge the gap in students’ knowledge acquisition. However, creative problem-solving remains a unique strength of human beings. To fully leverage AI’s potential in medicine, collaboration between AI developers and healthcare professionals is crucial.

Figure 1: Diagram of opportunities and challenges.

Laboratory medicine is a rapidly evolving field, with new technologies and testing methods continuously being developed and integrated into clinical practice. If the limitations of AI chatbots can be effectively addressed, they could become indispensable tools for clinicians and laboratory technicians in their daily work. Overcoming existing challenges could lead to improved diagnostic precision, enhanced patient care quality, and the promotion of personalized medicine. Collaboration between AI and human medical experts will be the foundation for the future development of laboratory medicine.


Corresponding author: Pingan Zhang, Department of Clinical Laboratory, Institute of Translational Medicine, Renmin Hospital of Wuhan University, Wuhan, Hubei Province, P.R. China, E-mail:
Zhili Niu and Xiandong Kuang contributed equally to this work.
  1. Research ethics: Not applicable.

  2. Informed consent: Informed consent was obtained from all individuals included in this study.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: To improve language.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: None declared.

  7. Data availability: Not applicable.

References

1. Kaul, V, Enslin, S, Gross, SA. History of artificial intelligence in medicine. Gastrointest Endosc 2020;92:807–12. https://doi.org/10.1016/j.gie.2020.06.040.

2. Ferraro, S, Panteghini, M. The role of laboratory in ensuring appropriate test requests. Clin Biochem 2017;50:555–61. https://doi.org/10.1016/j.clinbiochem.2017.03.002.

3. Lavoie-Gagne, O, Woo, JJ, Williams, RJ 3rd, Nwachukwu, BU, Kunze, KN, Ramkumar, PN. Artificial intelligence as a tool to mitigate administrative burden, optimize billing, reduce insurance and credentialing-related expenses, and improve quality assurance within healthcare systems. Arthroscopy 2025;41:3270–5. https://doi.org/10.1016/j.arthro.2025.02.038.

4. Johri, S, Jeong, J, Tran, BA, Schlessinger, DI, Wongvibulsin, S, Barnes, LA, et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med 2025;31:77–86. https://doi.org/10.1038/s41591-024-03328-5.

5. Giardina, TD, Baldwin, J, Nystrom, DT, Sittig, DF, Singh, H. Patient perceptions of receiving test results via online portals: a mixed-methods study. J Am Med Inform Assoc 2018;25:440–6. https://doi.org/10.1093/jamia/ocx140.

6. Bar-Lev, S, Beimel, D. Numbers, graphs and words – do we really understand the lab test results accessible via the patient portals? Isr J Health Pol Res 2020;9:58. https://doi.org/10.1186/s13584-020-00415-z.

7. Chu, SKW, Huang, H, Wong, WNM, Ginneken, WFV, Hung, MY. Quality and clarity of health information on Q&A sites. Libr Inf Sci Res 2018;40. https://doi.org/10.1016/j.lisr.2018.09.005.

8. He, Z, Bhasuran, B, Jin, Q, Tian, S, Hanna, K, Shavor, C, et al. Quality of answers of generative large language models versus peer users for interpreting laboratory test results for lay patients: evaluation study. J Med Internet Res 2024;26:e56655. https://doi.org/10.2196/56655.

9. Will ChatGPT transform healthcare? Nat Med 2023;29:505–6. https://doi.org/10.1038/s41591-023-02289-5.

10. Eysenbach, G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: a conversation with ChatGPT and a call for papers. JMIR Med Educ 2023;9:e46885. https://doi.org/10.2196/46885.

11. Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res 2023;25:e50638. https://doi.org/10.2196/50638.

12. Hristidis, V, Ruggiano, N, Brown, EL, Ganta, SRR, Stewart, S. ChatGPT vs Google for queries related to dementia and other cognitive decline: comparison of results. J Med Internet Res 2023;25:e48966. https://doi.org/10.2196/48966.

13. Sallam, M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare 2023;11. https://doi.org/10.3390/healthcare11060887.

14. Rajpurkar, P, Chen, E, Banerjee, O, Topol, EJ. AI in health and medicine. Nat Med 2022;28:31–8. https://doi.org/10.1038/s41591-021-01614-0.

15. Shahsavar, Y, Choudhury, A. User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Hum Factors 2023;10:e47564. https://doi.org/10.2196/47564.

16. Kopanitsa, G. Study of patients’ attitude to automatic interpretation of laboratory test results and its influence on follow-up rate. BMC Med Inform Decis Mak 2022;22:79. https://doi.org/10.1186/s12911-022-01805-w.

17. Sarwar, S, Dent, A, Faust, K, Richer, M, Djuric, U, Van Ommeren, R, et al. Physician perspectives on integration of artificial intelligence into diagnostic pathology. NPJ Digit Med 2019;2:28. https://doi.org/10.1038/s41746-019-0106-0.

18. Ardon, O, Schmidt, RL. Clinical laboratory employees’ attitudes toward artificial intelligence. Lab Med 2020;51:649–54. https://doi.org/10.1093/labmed/lmaa023.

19. Oh, S, Kim, JH, Choi, SW, Lee, HJ, Hong, J, Kwon, SH. Physician confidence in artificial intelligence: an online mobile survey. J Med Internet Res 2019;21:e12422. https://doi.org/10.2196/12422.

20. Tarmissi, K, Alsamri, J, Maashi, M, Asiri, MM, Yahya, AE, Alkharashi, A, et al. Multimodal representations of transfer learning with snake optimization algorithm on bone marrow cell classification using biomedical histopathological images. Sci Rep 2025;15:14309. https://doi.org/10.1038/s41598-025-89529-5.

21. Matek, C, Krappe, S, Münzenmayer, C, Haferlach, T, Marr, C. Highly accurate differentiation of bone marrow cell morphologies using deep neural networks on a large image data set. Blood 2021;138:1917–27. https://doi.org/10.1182/blood.2020010568.

22. Durant, TJS, Olson, EM, Schulz, WL, Torres, R. Very deep convolutional neural networks for morphologic classification of erythrocytes. Clin Chem 2017;63:1847–55. https://doi.org/10.1373/clinchem.2017.276345.

23. Cadamuro, J, Carobene, A, Cabitza, F, Debeljak, Z, De Bruyne, S, van Doorn, W, et al. A comprehensive survey of artificial intelligence adoption in European laboratory medicine: current utilization and prospects. Clin Chem Lab Med 2025;63:692–703. https://doi.org/10.1515/cclm-2024-1016.

24. Yu, S, Jeon, BR, Liu, C, Kim, D, Park, HI, Park, HD, et al. Laboratory preparation for digital medicine in healthcare 4.0: an investigation into the awareness and applications of big data and artificial intelligence. Ann Lab Med 2024;44:562–71. https://doi.org/10.3343/alm.2024.0111.

25. Khullar, D, Casalino, LP, Qian, Y, Lu, Y, Chang, E, Aneja, S. Public vs physician views of liability for artificial intelligence in health care. J Am Med Inform Assoc 2021;28:1574–7. https://doi.org/10.1093/jamia/ocab055.

26. Abràmoff, MD, Tobey, D, Char, DS. Lessons learned about autonomous AI: finding a safe, efficacious, and ethical path through the development process. Am J Ophthalmol 2020;214:134–42. https://doi.org/10.1016/j.ajo.2020.02.022.

27. Bazoukis, G, Hall, J, Loscalzo, J, Antman, EM, Fuster, V, Armoundas, AA. The inclusion of augmented intelligence in medicine: a framework for successful implementation. Cell Rep Med 2022;3:100485. https://doi.org/10.1016/j.xcrm.2021.100485.

28. Wen, Y, Choo, VY, Eil, JH, Thun, S, Pinto Dos Santos, D, Kast, J, et al. Exchange of quantitative computed tomography assessed body composition data using fast healthcare interoperability resources as a necessary step toward interoperable integration of opportunistic screening into clinical practice: methodological development study. J Med Internet Res 2025;27:e68750. https://doi.org/10.2196/68750.

29. Dolin, RH, Heale, BSE, Alterovitz, G, Gupta, R, Aronson, J, Boxwala, A, et al. Introducing HL7 FHIR genomics operations: a developer-friendly approach to genomics-EHR integration. J Am Med Inform Assoc 2023;30:485–93. https://doi.org/10.1093/jamia/ocac246.

30. Vorisek, CN, Lehne, M, Klopfenstein, SAI, Mayer, PJ, Bartschke, A, Haese, T, et al. Fast healthcare interoperability resources (FHIR) for interoperability in health research: systematic review. JMIR Med Inform 2022;10:e35724. https://doi.org/10.2196/35724.

31. Tran, DM, Thanh Dung, N, Minh Duc, C, Ngoc Hon, H, Minh Khoi, L, Phuc Hau, N, et al. Status of digital health technology adoption in 5 Vietnamese hospitals: cross-sectional assessment. JMIR Form Res 2025;9:e53483. https://doi.org/10.2196/53483.

32. Burnside, ES, Grist, TM, Lasarev, MR, Garrett, JW, Morris, EA. Artificial intelligence in radiology: a leadership survey. J Am Coll Radiol 2025;22:577–85. https://doi.org/10.1016/j.jacr.2025.01.006.

33. Takagi, S, Watari, T, Erabi, A, Sakaguchi, K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ 2023;9:e48002. https://doi.org/10.2196/48002.

34. Wang, H, Wu, W, Dou, Z, He, L, Yang, L. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: pave the way for medical AI. Int J Med Inf 2023;177:105173. https://doi.org/10.1016/j.ijmedinf.2023.105173.

35. Lim, ZW, Pushpanathan, K, Yew, SME, Lai, Y, Sun, CH, Lam, JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 2023;95:104770. https://doi.org/10.1016/j.ebiom.2023.104770.

36. Munoz-Zuluaga, C, Zhao, Z, Wang, F, Greenblatt, MB, Yang, HS. Assessing the accuracy and clinical utility of ChatGPT in laboratory medicine. Clin Chem 2023;69:939–40. https://doi.org/10.1093/clinchem/hvad058.

37. Girton, MR, Greene, DN, Messerlian, G, Keren, DF, Yu, M. ChatGPT vs medical professional: analyzing responses to laboratory medicine questions on social media. Clin Chem 2024;70:1122–39. https://doi.org/10.1093/clinchem/hvae093.

38. Meyer, A, Soleman, A, Riese, J, Streichert, T. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum. Clin Chem Lab Med 2024;62:2425–34. https://doi.org/10.1515/cclm-2024-0246.

39. Kaftan, AN, Hussain, MK, Naser, FH. Response accuracy of ChatGPT 3.5, Copilot and Gemini in interpreting biochemical laboratory data: a pilot study. Sci Rep 2024;14:8233. https://doi.org/10.1038/s41598-024-58964-1.

40. Sallam, M, Al-Salahat, K, Al-Ajlouni, E. ChatGPT performance in diagnostic clinical microbiology laboratory-oriented case scenarios. Cureus 2023;15:e50629. https://doi.org/10.7759/cureus.50629.

41. Carey, RB, Bhattacharyya, S, Kehl, SC, Matukas, LM, Pentella, MA, Salfinger, M, et al. Practical guidance for clinical microbiology laboratories: implementing a quality management system in the medical microbiology laboratory. Clin Microbiol Rev 2018;31. https://doi.org/10.1128/cmr.00062-17.

42. Genzen, JR, Tormey, CA. Pathology consultation on reporting of critical values. Am J Clin Pathol 2011;135:505–13. https://doi.org/10.1309/ajcp9izt7bmbcjrs.

43. Li, Y, Huang, CK, Hu, Y, Zhou, XD, He, C, Zhong, JW. Exploring the performance of large language models on hepatitis B infection-related questions: a comparative study. World J Gastroenterol 2025;31:101092. https://doi.org/10.3748/wjg.v31.i3.101092.

44. Cadamuro, J, Cabitza, F, Debeljak, Z, De Bruyne, S, Frans, G, Perez, SM, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin Chem Lab Med 2023;61:1158–66. https://doi.org/10.1515/cclm-2023-0355.

45. Al-Ashwal, FY, Zawiah, M, Gharaibeh, L, Abu-Farha, R, Bitar, AN. Evaluating the sensitivity, specificity, and accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard against conventional drug-drug interactions clinical tools. Drug Healthc Patient Saf 2023;15:137–47. https://doi.org/10.2147/dhps.s425858.

46. Baglivo, F, De Angelis, L, Casigliani, V, Arzilli, G, Privitera, GP, Rizzo, C. Exploring the possible use of AI chatbots in public health education: feasibility study. JMIR Med Educ 2023;9:e51421. https://doi.org/10.2196/51421.

47. Seth, I, Lim, B, Xie, Y, Cevik, J, Rozen, WM, Ross, RJ, et al. Comparing the efficacy of large language models ChatGPT, BARD, and Bing AI in providing information on rhinoplasty: an observational study. Aesthet Surg J Open Forum 2023;5:ojad084. https://doi.org/10.1093/asjof/ojad084.

48. Abusoglu, S, Serdar, M, Unlu, A, Abusoglu, G. Comparison of three chatbots as an assistant for problem-solving in clinical laboratory. Clin Chem Lab Med 2024;62:1362–6. https://doi.org/10.1515/cclm-2023-1058.

49. Chen, L, Zaharia, M, Zou, J. How is ChatGPT’s behavior changing over time? Harv Data Sci Rev 2024. https://doi.org/10.1162/99608f92.5317da47.

50. Giray, L. Prompt engineering with ChatGPT: a guide for academic writers. Ann Biomed Eng 2023;51:2629–33. https://doi.org/10.1007/s10439-023-03272-4.

51. Khlaif, ZN, Mousa, A, Hattab, MK, Itmazi, J, Hassan, AA, Sanmugam, M, et al. The potential and concerns of using AI in scientific research: ChatGPT performance evaluation. JMIR Med Educ 2023;9:e47049. https://doi.org/10.2196/47049.

52. Kochanek, K, Skarzynski, H, Jedrzejczak, WW. Accuracy and repeatability of ChatGPT based on a set of multiple-choice questions on objective tests of hearing. Cureus 2024;16:e59857. https://doi.org/10.7759/cureus.59857.

53. Wang, YM, Shen, HW, Chen, TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc 2023;86:653–8. https://doi.org/10.1097/jcma.0000000000000942.

54. Alfertshofer, M, Hoch, CC, Funk, PF, Hollmann, K, Wollenberg, B, Knoedler, S, et al. Sailing the seven seas: a multinational comparison of ChatGPT’s performance on medical licensing examinations. Ann Biomed Eng 2024;52:1542–5. https://doi.org/10.1007/s10439-023-03338-3.

55. Zikmund-Fisher, BJ, Scherer, AM, Witteman, HO, Solomon, JB, Exe, NL, Fagerlin, A. Effect of harm anchors in visual displays of test results on patient perceptions of urgency about near-normal values: experimental study. J Med Internet Res 2018;20:e98. https://doi.org/10.2196/jmir.8889.

56. van Dis, EAM, Bollen, J, Zuidema, W, van Rooij, R, Bockting, CL. ChatGPT: five priorities for research. Nature 2023;614:224–6. https://doi.org/10.1038/d41586-023-00288-7.

57. Sanderson, K. GPT-4 is here: what scientists think. Nature 2023;615:773. https://doi.org/10.1038/d41586-023-00816-5.

58. Murdoch, B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics 2021;22:122. https://doi.org/10.1186/s12910-021-00687-3.

59. Mijwil, MM, Aljanabi, M, Ali, AH. ChatGPT: exploring the role of cybersecurity in the protection of medical information. Mesopotamian J CyberSecurity 2023;2023. https://doi.org/10.58496/mjcs/2023/004.

60. Chen, Z. Ethics and discrimination in artificial intelligence-enabled recruitment practices. Palgrave Commun 2023;9:12. https://doi.org/10.1057/s41599-023-02079-x.

61. Wang, C, Liu, S, Yang, H, Guo, J, Wu, Y, Liu, J. Ethical considerations of using ChatGPT in health care. J Med Internet Res 2023;25:e48009. https://doi.org/10.2196/48009.

62. Liang, P, Wu, C, Morency, LP, Salakhutdinov, R. Towards understanding and mitigating social biases in language models. PMLR 2021;139:2640–3498.

63. Gianfrancesco, MA, Tamang, S, Yazdany, J, Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med 2018;178:1544–7. https://doi.org/10.1001/jamainternmed.2018.3763.

64. Gao, T, Yen, H, Yu, J, Chen, D. Enabling large language models to generate text with citations. ArXiv 2023.

65. Gasser, U. An EU landmark for AI governance. Science 2023;380:1203. https://doi.org/10.1126/science.adj1627.

66. März, M, Himmelbauer, M, Boldt, K, Oksche, A. Legal aspects of generative artificial intelligence and large language models in examinations and theses. GMS J Med Educ 2024;41:Doc47. https://doi.org/10.3205/zma001702.

67. Alvarado, A. Lessons from the EU AI act. Patterns (N Y) 2025;6:101183. https://doi.org/10.1016/j.patter.2025.101183.

68. Weissman, G, Mankowitz, T, Kanter, G. Large language model non-compliance with FDA guidance for clinical decision support devices. Res Sq 2024. https://doi.org/10.21203/rs.3.rs-4868925/v1.

69. Charnock, D, Shepperd, S, Needham, G, Gann, R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999;53:105–11. https://doi.org/10.1136/jech.53.2.105.

70. Baur, C, Prue, C. The CDC clear communication index is a new evidence-based tool to prepare and review health information. Health Promot Pract 2014;15:629–37. https://doi.org/10.1177/1524839914538969.

71. DeWalt, DA, Broucksou, KA, Hawk, V, Brach, C, Hink, A, Rudd, R, et al. Developing and testing the health literacy universal precautions toolkit. Nurs Outlook 2011;59:85–94. https://doi.org/10.1016/j.outlook.2010.12.002.

72. Sallam, M, Barakat, M, Sallam, M. A preliminary checklist (METRICS) to standardize the design and reporting of studies on generative artificial intelligence-based models in health care education and practice: development study involving a literature review. Interact J Med Res 2024;13:e54704. https://doi.org/10.2196/54704.

73. Sallam, M, Barakat, M, Sallam, M. Pilot testing of a tool to standardize the assessment of the quality of health information generated by artificial intelligence-based models. Cureus 2023;15:e49373. https://doi.org/10.7759/cureus.49373.

Received: 2025-05-01
Accepted: 2025-07-24
Published Online: 2025-08-15

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
