Investigating Potential Gender Differences in ChatGPT-Diagnosed Clinical Vignettes

Authors

  • Anjali Mediboina, Alluri Sitarama Raju Academy of Medical Sciences
  • Meghana Bhupathi, Alluri Sitarama Raju Academy of Medical Sciences, https://orcid.org/0000-0002-1070-1033
  • Keerthana Janapareddy, Gayatri Vidya Parishad Institute of Health Care and Medical Technology

Keywords:

AI, ChatGPT, gender bias, healthcare, misdiagnosis

Abstract

BACKGROUND: The integration of artificial intelligence (AI) into medical decision-making introduces new concerns, particularly regarding information bias within AI models such as ChatGPT, which rely heavily on their training data. With gender-based disparities in diagnosis and treatment well documented in healthcare, there is a pressing need to evaluate the potential of AI models to perpetuate or alleviate these gender biases.

AIMS: This study seeks to investigate gender differences in diagnostic accuracy within ChatGPT 3.5 by evaluating the accuracy and completeness of its responses to various clinical vignettes.

METHODS: Ten medical conditions (including psychiatric, respiratory, cardiac, and cerebrovascular cases) previously reported to be subject to gender-based misdiagnosis were selected for the study. Two identical clinical vignettes were created for each condition, differing only in the gender of the patient. These 20 vignettes were entered into ChatGPT 3.5 in random order by a single researcher, each accompanied by a prompt requesting the most likely explanation for the patient’s symptoms and the next appropriate step in management. The responses generated by ChatGPT were evaluated for accuracy and completeness by two independent evaluators using the scale described by Johnson et al.: a six-point Likert scale ranging from 1 (completely incorrect) to 6 (correct) for accuracy, and a three-point scale ranging from 1 (incomplete) to 3 (comprehensive) for completeness. Discrepancies were resolved through a blind consensus process. Data analysis and visualization were performed using RStudio v4.3.2; the relationship between accuracy and completeness was assessed with Spearman’s rank correlation, and differences in scores were tested with the Mann-Whitney U test.
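A minimal R sketch of the statistical comparison described above, assuming the consensus scores are collected in a small data frame; the object names and score values shown here are illustrative placeholders, not the study's data:

  # Hypothetical consensus scores for the 20 vignettes (placeholder values only)
  scores <- data.frame(
    accuracy     = c(rep(6, 14), 5, 5, 4, 5, 4, 5),  # accuracy Likert score, 1-6
    completeness = c(rep(3, 10), rep(2, 10)),        # completeness score, 1-3
    diagnosis    = factor(c(rep("correct", 14), rep("incorrect", 6)))
  )

  # Spearman's rank correlation between accuracy and completeness
  cor.test(scores$accuracy, scores$completeness, method = "spearman", exact = FALSE)

  # Mann-Whitney U test (wilcox.test in R) comparing accuracy between correctly
  # and incorrectly diagnosed vignettes; exact = FALSE uses the normal
  # approximation, which yields a z-based p-value
  wilcox.test(accuracy ~ diagnosis, data = scores, exact = FALSE)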

RESULTS: Among the 20 cases, six were incorrectly diagnosed, with two instances attributed to gender-based misdiagnosis. Specifically, ChatGPT misclassified ectopic pregnancy as appendicitis and paroxysmal supraventricular tachycardia (PSVT) as a panic attack in female patients, despite indicative symptoms and correct diagnoses in the corresponding male vignettes. Additionally, systemic lupus erythematosus (SLE) was inaccurately labeled as rheumatoid arthritis (RA) in both the male and female patients. Eating disorders were also misidentified, with ChatGPT failing to provide a definitive diagnosis for these conditions. The overall median accuracy score was 6 (Mean = 5.5, SD = 0.6), while the median completeness score was 2.5 (Mean = 2.5, SD = 0.5). Correlation analysis indicated a non-significant relationship between accuracy and completeness (Spearman's rs = 0.23139, p = 0.3263), although the Mann-Whitney U test suggested a significant difference in accuracy between correctly and incorrectly diagnosed cases (z-score = 5.39649, p < .00001).

CONCLUSION: While the AI's responses were generally accurate and complete, the observed misdiagnoses of conditions such as PSVT and eating disorders highlight the need for a more thorough examination of potential biases in AI-driven chatbots. The differing outcomes of the Spearman's rank correlation and Mann-Whitney U tests indicate that, although there may not be a consistent monotonic relationship between accuracy and completeness, ChatGPT's performance differs significantly across scenarios, warranting further investigation. Moreover, the small number of vignettes may not fully capture the extent of potential biases. Despite these limitations, the findings underscore the complexity of AI in healthcare and the critical importance of continuous scrutiny and refinement of these models.

Author Biographies

Meghana Bhupathi, Alluri Sitarama Raju Academy of Medical Sciences

Department of Internal Medicine, Tutor

Keerthana Janapareddy, Gayatri Vidya Parishad Institute of Health Care and Medical Technology

1st Year Medical Student

References

Sagynbekov K. Gender-based health disparities: a state-level study of the American adult population. Milken Institute. 2017.

Lenti MV, Di Sabatino A. Disease- and gender-related characteristics of coeliac disease influence diagnostic delay. European Journal of Internal Medicine. 2021;83:12-3.

Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Research Square. 2023.

Published

2025-01-01

How to Cite

Mediboina, A., Bhupathi, M., & Janapareddy, K. (2025). Investigating Potential Gender Differences in ChatGPT-Diagnosed Clinical Vignettes. International Journal of Medical Students, 12, S376. Retrieved from https://ijms.pitt.edu/IJMS/article/view/2881