Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation
Abstract
Objectives We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.
Methods Fifty synthetic medical notes in English, each containing a structured and an unstructured part, were drafted and evaluated by domain experts, and subsequently used for LLM prompting. Eighteen LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity extraction and five binary classification tasks, with a total of 450 predictions for each LLM. LLM response consistency was assessed over three same-prompt iterations.
Results Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b exhibited an excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982 and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed a marginally higher multiple-run consistency than the baseline RoBERTa model, and Gemini Advanced a marginally lower one (Krippendorff’s alpha 1, 0.998, 0.996, 0.996, 0.992, 0.991 and 0.989 for the seven LLMs, respectively, vs 0.988 for RoBERTa and 0.985 for Gemini Advanced).
Discussion Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b performed the best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could leverage data for research and unburden healthcare professionals. Real-data analyses are warranted to confirm their performance in a real-world setting.
Conclusion Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b seem to be able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.
What is already known on this topic
Although several large language models (LLMs) are now available, no comprehensive evaluation and comparison of their performance in data extraction from electronic health records has been published to date.
What this study adds
Eighteen LLMs were evaluated against a baseline transformer-based model, with Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b exhibiting excellent performance in data extraction from synthetic electronic health records.
How this study might affect research, practice or policy
Several LLMs seem to be able to reliably extract data from electronic health records. Real-data analyses, carefully handling sensitive patient data, are warranted to confirm LLM performance in a real-world setting.
Introduction
Data are essential for clinical practice, research and quality assessment in the healthcare sector, but their collection from medical records is a time-consuming, strenuous and error-prone task.1–3 Increased time spent on data collection from unstructured and semi-structured medical records, such as referral letters, discharge summaries, radiology and pathology reports, may cause a shift in physician time allocation, leading to decreased time devoted to patient care. Artificial intelligence is increasingly being used in healthcare, with applications in disease diagnosis, outcome prediction and clinical decision-making assistance, which could increase the time physicians spend with patients, and subsequently improve the quality of healthcare.4 5
Large language models (LLMs) have progressed significantly in various domains and diverse tasks, such as named entity recognition, natural language inference and question-answering.6 7 Generative Pre-trained Transformer 4 (GPT 4) is a state-of-the-art multimodal LLM developed by OpenAI (Open AI, San Francisco, CA), which since its release in March 2023, has received significant attention and media coverage for its remarkable performance across diverse natural language processing tasks.6 8 9 This exceptional performance of GPT 4 has led to growing interest in LLMs and their potential in various applications involving natural language processing in numerous domains, including medicine. In a comparative study of LLM-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports, GPT 4 performed either significantly better than or as well as the best supervised model, demonstrating the potential of LLMs to speed up the execution of clinical natural language processing studies.10
The PALISADE checklist, a good practices report of a Task Force of the International Society for Pharmacoeconomics and Outcomes Research (ISPOR), includes entity extraction from electronic health records as one of the valuable use cases of artificial intelligence in health economics and outcomes research.11 The natural language processing capabilities of LLMs could provide a way of extracting data from unstructured and semi-structured medical records and facilitate their import into structured clinical databases.
The aim of this study was to evaluate the performance of multiple LLMs in data extraction tasks involving entity extraction and binary classification, in order to assess their potential for data extraction from unstructured and semi-structured electronic health records.
Methods
Data
Due to the sensitive nature of patient medical records, 50 synthetic patient medical notes were drafted in English and used for LLM prompting. Because only synthetic text was used, approval of the research project by the local ethics committee was not required.
Each medical note comprised two parts. The first part was structured and contained a numerical patient identifier, the hospital admission date and discharge date and the patient’s EuroSCORE 2 value (European System for Cardiac Operative Risk Evaluation). The second part contained unstructured text with information about each patient’s postoperative course.
Each LLM was requested to provide a binary classification of each medical note for the presence or absence of five randomly selected postoperative complications after cardiac surgery: stroke, reoperation due to bleeding or cardiac tamponade, pacemaker implantation, atrial fibrillation and pleural tap. In order to avoid class imbalance, maximise the number of positive and negative classes (presence or absence of a complication, respectively), mitigate the limitations associated with the low number of drafted synthetic medical notes, and better assess the discriminatory ability of each LLM for both positive and negative classes, the medical notes were drafted in such a way that half of them were positive and half were negative for each of the five examined postoperative complications (prevalence of 50% for each examined complication). Random number generation was used to assign each medical record a positive or negative class for each complication, and each medical record was then drafted manually according to its class assignments. No LLMs were used in the development of the synthetic patient medical notes. Additionally, to increase the robustness of the analysis, noise was introduced, consisting of text referring to the occurrence of postoperative complications semantically associated with, but not equivalent to, the examined complications. Further noise, consisting of typos and abbreviations requiring context-dependent disambiguation, was added artificially to better simulate medical notes found in clinical practice. Finally, some connector text about the overall postoperative course was introduced to bridge the gap between different pieces of information and smooth the transition between different text parts.
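As a rough illustration of such a class-balanced random assignment (a minimal sketch, not the authors' actual procedure; the seed and object names are hypothetical), the 50% prevalence per complication could be generated in R as follows:

library(stats)

set.seed(42)  # hypothetical seed, only for reproducibility of the sketch
complications <- c("stroke", "reoperation", "pacemaker",
                   "atrial_fibrillation", "pleural_tap")
n_notes <- 50

# For each complication, exactly 25 notes are positive (1) and 25 negative (0),
# with the order randomised, yielding a 50% prevalence per complication.
class_assignment <- sapply(complications,
                           function(x) sample(rep(c(0L, 1L), each = n_notes / 2)))
rownames(class_assignment) <- paste0("note_", seq_len(n_notes))
head(class_assignment)  # manual drafting of each note would follow this matrix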
The synthetic medical notes were manually drafted by one domain expert (VN) and evaluated by two domain experts (VN, HRCB) for inconsistencies between the predefined text characteristics (presence or absence of the selected postoperative complications) and the finally drafted text. In case of disagreement between the two evaluators, a third domain expert (OD) would reevaluate the synthetic text; however, no reevaluations were required due to complete agreement between VN and HRCB. The synthetic medical notes can be found in the online supplemental material.
LLMs and prompt modelling
Based on information available on LLM platforms and related websites at the beginning of the study, we compiled a list that, although non-exhaustive, contained most of the closed- and open-source LLMs accessible over the internet that were known to us. We assessed the performance of the following LLMs: Open AI GPT 4 and GPT 3.5 (Open AI, San Francisco, CA), Anthropic Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 3.0 Haiku, Claude 2.1, Claude 2.0 and Claude 1.0 (Anthropic, San Francisco, CA), Google Gemini Advanced, Gemini and PaLM 2 chat bison (Google, Mountain View, CA), Meta AI Llama 3-70b, Llama 3-8b, Llama 2-70b chat, Llama 2-13b chat and Llama 2-7b chat (Meta AI, New York City, NY), Mistral 7b instruct (Mistral AI, Paris, France) and Cohere command (Cohere, Toronto, ON). The RoBERTa base model, an older and less sophisticated transformer-based model published in 2019, was used as a baseline to contextualise the performance of the examined LLMs. Because the raw model was pretrained with the masked language modelling objective, we used the RoBERTa base model fine-tuned on the SQuAD2.0 (Stanford Question Answering Dataset) dataset for the task of question answering.
LLM use and prompting were performed through the OpenAI API, https://platform.openai.com/docs/api-reference (for GPT 4 and GPT 3.5), the OpenRouter.ai API, https://openrouter.ai/docs (for Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 3.0 Haiku, Claude 2.1, Claude 2.0, Claude 1.0, PaLM 2 chat bison, Llama 3-70b, Llama 3-8b and Mistral 7b instruct), the Replicate API, https://replicate.com/docs (for Llama 2-70b chat, Llama 2-13b chat and Llama 2-7b chat), the Cohere API, https://cohere.com/ (for Cohere command), and a dedicated API inference endpoint on Hugging Face, https://huggingface.co/deepset/roberta-base-squad2 (for RoBERTa base QA SQuAD 2.0). Due to the non-availability of an API for Gemini Advanced and Gemini in the authors’ country, use and prompting of these models were performed through the Gemini website, https://gemini.google.com. The temperature parameter of each LLM was left at its default value.
Several sets of LLM instructions were engineered iteratively on a few test medical records in a preliminary run, with the aim of retrieving the values of all requested variables in tabular form through a single LLM prompt, one patient at a time. The LLM instruction set and the medical note were entered together in a single prompt. Prompting strategies that entered multiple patients simultaneously, or that entered the LLM instructions only once together with the first patient medical record at the beginning of the data extraction process, were associated with hallucinations and were abandoned. The prompt text describing each of the requested variables and their respective values performed consistently well over all prompt engineering iterations. Due to the differences in the prompt-input structure of the RoBERTa base model, the data extraction questions were entered in its prompt one at a time. The LLM prompting method is presented in the online supplemental material.
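As a minimal sketch of this single-note-per-prompt strategy (not the authors' actual code), an R call to the public OpenAI chat completions endpoint might look as follows; the model identifier, instruction text and note text are placeholders:

library(httr)
library(jsonlite)

extract_from_note <- function(instructions, note,
                              model = "gpt-4",  # hypothetical model identifier
                              api_key = Sys.getenv("OPENAI_API_KEY")) {
  body <- list(
    model = model,
    # temperature deliberately omitted so the API default is used, as in the study
    messages = list(
      list(role = "user",
           # instruction set and one medical note entered together in a single prompt
           content = paste(instructions, note, sep = "\n\n"))
    )
  )
  resp <- POST("https://api.openai.com/v1/chat/completions",
               add_headers(Authorization = paste("Bearer", api_key)),
               content_type_json(),
               body = toJSON(body, auto_unbox = TRUE))
  content(resp)$choices[[1]]$message$content  # tabular text returned by the model
}

Each of the 50 notes would be processed by a separate call of this kind, and the returned table parsed into the response dataset.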
Evaluation
We assessed the performance of each LLM on entity extraction and binary classification tasks. Entity extraction involved two numerical values (patient identifier and EuroSCORE 2 value) and two dates (hospital admission and discharge date). Binary classification involved five dichotomous postoperative outcomes (stroke, reoperation due to bleeding or cardiac tamponade, pacemaker implantation, atrial fibrillation, and pleural tap). The performance of all LLMs was assessed in a zero-shot setting, with the extraction task exclusively dependent on the instructions given in the prompts without any additional illustrative or training examples.
We used accuracy, recall, precision and F1-score to assess the performance of LLMs in binary classification, and accuracy alone to assess their performance in entity extraction. The overall accuracy of each LLM was assessed based on its combined accuracy in both entity extraction and binary classification. The synthetic medical note evaluation by two domain experts (VN, HRCB) was the gold standard for evaluating LLM performance. The gold standard data frame is provided in the online supplemental material. Missing values (when the LLM returned no value for the requested variable) and, in binary classification tasks, values other than ‘0’ or ‘1’ were handled as false values.
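For reference, the binary classification metrics follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Recall} = \frac{TP}{TP + FN},
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.
\]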
The consistency of model responses was assessed over three iterations of the same prompt for each model and was quantified with Krippendorff’s alpha and the number of value agreements over all three iterations. If a model returned at least one false value over the three iterations, a false value was recorded as the model response, and this value was used for the calculation of the LLM performance metrics.
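A minimal sketch of this consistency calculation in R, using the ‘kripp.alpha’ function of the ‘irr’ package on illustrative (hypothetical) data, could look like this:

library(irr)

# Rows are the three same-prompt runs, columns are the requested values
# (coded as 0/1 here for brevity); real output would span all 450 values.
runs <- rbind(
  run1 = c(1, 0, 1, 1, 0),
  run2 = c(1, 0, 1, 1, 0),
  run3 = c(1, 0, 0, 1, 0)  # one disagreement on the third value
)

kripp.alpha(runs, method = "nominal")$value             # multiple-run consistency
sum(apply(runs, 2, function(x) length(unique(x)) == 1)) # values agreeing over all three runs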
The OpenAI API and the OpenRouter.ai API were accessed through a series of programmatic calls executed in R V.4.3.1 (R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria). The Replicate API, the Cohere API and the dedicated API inference endpoint on Hugging Face were accessed through a series of programmatic calls executed in Python V.3.12.2 (Python Software Foundation (2024). The Python Language Reference. Available at https://www.python.org). The function ‘confusionMatrix’ of the ‘caret’ package in R was used to generate the confusion matrices and calculate accuracy, recall, precision and F1-score. The function ‘sample’ of the R base package was used to generate 1000 bootstrapped samples for each LLM response dataset. The numbers of true positives, true negatives, false positives, false negatives and total correct values, as well as accuracy, recall, precision and F1-score, were calculated for each bootstrapped sample, and the 2.5th and 97.5th percentiles of the distribution of each metric over all 1000 bootstrapped samples were taken as the lower and upper 95% confidence limits, respectively. The function ‘kripp.alpha’ of the ‘irr’ package in R was used to calculate Krippendorff’s alpha.
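A simplified sketch of the metric and bootstrap confidence interval computation in R, on hypothetical prediction and gold standard vectors (not the authors' actual scripts), might look as follows:

library(caret)

# Hypothetical vectors: 'truth' holds 250 gold standard classifications,
# 'pred' the corresponding LLM responses, with a few errors introduced.
set.seed(1)
truth <- factor(rep(c(0, 1), each = 125), levels = c(0, 1))
pred  <- truth
flip  <- sample(length(pred), 5)
pred[flip] <- ifelse(pred[flip] == "0", "1", "0")

# Confusion matrix with accuracy, recall (sensitivity), precision and F1-score
confusionMatrix(pred, truth, positive = "1", mode = "everything")

# Non-parametric bootstrap: resample prediction/gold-standard pairs 1000 times
boot_acc <- replicate(1000, {
  idx <- sample(length(truth), replace = TRUE)
  mean(pred[idx] == truth[idx])
})
quantile(boot_acc, c(0.025, 0.975))  # 2.5th and 97.5th percentiles as 95% CI limits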
Results
The results of the performance assessment of all 19 models (18 LLMs and the baseline RoBERTa model) over a total of 200 entity extraction and 250 binary classification tasks are provided in tables 1 and 2. The overall accuracy of the examined LLMs, ranked in descending order, is graphically presented in figure 1. Claude 3.0 Opus exhibited the highest overall accuracy (0.995), with seven other LLMs exhibiting an overall accuracy over 0.98 (Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison, Llama 3-70b). Fifteen LLMs exhibited a higher accuracy than the baseline RoBERTa model (0.742).
Figure 1 Overall accuracy of the large language models ranked in descending order. The baseline model is highlighted with a coloured border.
Table 1 LLM performance metrics
Table 2 Large language model multiple-run consistency performance metrics, ranked in descending order of Krippendorff’s alpha value
The accuracy of each LLM in entity extraction, ranked in descending order, is graphically presented with radar charts in figure 2. Overall, nine LLMs (Claude 3.0 Opus, Claude 2.0, GPT 4, Claude 2.1, PaLM 2 chat bison, Llama 3-70b, GPT 3.5, Llama 2-70b chat, Llama 2-13b chat) exhibited an accuracy of 1 in entity extraction, without returning any false values. Seventeen LLMs had a higher accuracy than the baseline RoBERTa model, which returned 40 false values out of 200. The accuracy of each LLM in binary classification, ranked in descending order, is graphically presented with radar charts in figure 3. Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0 and GPT 4 exhibited the highest scores, with two, three, five and five false values, respectively. Fourteen LLMs showed a higher accuracy than the baseline RoBERTa model, which returned 76 false values out of 250.
Figure 3 Radar chart of binary classification accuracy of LLMs ranked in descending order. LLM, large language model.
Claude 2.0 exhibited the most consistent performance over multiple runs of the same prompts, with perfect response agreement over all three prompt iterations and, consequently, a maximal Krippendorff’s alpha value of 1. Another six LLMs (Claude 2.1, Claude 3.0 Opus, PaLM 2 chat bison, GPT 4, Claude 3.0 Sonnet, Llama 3-70b) showed a higher multiple-run consistency than the baseline RoBERTa model (Krippendorff’s alpha value 0.988).
Missing values were observed in the output of Cohere command, Llama 2-7b chat, Mistral 7b instruct, PaLM 2 chat bison, Gemini and Llama 2-70b chat (74, 55, 18, 2, 1 and 1 values, respectively, summed over all three runs). Non-requested values (values unmatched to the requested entity extraction and classification tasks) were observed in the output of Llama 2-13b chat, Claude 1 and Mistral 7b instruct (97, 25 and 2 values, respectively, summed over all three runs). Misclassification to values other than ‘0’ or ‘1’ was observed in Mistral 7b instruct, Gemini Advanced and Llama 2-13b chat (7, 2 and 1 values, respectively, summed over all three runs). Because the output of the baseline RoBERTa model on the binary classification tasks was free text, with no explicit binary response of ‘0’ or ‘1’, the value misclassification assessment was not applicable to this model.
The median (first and third quartile) text length was 192 (152–239) words per patient medical note. The additional instructions provided at the beginning of each LLM prompt were 413 words long.
Discussion
Even though the crucial role of data in clinical practice, healthcare research and quality assessment cannot be questioned, data collection from medical records is a time-consuming and error-prone task that may shift physician time allocation away from direct patient care.1–3 Automating these tasks with the use of LLMs can help unburden physicians from data collection tasks and increase the quantity and quality of time spent with patients. LLMs can be used in data collection from unstructured and semi-structured medical records, such as referral letters, discharge summaries, radiology reports, pathology reports, and blood tests, and, thus, can assist physicians and other healthcare professionals directly with the systematic processing of previous medical information, and indirectly with clinical decision-making. Furthermore, such implementation of LLMs can help leverage medical data in under-resourced hospitals, and assist or undertake data collection for newly established and previously non-existent databases, processing large quantities of unstructured and semi-structured medical notes stored in electronic health records.12
The process of manually extracting data from medical records is tedious rather than complicated or sophisticated, and is performed mainly by junior doctors or other adequately trained healthcare professionals. Modern LLMs have exhibited remarkably high, human-level performance on various professional and academic benchmarks in numerous domains, including medicine.8 In an analysis by Nori et al, GPT 4 exceeded the passing score of the United States Medical Licensing Examination (USMLE) by over 20 points in a zero-shot setting.13 In view of this notable performance of GPT 4 in medical licensing examinations, we hypothesised that GPT 4 and other modern LLMs might be able to reliably perform the relatively low-complexity tasks of entity extraction from, and binary classification of, text from medical notes.
Eight LLMs showed an accuracy of over 0.98. Claude 3.0 Opus exhibited the highest performance overall, returning a correct value for all but two of the 450 requested values and achieving the highest performance metrics of all LLMs assessed in this study, with an accuracy over 0.99. Claude 3.0 Sonnet, Claude 2.0 and GPT 4 exhibited the second best performance in terms of number of correct values and accuracy, each returning five false values. They were followed by Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b, which returned six, eight, eight and eight false values, respectively. All false values returned by Claude 3.0 Opus, Claude 2.0, GPT 4, Claude 2.1, PaLM 2 chat bison and Llama 3-70b were binary classifications, whereas Claude 3.0 Sonnet and Gemini Advanced additionally returned two and one false values, respectively, in entity extraction. None of the eight highest performing LLMs returned non-requested/unmatched values. Except for two missing values for PaLM 2 chat bison and two misclassifications to values other than the ones explicitly requested in the LLM prompt for Gemini Advanced, no missing values or misclassifications were returned by Claude 3.0 Opus, Claude 2.0, Claude 3.0 Sonnet, GPT 4, Claude 2.1 and Llama 3-70b.
The abovementioned eight highest performing LLMs outperformed the baseline RoBERTa model, with at least 32% higher overall accuracy, at least 140% higher recall and at least 70% higher F1-score. Regarding precision, Claude 3.0 Opus, GPT 4 and Gemini Advanced exhibited a marginally higher (maximally 0.4%), and Claude 3.0 Sonnet, Claude 2.0, Claude 2.1, PaLM 2 chat bison and Llama 3-70b a slightly lower (maximally −4.1%), precision than the baseline RoBERTa model. Finally, regarding multiple-run response consistency, Claude 2.0 exhibited the highest performance, with perfect response agreement over all three runs, whereas another six of the eight highest performing LLMs showed a marginally higher (maximally 1.2%), and only Gemini Advanced a marginally lower (−0.3%), consistency than the baseline RoBERTa model.
Different LLMs have different context lengths, and recent LLMs offer increasingly large context windows; however, concerns exist about LLM performance on longer input texts. In a study analysing LLM performance on multi-document question answering and key-value retrieval, performance degraded significantly when the position of relevant information in the input text was changed. In particular, the highest performance was often observed when information was placed at the beginning or end of the input text, and the lowest when information was placed in the middle of long texts, even for explicitly long-context LLMs.14 Another study, assessing long-context LLM performance on different input text lengths, found large performance drops as the input text length increases, indicating that the effective context length can be lower than the claimed context length, with some LLMs exhibiting larger performance drops than others.15 As a result, increasing clinical note length may have a negative impact on the performance of several LLMs.
The accuracy of a diagnostic model varies directly with disease prevalence, and the upper and lower bounds of accuracy are determined by the model’s sensitivity and specificity. In a population with a disease prevalence of 100%, the accuracy of a model equals its sensitivity; in a population with a disease prevalence of 0%, accuracy equals specificity. Between these bounds, accuracy varies linearly with disease prevalence, and if disease prevalence equals 50%, a model’s accuracy is exactly midway between its sensitivity and specificity.16 Consequently, for rare conditions with low prevalence, LLM accuracy is expected to approximate LLM specificity. We calculated LLM specificity and LLM classification accuracy, and, as expected, due to the 50% prevalence of each examined complication, the LLM classification accuracy was the exact mean of the LLM recall and LLM specificity (online supplemental material).
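Written explicitly, with prevalence $p$, sensitivity (recall) $Se$ and specificity $Sp$, accuracy is the prevalence-weighted mean of sensitivity and specificity:

\[
\text{Accuracy} = p \cdot Se + (1 - p) \cdot Sp,
\qquad \text{so that for } p = 0.5:\quad \text{Accuracy} = \frac{Se + Sp}{2}.
\]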
An analysis of the accuracy of manual data collection from electronic health records observed an average transcription error rate of 9.1% per patient dataset when retrieving a dataset containing 27 variables.17 Even higher extraction error rates, up to 50% in some cases, were found in a systematic review assessing the frequency of manual data extraction errors in the setting of meta-analyses.18 In our study, none of the evaluated LLMs achieved completely error-free performance; however, considering the high transcription error rates observed in some studies, the performance of the highest performing LLMs should be viewed as outstanding.
LLMs have shown better performance in few-shot compared with zero-shot evaluations.13 However, in this analysis, only a zero-shot evaluation was performed, as this better approximates how human evaluators process medical notes in everyday practice. Another issue occasionally occurring when using LLMs is ‘hallucinations’ or confabulated responses, where the model’s response does not seem to be justified by the input text or training data.19 In our study, some LLMs exhibited a few confabulated responses, in which non-requested values, or values unmatched to the requested entity extraction and classification tasks, were observed in the LLM output.
A major issue complicating the use of LLMs for data extraction from real patient records in a real-world setting is the submission of sensitive patient data to out-of-hospital computer infrastructure, which, due to the associated patient data security risks, requires approval by the hospital data governance board and the local ethics committee and presents a serious, if not insurmountable, challenge. Even though this issue concerns LLMs running in out-of-hospital cloud infrastructure, local installations of open-source LLMs might be able to deal successfully with it, although running an LLM locally can be computationally intensive and require significant computational resources. Deidentifying and date-shifting clinical data, as used by Sushil et al in a study of LLM-based classification of breast cancer pathology reports, can be another solution, even though deidentifying discharge summaries and referral letters could be technically challenging.10
Even though some LLMs exhibited an impressively high performance in data extraction from medical notes, the results of this study should still be interpreted with caution. The present analysis was based on relatively short synthetic medical notes and a small number of entity extraction and binary classification tasks, with no assessment of classification performance for categorical variables with more than two response levels. Furthermore, as the clinical notes were drafted by only one domain expert, text variability can be expected to be relatively limited, which might further affect the generalisability of these results to different institutions, regions and countries with different documentation styles and abbreviations. Although noise consisting of typos and abbreviations requiring context-dependent disambiguation was added to the synthetic text to better simulate healthcare practice, the clinical notes might still be over-processed and not fully representative of real-world text. Finally, this study assessed LLM performance only on data extraction from medical notes related to the postoperative course after cardiac surgery, and for a small number of related variables/complications; therefore, the generalisability of these results to other medical specialties or contexts might be limited, and further research assessing LLM performance in other use cases is warranted. A key point regarding the use of LLMs for data extraction from medical records is that, as is the case with humans, LLMs are limited to identifying only what is documented in these records.
However, the strengths of this study should also be emphasised. Even though non-exhaustive in the number of LLMs included, our study assessed a large number of LLMs, evaluating the performance of 18 LLMs and comparing them against a fine-tuned question-answering version of RoBERTa base, an older and less sophisticated transformer-based model; to our knowledge, our LLM list comprises most known closed- and open-source LLMs. Moreover, a model response consistency assessment was performed over three iterations of the same prompt for each model. Although the text entered in the LLM prompts was synthetic, it was carefully drafted and evaluated by domain experts to represent medical notes commonly found in clinical practice; in addition, it was class-balanced in order to better assess the discriminatory ability of each LLM for both positive and negative classes, desirable attributes rarely present in real-world medical records; finally, to increase the robustness of the analysis, noise was introduced into the text. Consequently, the results of this study are suggestive of the performance of the evaluated LLMs in a real-world setting.
Conclusions
In our study, Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat bison and Llama 3-70b exhibited outstanding performance in both entity extraction from and binary classification of synthetic medical notes, with highly consistent responses over multiple same-prompt iterations. Their use could help leverage data for clinical practice, healthcare research and quality assessment, and increase patient care quality by unburdening physicians and other healthcare professionals. Further and larger-scale analyses on real medical records should be conducted to confirm their performance in a real-world setting.
Contributors: VN: conceptualisation, methodology, formal analysis, investigation, resources, data curation, writing – original draft, writing – review and editing, visualisation. HRCB: conceptualisation, writing – review and editing, visualisation. IT, NP, DO, PR, AH: conceptualisation, writing – review and editing. OD: conceptualisation, resources, writing – review and editing, supervision, project administration, guarantor.
Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests: None declared.
Provenance and peer review: Not commissioned; externally peer reviewed.
Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information.
Ethics statements
Patient consent for publication:
Not applicable.
Ethics approval:
Not applicable.
Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Ann Intern Med 2016;165:753–60.
Joukes E, Abu-Hanna A, Cornet R, et al. Time Spent on Dedicated Patient Care and Documentation Tasks Before and After the Introduction of a Structured and Standardized Electronic Health Record. Appl Clin Inform 2018;9:46–53.
Yin AL, Guo WL, Sholle ET, et al. Comparing automated vs. manual data collection for COVID-specific medications from electronic health records. Int J Med Inform 2022;157:104622.
Alowais SA, Alghamdi SS, Alsuhebany N, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 2023;23.
Adamson B, Waskom M, Blarre A, et al. Approach to machine learning for extraction of real-world data variables from electronic health records. Front Pharmacol 2023;14.
Padula WV, Kreif N, Vanness DJ, et al. Machine Learning Methods in Health Economics and Outcomes Research-The PALISADE Checklist: A Good Practices Report of an ISPOR Task Force. Value Health 2022;25:1063–80.
Sushil M, Zack T, Mandair D, et al. A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports. J Am Med Inform Assoc 2024;31:2315–27.
Shaffer JG, Doumbia SO, Ndiaye D, et al. Development of a data collection and management system in West Africa: challenges and sustainability. Infect Dis Poverty 2018;7:125.
Mathes T, Klaßen P, Pieper D, et al. Frequency of data extraction errors and methods to increase data extraction quality: a methodological review. BMC Med Res Methodol 2017;17:152.