BMJ

Current Issue Highlights

Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis
December 28, 2024: 387:e081948
ABSTRACT
OBJECTIVE
To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.
DESIGN
Cross sectional analysis.
SETTING
Online interaction with large language models via text based prompts.
PARTICIPANTS
Publicly available large language models, or “chatbots”: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).
ASSESSMENTS
The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test.
MAIN OUTCOME MEASURES
MoCA scores, performance in visuospatial/executive tasks, and Stroop test results.
RESULTS
ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.
CONCLUSIONS
With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: “older” chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients' confidence.
Dexterity assessment of hospital workers: prospective comparative study
December 28, 2024: 387:e081814
ABSTRACT
OBJECTIVES
To compare the manual dexterity and composure under pressure of people in different hospital staff roles using a buzz wire game.
DESIGN
Prospective, observational, comparative study (Tremor study).
SETTING
Leeds Teaching Hospitals NHS Trust, Leeds, UK, during a three week period in 2024.
PARTICIPANTS
254 hospital staff members comprising 60 physicians, 64 surgeons, 69 nurses, and 61 non-clinical staff.
MAIN OUTCOME MEASURES
Successful completion of the buzz wire game within five minutes and occurrence of swearing and audible noises of frustration.
RESULTS
Of the 254 hospital staff who participated, surgeons had significantly higher success rates in completing the buzz wire game within five minutes (84%, n=54) compared with physicians (57%, n=34), nurses (54%, n=37), and non-clinical staff (51%, n=31) (P<0.001). Time-to-event analysis showed that surgeons were quicker to successfully complete the game, independent of age and gender. Surgeons exhibited the highest rate of swearing during the game (50%, n=32), followed by nurses (30%, n=21), physicians (25%, n=15), and non-clinical staff (23%, n=14) (P=0.004). Non-clinical staff showed the highest use of frustration noises (75%), followed by nurses (68%), surgeons (58%), and physicians (52%) (P=0.03).
CONCLUSIONS
Surgeons showed greater dexterity, but higher levels of swearing compared with other hospital staff roles, while nurses and non-clinical staff showed the highest rates of audible noises of frustration. The study highlights the diverse skill sets across hospital staff roles. Implementation of a surgical swear jar initiative should be considered for future fundraising events.
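The success percentages in the results can be reproduced from the reported counts and group sizes. The following is a minimal sketch of that arithmetic only; the dictionary and function names are illustrative and are not taken from the study.

```python
# Group sizes and success counts as reported in the abstract.
group_sizes = {"surgeons": 64, "physicians": 60, "nurses": 69, "non-clinical": 61}
successes = {"surgeons": 54, "physicians": 34, "nurses": 37, "non-clinical": 31}


def success_rates(counts, sizes):
    """Return each group's completion rate as a whole-number percentage."""
    return {role: round(100 * counts[role] / sizes[role]) for role in sizes}


rates = success_rates(successes, group_sizes)
for role, pct in rates.items():
    print(f"{role}: {pct}%")
```

This checks only the descriptive percentages; it does not reproduce the significance test or the time-to-event analysis.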
Coaching inexperienced clinicians before a high stakes medical procedure: randomized clinical trial
December 28, 2024: 387:e080924
ABSTRACT
OBJECTIVE
To assess whether training provided to an inexperienced clinician just before performing a high stakes procedure can improve procedural care quality, measuring the first attempt success rate of trainees performing infant orotracheal intubation.
DESIGN
Randomized clinical trial.
SETTING
Single center, quaternary children's hospital in Boston, MA, USA.
PARTICIPANTS
A non-crossover, prospective, parallel group, non-blinded, trial design was used. Volunteer trainees comprised pediatric anesthesia fellows, residents, and student registered nurse anesthetists from 10 regional training programs during their pediatric anesthesiology rotation. Trainees were block randomized by training roles. Inclusion criteria were trainees intubating infants aged ≤12 months with an American Society of Anesthesiologists physical status classification of I-III. Exclusion criteria were trainees intubating infants with cyanotic congenital heart disease, known or suspected difficult or critical airways, pre-existing abnormal baseline oxygen saturation <96% on room air, endotracheal or tracheostomy tubes in situ, emergency cases, or covid-19 infection.
INTERVENTIONS
Trainee treatment group received preoperative just-in-time expert intubation coaching on a manikin within one hour of infant intubation; control group carried out standard practice (receiving unstructured intraoperative instruction by attending pediatric anesthesiologists).
MAIN OUTCOME MEASURES
Primary outcome was the first attempt success rate of intraoperative infant intubation. Modified intention-to-treat analysis used generalized estimating equations to account for multiple intubations per trainee participant. Secondary outcomes were complication rates, cognitive load of intubation, and competency metrics.
RESULTS
250 trainees were assessed for eligibility; 78 were excluded, 172 were randomized, and 153 were subsequently analyzed. Between 1 August 2020 and 30 April 2022, 153 trainees (83 control, 70 treatment) did 515 intubations (283 control, 232 treatment). In modified intention-to-treat analysis, first attempt success was 91.4% (212/232) in the trainee treatment group and 81.6% (231/283) in the control group (odds ratio 2.42 (95% confidence interval 1.45 to 4.04), P=0.001). Secondary outcomes favored the intervention, showing significance for decreased cognitive load and improved competency. Complications were lower for the intervention than for the control group but the difference was not significant.
CONCLUSIONS
Just-in-time training among inexperienced clinicians led to increased first attempt success of infant intubation. Integration of a just-in-time approach into airway management could improve patient safety, and these findings could help to improve high stakes procedures more broadly. Randomized evaluation in other settings is warranted.
TRIAL REGISTRATION
ClinicalTrials.gov NCT04472195.
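For orientation, an unadjusted odds ratio can be computed directly from the first attempt success counts in the results (212/232 treatment, 231/283 control). This is a rough sketch, not the trial's analysis: it ignores the clustering of multiple intubations per trainee that the generalized estimating equations account for, so it is expected to differ somewhat from the reported 2.42. The function name is illustrative.

```python
def crude_odds_ratio(success_a, total_a, success_b, total_b):
    """Unadjusted odds ratio of success in group A relative to group B."""
    odds_a = success_a / (total_a - success_a)
    odds_b = success_b / (total_b - success_b)
    return odds_a / odds_b


# Counts reported in the abstract: treatment 212/232, control 231/283.
or_crude = crude_odds_ratio(212, 232, 231, 283)
print(f"Crude odds ratio: {or_crude:.2f}")
```

The crude estimate comes out close to, but slightly below, the model based estimate reported in the abstract.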
Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study
December 21, 2024: 387:e072826
ABSTRACT
OBJECTIVE
To determine whether an artificial intelligence candidate could pass the rapid (radiographic) reporting component of the Fellowship of the Royal College of Radiologists (FRCR) examination.
DESIGN
Prospective multi-reader diagnostic accuracy study.
SETTING
United Kingdom.
PARTICIPANTS
One artificial intelligence candidate (Smarturgences, Milvue) and 26 radiologists who had passed the FRCR examination in the preceding 12 months.
MAIN OUTCOME MEASURES
Accuracy and pass rate of the artificial intelligence compared with radiologists across 10 mock FRCR rapid reporting examinations (each examination containing 30 radiographs, requiring 90% accuracy rate to pass).
RESULTS
When non-interpretable images were excluded from the analysis, the artificial intelligence candidate achieved an average overall accuracy of 79.5% (95% confidence interval 74.1% to 84.3%) and passed two of 10 mock FRCR examinations. On average, radiologists achieved an accuracy of 84.8% (76.1-91.9%) and passed four of 10 mock examinations. The sensitivity for the artificial intelligence was 83.6% (95% confidence interval 76.2% to 89.4%) and the specificity was 75.2% (66.7% to 82.5%), compared with summary estimates across all radiologists of 84.1% (81.0% to 87.0%) and 87.3% (85.0% to 89.3%). Across 148/300 radiographs that were correctly interpreted by >90% of radiologists, the artificial intelligence candidate was incorrect in 14/148 (9%). In 20/300 radiographs that most (>50%) radiologists interpreted incorrectly, the artificial intelligence candidate was correct in 10/20 (50%). Most imaging pitfalls related to interpretation of musculoskeletal rather than chest radiographs.
CONCLUSIONS
When special dispensation for the artificial intelligence candidate was provided (that is, exclusion of non-interpretable images), the artificial intelligence candidate was able to pass two of 10 mock examinations. Potential exists for the artificial intelligence candidate to improve its radiographic interpretation skills by focusing on musculoskeletal cases and learning to interpret radiographs of the axial skeleton and abdomen that are currently considered “non-interpretable.”
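The pass criterion stated in the outcome measures (90% accuracy on a 30 radiograph examination) translates into a minimum number of correct reports. The sketch below assumes every image in an examination is scored; how excluded non-interpretable images change the denominator is not stated in the abstract, so the `n_images` parameter is left adjustable. Function and parameter names are illustrative.

```python
def min_correct_to_pass(n_images=30, required_percent=90):
    """Smallest number of correct reports that meets the required accuracy."""
    # Integer ceiling of n_images * required_percent / 100.
    return (n_images * required_percent + 99) // 100


def passes(n_correct, n_images=30, required_percent=90):
    """True if n_correct out of n_images meets the pass threshold."""
    return n_correct >= min_correct_to_pass(n_images, required_percent)


threshold = min_correct_to_pass()
print(f"Correct reports needed out of 30: {threshold}")
```

Under this assumption a candidate can misreport at most three of 30 radiographs and still pass.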
Alzheimer's disease mortality among taxi and ambulance drivers: population based cross sectional study
December 21, 2024: 387:e082194
ABSTRACT
OBJECTIVE
To analyze mortality attributed to Alzheimer's disease among taxi drivers and ambulance drivers, occupations that demand frequent spatial and navigational processing, compared with other occupations.
DESIGN
Population based cross-sectional study.
SETTING
Use of death certificates from the National Vital Statistics System in the United States, which were linked to occupation, 1 January 2020-31 December 2022.
PARTICIPANTS
Deceased adults aged 18 years and older.
MAIN OUTCOME MEASURES
Among 443 occupations studied, percentage of deaths attributed to Alzheimer's disease for taxi drivers and ambulance drivers and each of the remaining 441 occupations, adjusting for age at death and other sociodemographic factors.
RESULTS
Of 8 972 221 people who had died with occupational information, 3.88% (348 328) had Alzheimer's disease listed as a cause of death. Among taxi drivers, 1.03% (171/16 658) died from Alzheimer's disease, while among ambulance drivers, the rate was 0.74% (10/1348). After adjustment, ambulance drivers (0.91% (95% confidence interval 0.35% to 1.48%)) and taxi drivers (1.03% (0.87% to 1.18%)) had the lowest proportion of deaths due to Alzheimer's disease of all occupations examined. This trend was not observed in other transportation related jobs that are less reliant on real time spatial and navigational processing or for other types of dementia. Results were consistent whether Alzheimer's disease was recorded as an underlying or contributing cause of death.
CONCLUSIONS
Taxi drivers and ambulance drivers, occupations involving frequent navigational and spatial processing, had the lowest proportions of deaths attributed to Alzheimer's disease of all occupations.
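The unadjusted percentages quoted in the results follow directly from the reported death counts. The sketch below reproduces only that crude arithmetic; the adjusted estimates (for age at death and other sociodemographic factors) require the underlying data and model and are not reproduced here. Names are illustrative.

```python
# (Alzheimer's disease deaths, all deaths) as reported in the abstract.
reported_counts = {
    "all occupations": (348_328, 8_972_221),
    "taxi drivers": (171, 16_658),
    "ambulance drivers": (10, 1_348),
}


def crude_percentage(deaths, total):
    """Share of deaths attributed to Alzheimer's disease, as a percentage."""
    return round(100 * deaths / total, 2)


crude = {group: crude_percentage(d, n) for group, (d, n) in reported_counts.items()}
for group, pct in crude.items():
    print(f"{group}: {pct}%")
```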

Latest Articles

Coaching inexperienced clinicians before a high stakes medical procedure: randomized clinical trial
BMJ. December 28, 2024: 387:e080924
Dexterity assessment of hospital workers: prospective comparative study
BMJ. December 28, 2024: 387:e081814
Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis
BMJ. December 28, 2024: 387:e081948


Most Popular Articles

Average length of a flaccid and erect penis is published to help counsellors
BMJ. March 7, 2015: 350:h1193
Savile's 50 year reign of abuse spanned 28 NHS hospitals, investigation finds
BMJ. June 28, 2014: 348:g4319
Orthopaedic surgeons: as strong as an ox and almost twice as clever? Multicentre prospective comparative study
BMJ. December 17, 2011: 343:d7506
Abstract
Objective
To compare the intelligence and grip strength of orthopaedic surgeons and anaesthetists.
Design
Multicentre prospective comparative study.
Setting
Three UK district general hospitals in 2011.
Participants
36 male orthopaedic surgeons and 40 male anaesthetists at consultant or specialist registrar grade.
Main outcome measures
Intelligence test score and dominant hand grip strength.
Results
Orthopaedic surgeons had a statistically significantly greater mean grip strength (47.25 (SD 6.95) kg) than anaesthetists (43.83 (7.57) kg). The mean intelligence test score of orthopaedic surgeons was also statistically significantly greater at 105.19 (10.85) compared with 98.38 (14.45) for anaesthetists.
Conclusions
Male orthopaedic surgeons have greater intelligence and grip strength than their male anaesthetic colleagues, who should find new ways to make fun of their orthopaedic friends.
