Presentation
AI-Assisted Medical Triage: Mitigating performance errors in human-AI emergency response training
Abstract
Objective: To quantify how a real-time multimodal language model affects human performance in emergency triage, identify the cognitive mechanisms underlying improvement and failure, and develop behavioral design features that promote human expertise whether AI automation is present or withdrawn.
Rationale: Up to a quarter of emergency cases in the U.S. (7 million per year) may be under-triaged (Tam et al., 2018). Many of these errors stem from limitations of human decision-making, such as cognitive load. Large language models (LLMs) have the potential to serve as decision support systems, lowering cognitive load and increasing protocol adherence by exploiting the structured algorithms already in place in traditional triage (e.g., the START protocol). However, poorly designed LLMs can introduce new sources of error. Paradoxically, as AI performance increases, human performance decreases: when AI seems reliable, humans offload more and cease to monitor for errors (Lee et al., 2025). This creates a need for cognitive science research on the factors likely to moderate human performance in these human-AI collaborations.
Innovation: Using retrieval-augmented generation, we developed a real-time language model built on a state-of-the-art multimodal engine (GPT with vision and speech) that observes trainees in a disaster triage computer simulator under time pressure, encodes patient queries, and returns recommendations with justifications aligned to the START/JUMPSTART protocols. Decision-relevant features are experimentally controlled, such as whether the model requires a human judgment before revealing its recommendation (i.e., a pre-commitment requirement that probes for confirmation bias and anchoring effects).
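As a minimal sketch of the two ideas in this paragraph, the code below encodes the standard adult START decision sequence and a pre-commitment gate that withholds the recommendation until the trainee has entered a judgment. The `Patient` fields, the category labels, and the function names are illustrative assumptions, not the study's implementation; the actual system is LLM-based and is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # Hypothetical field names, chosen only for this illustration.
    can_walk: bool
    breathing: bool
    breathes_after_airway_opened: bool
    resp_rate: int          # breaths per minute
    radial_pulse: bool
    cap_refill_s: float     # capillary refill time in seconds
    obeys_commands: bool


def start_category(p: Patient) -> str:
    """Walk the adult START steps in order and return a triage category."""
    if p.can_walk:
        return "MINOR"
    if not p.breathing:
        # Reposition the airway; spontaneous breathing decides the tag.
        return "IMMEDIATE" if p.breathes_after_airway_opened else "EXPECTANT"
    if p.resp_rate > 30:
        return "IMMEDIATE"
    if not p.radial_pulse or p.cap_refill_s > 2.0:
        return "IMMEDIATE"
    if not p.obeys_commands:
        return "IMMEDIATE"
    return "DELAYED"


def reveal_recommendation(
    p: Patient, user_judgment: Optional[str], require_precommit: bool
) -> Optional[str]:
    """Return the recommendation, or None while a pre-commitment is pending."""
    if require_precommit and user_judgment is None:
        return None
    return start_category(p)
```

With `require_precommit=True`, the trainee's own classification is captured before any advice is shown, so later agreement with the AI can be separated from anchoring on it.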
Experiment: N = 120 EMT trainees at the GSU School of Nursing perform six 7-minute sessions of 60 Seconds to Survival (12 virtual patients per session). Participants are randomly assigned to AI assistance or a no-AI control. For the assisted group, Sessions 1–2 (practice trials) present AI recommendations before soliciting a user judgment (no pre-commitment by the user). In Sessions 3–4, this constraint is removed. In Sessions 5–6, AI assistance is withdrawn without warning under a simulated freeze. We measure patient-level triage classification accuracy, over- vs. under-triage rates, response time, and adherence to START/JUMPSTART steps for assessment and treatment. Human-AI conversations are recorded and subjected to linguistic analysis of the strategies taken, indications of cognitive offloading, and compliance. Individual-difference moderators will be evaluated: memory recall for case decisions, digital literacy, task self-efficacy, and trust in automation.
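The accuracy and over- vs. under-triage measures can be illustrated with a short scoring sketch. It assumes an ordinal acuity scale (MINOR < DELAYED < IMMEDIATE) and files any mismatch involving a category outside that scale, such as EXPECTANT, under a separate "other" bucket; the labels, the scale, and the function name are assumptions made for this example, not the study's scoring rules.

```python
# Assumed ordinal acuity scale; higher means more urgent.
ACUITY = {"MINOR": 0, "DELAYED": 1, "IMMEDIATE": 2}


def triage_rates(assigned, reference):
    """Return the proportion of correct, over-, under-, and other-triaged patients."""
    if not assigned or len(assigned) != len(reference):
        raise ValueError("assigned and reference must be non-empty and equal in length")
    counts = {"correct": 0, "over": 0, "under": 0, "other": 0}
    for a, r in zip(assigned, reference):
        if a == r:
            counts["correct"] += 1
        elif a in ACUITY and r in ACUITY:
            # Over-triage: assigned acuity is higher than the reference acuity.
            counts["over" if ACUITY[a] > ACUITY[r] else "under"] += 1
        else:
            # Mismatch involving a category outside the ordinal scale.
            counts["other"] += 1
    n = len(assigned)
    return {name: count / n for name, count in counts.items()}
```

Reporting over- and under-triage separately matters because the two errors carry different costs: over-triage consumes scarce resources, while under-triage delays care for patients who need it most.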
Predictions: With active AI assistance, trainees will classify patients more quickly and accurately due to effective offloading. However, when AI assistance is withdrawn, performance will drop below baseline, consistent with overreliance. In addition, the extent of offloading will predict performance (e.g., users who pre-commit a judgment before querying the AI’s recommendation will have higher total scores).
Analytic plan: Linear mixed-effects models estimate Group × Session interactions with random intercepts for participant and scenario. Confirmatory contrasts test two effects: (1) the AI group outperforms controls during Sessions 3–4; (2) the control group outperforms the AI group during Sessions 5–6 (AI withdrawn). Thematic analysis of human–AI dialogue characterizes users' prompting strategies (e.g., storage requests vs. diagnosis seeking). Logistic regression and AUC analyses predict patient-level performance from AI triage accuracy, user prompting strategies, and offloading and compliance measures. Cox regression models time to correct classification. Moderation tests assess whether higher trust yields larger gains with AI but larger withdrawal costs.
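One way to write the mixed-effects model and the two confirmatory contrasts is sketched below. The notation is an assumption introduced for illustration: y is a performance score for participant i on scenario j in session s, G_i = 1 for the AI-assisted group and 0 for controls, and Session 1 is the reference level, so that β_1 = γ_1 = 0.

```latex
y_{ijs} = \beta_0 + \beta_G\,G_i + \beta_s + \gamma_s\,G_i + u_i + v_j + \varepsilon_{ijs},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijs} \sim \mathcal{N}(0, \sigma^2)

% AI-minus-control difference in session s: \beta_G + \gamma_s
\text{(1)}\quad \beta_G + \tfrac{1}{2}(\gamma_3 + \gamma_4) > 0
\qquad
\text{(2)}\quad \beta_G + \tfrac{1}{2}(\gamma_5 + \gamma_6) < 0
```

Here u_i and v_j are the random intercepts for participant and scenario, and each contrast averages the group difference over the two sessions it covers.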
Safety and ethics: An expert panel spanning emergency medicine, ethics, AI, and human factors informs risk analysis and mitigations. Anticipated risks include hallucinations, biased recommendations, and operator over-trust. Mitigations may include strict protocol grounding, confidence display, conservative thresholds, forced justification prompts, graded load exposure, fault injection, and routine manual-only practice to maintain skill. Policy outputs will specify accountability boundaries, documentation standards for AI-influenced decisions, and training requirements that mix AI-on and AI-off practice.
Contributions to cognitive science:
• Causal estimates of how automation changes attention, memory, and decision thresholds in a time-critical domain.
• Process measures that connect observable behaviors (speed, accuracy, compliance with advice) to relevant outcomes (over/under-triage).
• Evidence on skill retention and rebound following assistance withdrawal, advancing theories of cognitive offloading and deskilling.
• Validated behavioral levers (e.g., pre-commitment, confidence cues, graded information) that reduce overreliance while preserving benefit.
Implications: The work yields experimentally grounded guidance for training programs on when and how to train with the LLM assistant, how to present advice to sustain monitoring, and how to measure overreliance. By centering mechanisms of human cognition, the project clarifies not only whether AI helps in triage, but also how and under what conditions it changes trainees' vigilance and performance over time.
Event Type
Oral Presentations
Time: Monday, March 23, 2:00pm - 2:30pm EDT
Location: Morgan
Simulation and Education
