Presentation

Context at Scale: Integrating Quantitative and LLM-Based Qualitative Analysis of Patient Feedback to Explain Patient Satisfaction
Description
Hospitals collect thousands of free-text patient comments each year, yet transforming those narratives into reliable, actionable insights and linking them with measurable outcomes remains a major challenge. Patient experience surveys typically provide numeric satisfaction scores, while open-text comments capture the “why” behind those numbers. However, qualitative analysis of patient comments is constrained by the scale of these datasets, while quantitative feedback data on its own lacks qualitative context, which limits its interpretability. The potential of patient feedback data for learning and continuous quality improvement is, therefore, underutilized. To address these limitations, this research develops a mixed-methods approach that scales qualitative insight using large language models (LLMs) and integrates it with quantitative predictors of patient experience. Using 7,071 emergency department (ED) feedback reports from a large hospital system in the southeastern United States, each paired with a Likelihood-to-Recommend (LTR) score and patient-encounter metadata, we demonstrate how human–AI teaming and statistical modeling can help identify both systemic and patient-level drivers of satisfaction.

BACKGROUND & MOTIVATION

Prior research has used complaint audits, sentiment analysis, and topic modeling, but these approaches either lack scalability or fail to connect themes with measurable process metadata and outcomes. LLMs, with their ability to interpret text in a human-like way, make it possible to classify patient narratives at scale. However, ensuring reliability and aligning outputs with quantitative data remain ongoing challenges. Our study addresses these gaps by applying prompt engineering techniques to improve performance, validating LLM-generated labels against human-coded benchmarks, and then linking these reliable classifications with regression-based outcome modeling. This approach shows how narrative insights can directly inform hospital performance improvement.

METHODOLOGY

We analyzed 7,071 ED feedback reports. Each record included an LTR score (0–10) plus metadata (age, gender, race/ethnicity, acuity, length of stay, overcrowding score on arrival, and survey mode). Text comments were de-identified before analysis. This dataset paired structured, quantitative measures with unstructured patient narratives, enabling multi-level analysis.

Qualitative Data Analysis: We used a two-step approach to scaling analysis of patient narratives:
(i) Labeling by human coders: Two human analysts independently labeled 200 comments using an inductive-deductive approach. Open-ended themes were inductively identified within three specific categories: Risks/Challenges, Actions/Strategies and Facilitators. Through an iterative process of reconciliation, near-perfect reliability (Cohen’s κ ≈ 0.9 - 1.0 across most categories) was achieved.
(ii) LLM-based labeling: An LLM (Llama-3-70B) was then prompt-tuned using few-shot examples and human-in-the-loop feedback, reaching >98% agreement with human-generated labels. This validated model then scaled labeling to the full dataset of 7,071 comments.
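
The two reliability figures in the steps above (Cohen’s κ between human coders, and percent agreement between LLM and human labels) can be illustrated with a minimal sketch. The function names and the ten example labels are hypothetical and are not taken from the study data; the abstract does not state which software was used for these calculations.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: 1 = theme present in the comment, 0 = absent.
coder_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]

print(percent_agreement(coder_1, coder_2))
print(cohens_kappa(coder_1, coder_2))
```

The same `percent_agreement` function applies to the comparison of LLM-generated and human-generated labels on the benchmark comments. Kappa is the stricter of the two measures because it discounts agreement that would occur by chance when a theme is rare.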

Quantitative Data Analysis: Each patient feedback report included metadata related to the patient (e.g. age, acuity) and their ED encounter (e.g. length of stay, wait time, ED overcrowding score), and the patient-provided Likelihood-to-Recommend (LTR) score. We applied ordinal logistic regression to test whether the metadata variables predicted LTR score. Significance testing was conducted using coefficient estimates, standard errors, z-values, and p-values to identify which predictors had meaningful effects on patient satisfaction.
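
To make the regression step concrete, the sketch below assumes the common proportional-odds (cumulative logit) form of ordinal logistic regression, logit P(Y ≤ k) = θ_k − xβ; the abstract does not state the exact parameterization. It shows how a linear predictor shifts probability between ordered outcome levels and how a Wald z-value and two-sided p-value follow from a coefficient and its standard error. The thresholds, coefficient, and standard error are illustrative values, not estimates from the study.

```python
import math


def logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(thresholds, eta):
    """P(Y = k) for ordered levels under logit P(Y <= k) = theta_k - eta.

    thresholds: increasing cut-points theta_1 < ... < theta_{K-1}
    eta: linear predictor x * beta (positive values favor higher levels)
    """
    cumulative = [logistic(theta - eta) for theta in thresholds] + [1.0]
    return [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]


def wald_test(beta, std_err):
    """Wald z statistic and two-sided normal p-value for one coefficient."""
    z = beta / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Illustrative only: three ordered levels (low, mid, high) and a
# hypothetical coefficient of -0.5 per unit of a predictor.
thresholds = [-1.0, 0.0]
baseline = category_probs(thresholds, eta=0.0)
two_units_higher = category_probs(thresholds, eta=-0.5 * 2)

print(baseline)          # probabilities for low, mid, high at baseline
print(two_units_higher)  # mass shifts toward the low level
print(wald_test(-0.5, 0.1))
```

With a negative coefficient, increasing the predictor moves probability mass from the highest outcome level toward the lowest, which is the pattern the significance tests are meant to detect.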

Blended Analysis: We examined how LLM-generated risk, strategy, and facilitator themes varied across waiting time categories (e.g., 0–0.5 hr, 1–2 hr, ≥4 hr) and LTR score ranges (0–3, 4–7, and 8–10). Heatmaps and distribution plots were generated to show how the prevalence of themes shifted under different encounter conditions, such as longer wait times or higher versus lower patient ratings. This approach allowed us to capture not only which themes were predictive of LTR, but also how their frequency changed across different waiting time contexts and rating levels. We further benchmarked classifier performance for risks, strategies, and facilitators using ROC AUC, precision, recall, and F1 metrics.
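
The prevalence comparison described above amounts to a grouped proportion: for each combination of waiting time category and LTR range, the share of comments carrying a given theme. The sketch below shows one way to compute the cells of such a heatmap. The record layout, field names, and example comments are hypothetical; only the three LTR ranges come from the text.

```python
from collections import defaultdict


def ltr_band(score):
    """Map a 0-10 LTR score to the three ranges used in the analysis."""
    if not 0 <= score <= 10:
        raise ValueError("LTR score must be between 0 and 10")
    if score <= 3:
        return "0-3"
    if score <= 7:
        return "4-7"
    return "8-10"


def theme_prevalence(records, theme):
    """Share of comments mentioning `theme` per (wait category, LTR band)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        key = (rec["wait"], ltr_band(rec["ltr"]))
        totals[key] += 1
        if theme in rec["themes"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical labeled records (wait category is assumed precomputed).
records = [
    {"wait": ">=4 hr", "ltr": 2, "themes": {"lengthy process"}},
    {"wait": ">=4 hr", "ltr": 1, "themes": {"lengthy process", "unclean areas"}},
    {"wait": ">=4 hr", "ltr": 9, "themes": {"supportive staff"}},
    {"wait": "0-0.5 hr", "ltr": 10, "themes": {"timely assistance"}},
    {"wait": "0-0.5 hr", "ltr": 8, "themes": {"lengthy process"}},
]

prevalence = theme_prevalence(records, "lengthy process")
for cell, share in sorted(prevalence.items()):
    print(cell, share)
```

Each value in `prevalence` corresponds to one heatmap cell; repeating the call per theme yields the full grid.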

RESULTS

Of the 7,071 reports with valid LTR scores, high ratings (scores 8–10) were the most common (≈55%), followed by low ratings (0–3, ≈18%) and mid-range ratings (4–7, ≈16%). Higher overcrowding and longer stays in the waiting room strongly predicted lower LTR.

Themes: The most frequent challenges (risks) were lengthy or complicated processes (≈37.1%) and unsatisfactory medical treatment (≈36.8%), followed by patient treatment issues (≈35.2%), unclean areas (≈34.2%), and communication barriers (≈30.6%), among others. While wait time (≈15.9%) appeared often, it was less dominant than process, treatment, and environment concerns. In the Actions category, patients emphasized strategies such as situation management (≈54.6%) and timely assistance (≈45.0%), along with resource mobilization, empathy and caring, care coordination, and staff communication. These were especially common in high-rating narratives. Key facilitators included comfort and environment (≈32.6%), supportive staff (≈25.7%), and positive treatment outcomes (≈23.9%). Overall, operational breakdowns drove dissatisfaction, whereas interpersonal support most often shifted experiences toward the positive.

Regression analysis confirmed that text-derived themes were strong predictors of Likelihood-to-Recommend (LTR) independent of encounter metadata. Mentions of lengthy or complicated processes (β ≈ −3.70, odds ratio (OR) ≈ 0.02), unsatisfactory medical treatment (β ≈ −3.63, OR ≈ 0.03), patient treatment issues (β ≈ −3.40, OR ≈ 0.03), and unclean areas (β ≈ −3.34, OR ≈ 0.04) significantly reduced satisfaction (p < 0.001). In contrast, positive strategies such as situation management (β ≈ +2.37, OR ≈ 10.7), timely assistance (β ≈ +1.80, OR ≈ 6.0), resource mobilization (β ≈ +1.62, OR ≈ 5.0), care coordination (β ≈ +1.51, OR ≈ 4.5), and empathy/caring in interactions (β ≈ +1.48, OR ≈ 4.4) were strong predictors of higher LTR, as were facilitators including comfort and environment (β ≈ +2.37, OR ≈ 10.7), supportive staff (β ≈ +2.02, OR ≈ 7.6), and positive treatment outcomes (β ≈ +2.08, OR ≈ 8.0).
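
The coefficients and odds ratios reported above are related by OR = exp(β), the standard relationship in logistic models. The short sketch below recomputes odds ratios from a few of the reported coefficients; the function name is hypothetical. Because the reported β values are rounded, a recomputed OR can differ from the reported one in its last digit.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


# Coefficients as reported in the abstract (rounded values).
reported_betas = {
    "lengthy or complicated processes": -3.70,
    "unclean areas": -3.34,
    "situation management": 2.37,
    "timely assistance": 1.80,
    "positive treatment outcomes": 2.08,
}

for theme, beta in reported_betas.items():
    print(f"{theme}: beta = {beta:+.2f}, OR = {odds_ratio(beta):.2f}")
```

An OR near 0.02 means that, holding other predictors fixed, the odds of a higher rating are roughly one fiftieth as large when the theme is mentioned; an OR near 10.7 means the odds are roughly ten times as large.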

The common themes analysis showed that low ratings (LTR 0–3) most often described operational failures such as lengthy processes, poor treatment, and unclean environments, particularly for waiting times over four hours. Mid-range ratings (LTR 4–7) reflected a mix of risks and coping strategies, while high ratings (LTR 8–10) emphasized facilitators like supportive staff, comfort, and positive outcomes, with effective strategies concentrated in shorter waits. Overall, dissatisfaction clustered around operational failures, while interpersonal support and positive environments drove higher satisfaction. Finally, predictive modeling using logistic regression classifiers demonstrated strong performance (ROC AUC ≈ 0.85–0.92 across risks, strategies, and facilitators). Although precision varied due to class imbalance, the best-performing models reliably identified narratives about wait time, empathy, and supportive staff, highlighting the feasibility of using automated theme detection for real-time monitoring.
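
The observation that ROC AUC can be strong while precision varies under class imbalance can be illustrated with a small sketch. It computes ROC AUC as the share of positive/negative pairs ranked correctly, and precision, recall, and F1 at a fixed threshold. The labels, scores, and the 0.5 threshold are hypothetical and do not come from the study's classifiers.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = theme present)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical rare theme: 2 positives among 10 comments.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.65, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(roc_auc(y_true, scores))
print(precision_recall_f1(y_true, y_pred))
```

In this toy case the ranking is good (most negatives score below both positives), yet half of the flagged comments are false positives, because even a few high-scoring negatives outnumber the rare true positives.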

LIMITATIONS & FUTURE WORK

Our initial quality checks of LLM-generated themes revealed some missing representations in the full dataset, highlighting the need for more systematic validation. Moving forward, we will use larger-scale audits, refined prompting strategies, and cross-model comparisons to improve the accuracy, reliability, and confidence in applying LLM results in healthcare.

CONCLUSION

This study shows that combining LLM-powered thematic coding with quantitative modeling yields a deeper understanding of patient experience than either method alone. Across 7,000+ comments, lengthy or complicated processes consistently emerged as the most common challenge, yet staff empathy and attentiveness often offset their negative impact, driving higher satisfaction even in overcrowded settings. By validating human–AI teaming for scalable, reliable coding, the study highlights how qualitative signals such as supportive staff, comfort, cleanliness, and positive treatment outcomes add explanatory power beyond structured data. Overall, this mixed-methods approach demonstrates how human–AI labeling and statistical modeling can generate scalable, outcome-linked insights from patient narratives. The contributions are both methodological and practical: demonstrating that text-derived themes enhance predictive models of satisfaction and offering hospitals a framework to continuously monitor patient feedback, connect narratives with outcomes, and target improvements that balance operational efficiency with compassionate care.
Event Type
Oral Presentations
Time
Tuesday, March 24, 11:37am - 12:00pm EDT
Location
Murray Hill East
Tracks
Patient Safety Research and Initiatives