In recent years, understanding the emotional dynamics of patient-clinician interactions has emerged as a critical topic in healthcare research. Speech Emotion Recognition (SER) provides insights that can enhance patient care, diagnostic precision, and therapeutic effectiveness. In this paper, we present a text-based framework for Speech Emotion Recognition specifically designed for healthcare scenarios, integrating advanced transformer-based models including T5, BERT, and XLNet. Our proposed framework analyzes transcribed textual data, enabling the identification of potential emotions expressed by patients and healthcare providers. Audio recordings from interactions between patients and clinicians (including doctors and psychiatrists) are transcribed using the Whisper model, ensuring high transcription quality. We evaluated the framework's performance on a dataset of clinical conversations capturing a variety of emotional expressions relevant to healthcare contexts. Our experimental results demonstrate that our framework predicts six primary emotional states, including Happiness, Anger, Fear, Sadness, and Surprise, and also distinguishes between positive and negative sentiments. Among the evaluated models, T5 exhibited the highest mean confidence score at 89.12%, significantly outperforming RoBERTa (78.44%) and XLNet (36.02%) in capturing emotional content from clinical dialogues. These findings highlight the potential of SER to aid healthcare professionals by providing deeper insights into patients' emotional states, supporting communication, and improving understanding of patients' sentiment.
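The abstract reports a mean confidence score per model but does not say how it is computed. As a minimal sketch, assuming confidence means the top softmax probability of each utterance's prediction, averaged over all utterances, the calculation could look like the following. The function names and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(batch_logits):
    """Average the top-class probability across utterances.

    Assumption: "confidence" is the probability of the predicted class.
    The paper may define it differently.
    """
    if not batch_logits:
        raise ValueError("batch_logits must contain at least one utterance")
    top_probs = [max(softmax(logits)) for logits in batch_logits]
    return sum(top_probs) / len(top_probs)


# Hypothetical logits for two transcribed utterances.
example = [
    [0.0, 0.0],            # two equally likely classes -> top probability 0.5
    [0.0, 0.0, 0.0, 0.0],  # four equally likely classes -> top probability 0.25
]
score = mean_confidence(example)
```

Under this assumption, a score such as 89.12% would indicate how peaked a model's predictions are on average. It is not a measure of accuracy against ground-truth labels.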
Details
Title
Multimodal Speech Emotion Recognition in Patient-Clinician Interactions
Publication Details
2025 IEEE 7th International Conference on Sustainable Technologies For Industry 5.0 (STI)
Resource Type
Conference proceeding
Conference
International Conference on Sustainable Technologies For Industry 5.0 (STI), 7th (Dhaka, Bangladesh, 12/11/2025–12/12/2025)