Multimodal Speech Emotion Recognition in Patient-Clinician Interactions: Sentiment Analysis Leveraging Transformer Models

Md Jobair Hossain Faruk; Md Kamrul Siam; Rafia Akter Romana; Sharif Ullah; Hossain Shahriar

doi:10.1109/STI69347.2025.11367613

Conference proceeding

Peer reviewed

2025 IEEE 7th International Conference on Sustainable Technologies For Industry 5.0 (STI)

International Conference on Sustainable Technologies For Industry 5.0 (STI), 7th (Dhaka, Bangladesh, 12/11/2025–12/12/2025)

12/11/2025

DOI: https://doi.org/10.1109/STI69347.2025.11367613

Abstract

In recent years, understanding the emotional dynamics of patient-clinician interactions has emerged as a critical topic in healthcare research. Speech Emotion Recognition (SER) provides critical insights to enhance patient care, diagnostic precision, and therapeutic effectiveness. In this paper, we present a text-based framework for Speech Emotion Recognition specifically designed for healthcare scenarios, integrating advanced transformer-based models including T5, BERT, and XLNet. Our proposed framework analyzes transcribed textual data, enabling the identification of potential emotions expressed by patients and healthcare providers. Audio recordings from interactions between patients and clinicians, including doctors and psychiatrists, are transcribed using the Whisper model, ensuring high transcription quality. We evaluated the framework's performance on a dataset comprising clinical conversations capturing a variety of emotional expressions relevant to healthcare contexts. Our experimental results demonstrate that our framework predicts six primary emotional states, including Happiness, Anger, Fear, Sadness, and Surprise, as well as distinguishing between positive and negative sentiments. Among the evaluated models, T5 exhibited the highest mean confidence score at 89.12%, significantly outperforming RoBERTa (78.44%) and XLNet (36.02%) in capturing emotional content from clinical dialogues. These findings highlight the potential of SER to aid healthcare professionals by providing deeper insights into patients' emotional states, supporting communication, and improving understanding of patients' sentiment.
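
The abstract reports a "mean confidence score" per model but does not define it. As an illustration only, the sketch below assumes it is the top softmax probability of the emotion classifier, averaged over transcribed utterances. The logits are hypothetical stand-ins for a model's output, the function names are invented for this example, and the label list holds only the five emotions the abstract names. This is not the authors' implementation.

```python
import math
from statistics import mean

# Emotion labels named in the abstract (it mentions six states but lists these five).
LABELS = ["Happiness", "Anger", "Fear", "Sadness", "Surprise"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (label, confidence); confidence is the top softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def mean_confidence(batch_logits):
    """Average the top-class probability over all utterances (assumed definition)."""
    return mean(predict(logits)[1] for logits in batch_logits)


# Hypothetical per-utterance logits, standing in for a classifier's output.
example_logits = [
    [2.0, 0.1, 0.3, 0.2, 0.1],
    [0.2, 0.1, 3.0, 0.5, 0.4],
]
```

Under this assumed definition, a high mean reflects how peaked a model's predictions are, not how often they are correct, so it is not interchangeable with accuracy against labeled data.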

Details

Title: Multimodal Speech Emotion Recognition in Patient-Clinician Interactions
Publication Details: 2025 IEEE 7th International Conference on Sustainable Technologies For Industry 5.0 (STI)
Resource Type: Conference proceeding
Conference: International Conference on Sustainable Technologies For Industry 5.0 (STI), 7th (Dhaka, Bangladesh, 12/11/2025–12/12/2025)
Publisher: IEEE
Number of pages: 6
Identifiers: 99381798346406600
Academic Unit: Center for Cybersecurity and AI; Hal Marcus College of Science and Engineering
Language: English
