TERT-Ensemble: A Two-Stage Fusion Approach for Tri-modal Emotion Recognition

  • Unique Paper ID: 178452
  • Pages: 5118–5126
  • Abstract: Automated emotion recognition is essential for advancing human-computer interaction and affective computing. While unimodal systems using image, text, or audio offer valuable insights, they often struggle with the inherent ambiguity and complexity of human emotional expression. Multimodal approaches promise enhanced robustness by integrating complementary information from diverse sources. This paper introduces 'TERT-Ensemble', a technique employing a two-stage fusion process for tri-modal emotion recognition. Initially, we establish strong unimodal predictors by systematically evaluating diverse deep learning architectures (CNNs, ViT, Transformers, LSTM) for each modality (image, text, audio) on combined benchmark datasets (CK+, FER-2013, RAF-DB; Twitter datasets; CREMA-D, RAVDESS, SAVEE). Optimal performance within each modality was achieved through intra-modal ensembles, yielding weighted F1-scores of 77.98% (Image), 73.69% (Text), and 66.43% (Audio) on their respective test sets. Subsequently, a tri-modal late fusion model was implemented, combining the outputs of these intra-modal ensembles via weighted probability averaging. Evaluated on a simulated tri-modal test set designed to ensure congruent emotional labels across modalities, this final fusion model achieved high performance, with an accuracy of 95.56% and a weighted F1-score of 0.96. We detail the data preprocessing, model architectures, training protocols, and the two-stage fusion strategy implemented within Kaggle notebooks. An interactive demonstration interface showcasing the unimodal components is also presented. While acknowledging the limitations inherent in using simulated data for the final fusion evaluation, these results validate the effectiveness of the intra-modal ensembles and highlight the significant potential of the proposed staged fusion approach for robust tri-modal emotion recognition.
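The second fusion stage summarized in the abstract, weighted probability averaging over the outputs of the three intra-modal ensembles, can be sketched in a few lines of Python. The sketch below is illustrative only: the seven-emotion label set and the modality weights are assumptions for demonstration, not values taken from the paper.

import numpy as np

# Illustrative label set; the paper's actual classes may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def fuse_probabilities(p_image, p_text, p_audio, weights=(0.4, 0.35, 0.25)):
    """Late fusion by weighted averaging of per-modality class probabilities.

    Each p_* is a length-C array of class probabilities from the
    corresponding intra-modal ensemble; `weights` (assumed values here)
    should sum to 1.
    """
    probs = np.stack([p_image, p_text, p_audio])  # shape (3, C)
    w = np.asarray(weights)[:, None]              # shape (3, 1)
    fused = (w * probs).sum(axis=0)               # weighted average per class
    return fused / fused.sum()                    # renormalise for safety

# Usage with three hypothetical softmax outputs for one sample:
rng = np.random.default_rng(0)
raw = rng.random((3, len(EMOTIONS)))
p_img, p_txt, p_aud = (r / r.sum() for r in raw)
fused = fuse_probabilities(p_img, p_txt, p_aud)
print(EMOTIONS[int(np.argmax(fused))])

In this scheme, the per-modality weights encode how much each intra-modal ensemble is trusted; in practice they would be tuned on a validation set rather than fixed as above.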

Copyright & License

Copyright © 2025. Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{178452,
        author = {Penumuchu Nihith and V Lingeshwaran},
        title = {TERT-Ensemble: A Two-Stage Fusion Approach for Tri-modal Emotion Recognition},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {11},
        number = {12},
        pages = {5118-5126},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=178452},
        abstract = {Automated emotion recognition is essential for advancing human-computer interaction and affective computing. While unimodal systems using image, text, or audio offer valuable insights, they often struggle with the inherent ambiguity and complexity of human emotional expression. Multimodal approaches promise enhanced robustness by integrating complementary information from diverse sources. This paper introduces 'TERT-Ensemble', a technique employing a two-stage fusion process for tri-modal emotion recognition. Initially, we establish strong unimodal predictors by systematically evaluating diverse deep learning architectures (CNNs, ViT, Transformers, LSTM) for each modality (image, text, audio) on combined benchmark datasets (CK+, FER-2013, RAF-DB; Twitter datasets; CREMA-D, RAVDESS, SAVEE). Optimal performance within each modality was achieved through intra-modal ensembles, yielding weighted F1-scores of 77.98% (Image), 73.69% (Text), and 66.43% (Audio) on their respective test sets. Subsequently, a tri-modal late fusion model was implemented, combining the outputs of these intra-modal ensembles via weighted probability averaging. Evaluated on a simulated tri-modal test set designed to ensure congruent emotional labels across modalities, this final fusion model achieved high performance, with an accuracy of 95.56% and a weighted F1-score of 0.96. We detail the data preprocessing, model architectures, training protocols, and the two-stage fusion strategy implemented within Kaggle notebooks. An interactive demonstration interface showcasing the unimodal components is also presented. While acknowledging the limitations inherent in using simulated data for the final fusion evaluation, these results validate the effectiveness of the intra-modal ensembles and highlight the significant potential of the proposed staged fusion approach for robust tri-modal emotion recognition.},
        keywords = {Emotion Recognition, Multimodal Learning, Late Fusion, Ensemble Learning, Deep Learning, Affective Computing, Image Recognition, Text Classification, Speech Emotion Recognition, Convolutional Neural Networks (CNN), Vision Transformer (ViT), BERT, DeBERTa, LSTM, EfficientNet, Gradio.},
        month = {May},
        }

Cite This Article

Nihith, P., & Lingeshwaran, V. (2025). TERT-Ensemble: A Two-Stage Fusion Approach for Tri-modal Emotion Recognition. International Journal of Innovative Research in Technology (IJIRT), 11(12), 5118–5126.
