Multimodal emotion recognition for mental health monitoring using Audio and Text

  • Unique Paper ID: 195568
  • Volume: 12
  • Issue: 11
  • PageNo: 1888-1895
  • Abstract:
  • Mental health is central to overall well-being, yet its assessment often relies on subjective self-reporting, which may not fully capture a person’s emotional state. This project proposes a Multimodal Emotion Recognition System that tracks and analyzes human emotions from text and audio inputs to support mental health assessment. The proposed model combines speech signal processing and natural language processing (NLP) techniques to extract acoustic and linguistic features such as tone, pitch, sentiment, and contextual meaning. Deep learning architectures, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), fuse these multimodal features to produce more accurate emotion classification than unimodal methods. The model is trained and evaluated on benchmark emotion datasets to ensure robustness and generalization. The recognized emotions are then analyzed for possible indicators of stress, anxiety, or depression, providing valuable information for early mental health intervention. This study highlights the importance of integrating audio prosody and textual semantics to build intelligent, real-time systems that support emotional understanding and mental health monitoring.
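
Illustrative Sketch

The abstract describes a fusion pipeline in which a CNN encodes acoustic features and an RNN encodes text before the two representations are combined for emotion classification. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; every layer size and dimension (the MFCC input shape, vocabulary size, and six emotion classes) is an illustrative assumption.

# Minimal audio-text fusion sketch: a CNN encodes an MFCC spectrogram,
# an LSTM encodes token embeddings, and the concatenated features feed
# a linear emotion classifier. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128,
                 num_emotions=6):
        super().__init__()
        # Audio branch: 2-D CNN over a (1 x n_mfcc x time) spectrogram.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed size for any clip length
        )
        self.audio_fc = nn.Linear(32 * 4 * 4, hidden_dim)
        # Text branch: embedding + LSTM over token ids.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Late fusion: concatenate both modality vectors, then classify.
        self.classifier = nn.Linear(hidden_dim * 2, num_emotions)

    def forward(self, mfcc, token_ids):
        # mfcc: (batch, 1, n_mfcc, time); token_ids: (batch, seq_len)
        a = torch.relu(self.audio_fc(self.audio_cnn(mfcc).flatten(1)))
        _, (h, _) = self.text_rnn(self.embedding(token_ids))
        fused = torch.cat([a, h[-1]], dim=1)  # audio vector + last text state
        return self.classifier(fused)         # emotion logits

model = MultimodalEmotionClassifier()
logits = model(torch.randn(2, 1, 40, 200),         # 40 MFCCs, 200 frames
               torch.randint(1, 10000, (2, 30)))   # 30 token ids per utterance
print(logits.shape)  # torch.Size([2, 6])

In practice the MFCC tensor could come from a library such as librosa and the token ids from any standard tokenizer; concatenation is only the simplest late-fusion baseline, and attention-based fusion is a common alternative.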

Copyright & License

Copyright © 2026. The authors retain the copyright of this article. It is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{195568,
  author   = {Paka Sowmya and Ramavath Akhila and P Kushal and Dr Sreenivasulu},
  title    = {Multimodal emotion recognition for mental health monitoring using Audio and Text},
  journal  = {International Journal of Innovative Research in Technology},
  year     = {2026},
  month    = {April},
  volume   = {12},
  number   = {11},
  pages    = {1888--1895},
  issn     = {2349-6002},
  url      = {https://ijirt.org/article?manuscript=195568},
  abstract = {Mental health is central to overall well-being, yet its assessment often relies on subjective self-reporting, which may not fully capture a person’s emotional state. This project proposes a Multimodal Emotion Recognition System that tracks and analyzes human emotions from text and audio inputs to support mental health assessment. The proposed model combines speech signal processing and natural language processing (NLP) techniques to extract acoustic and linguistic features such as tone, pitch, sentiment, and contextual meaning. Deep learning architectures, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), fuse these multimodal features to produce more accurate emotion classification than unimodal methods. The model is trained and evaluated on benchmark emotion datasets to ensure robustness and generalization. The recognized emotions are then analyzed for possible indicators of stress, anxiety, or depression, providing valuable information for early mental health intervention. This study highlights the importance of integrating audio prosody and textual semantics to build intelligent, real-time systems that support emotional understanding and mental health monitoring.},
  keywords = {Multimodal Emotion Recognition, Mental Health Monitoring, Audio-Text Fusion, Deep Learning, Speech Emotion Recognition, Natural Language Processing (NLP)},
}

Cite This Article

Sowmya, P., Akhila, R., Kushal, P., & Sreenivasulu, D. (2026). Multimodal emotion recognition for mental health monitoring using Audio and Text. International Journal of Innovative Research in Technology (IJIRT), 12(11), 1888–1895.
