FakeVoiceGuard: A Hybrid ResNeXt–BiGRU–Transformer Framework for Robust Deepfake Audio Detection Using ASVspoof Datasets

  • Unique Paper ID: 193598
  • Volume: 12
  • Issue: 10
  • PageNo: 2165-2172
  • Abstract:
  • The rise of deepfake audio technologies capable of producing highly realistic synthetic voices poses challenges for digital forensics, biometric authentication, and media integrity. Detecting such spoofed voices calls for models that can jointly capture the spatial, temporal, and contextual characteristics of speech signals. This work introduces FakeVoiceGuard, a hybrid deep learning model that combines ResNeXt, a Bidirectional Gated Recurrent Unit (Bi-GRU), and Transformer-based classification to identify artificial or spoofed voices with high accuracy. In the proposed method, input audio samples are converted into log-mel spectrograms to capture their time-frequency characteristics. A ResNeXt encoder extracts deep spectral features through grouped residual convolutions, a Bi-GRU layer models the bidirectional temporal dependencies of the speech, and a Transformer unit with multi-head self-attention identifies the temporal segments most informative for classification. The model is trained and evaluated on the ASVspoof 2019 (Logical and Physical Access) and ASVspoof 2021 (LA, PA, and Deepfake) datasets, which cover a variety of spoofing attacks including text-to-speech synthesis, voice conversion, and replay audio. Experimental results show that FakeVoiceGuard is both accurate and robust, reducing the Equal Error Rate (EER) by a significant margin relative to conventional CNN and RNN baselines. The combination of hierarchical feature extraction, bidirectional temporal learning, and contextual attention allows the system to generalize well to unseen spoofing methods. This hybrid architecture therefore represents a step forward in deepfake voice detection, contributing to the safety and trustworthiness of audio communication.
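The log-mel spectrogram front end mentioned in the abstract can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation; the parameters here (16 kHz sampling rate, 512-point FFT, 160-sample hop, 64 mel bands) are assumed values chosen for the example only.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=160, n_mels=64):
    # Frame the waveform, apply a Hann window, take the power spectrum
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Project onto the mel filterbank and apply log compression
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)

# One second of a synthetic 440 Hz tone stands in for an utterance
t = np.arange(16000) / 16000.0
S = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(S.shape)  # frames x mel bands
```

The resulting (frames × mel bands) matrix is the kind of 2-D time-frequency input that a convolutional encoder such as ResNeXt would consume.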

Copyright & License

Copyright © 2026 Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{193598,
        author = {RANJITH R and VISWANATHAN C and TAMIL SELVAN B},
        title = {FakeVoiceGuard: A Hybrid ResNeXt–BiGRU–Transformer Framework for Robust Deepfake Audio Detection Using ASVspoof Datasets},
        journal = {International Journal of Innovative Research in Technology},
        year = {2026},
        volume = {12},
        number = {10},
        pages = {2165-2172},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=193598},
        abstract = {The rise of deepfake audio technologies capable of producing highly realistic synthetic voices poses challenges for digital forensics, biometric authentication, and media integrity. Detecting such spoofed voices calls for models that can jointly capture the spatial, temporal, and contextual characteristics of speech signals. This work introduces FakeVoiceGuard, a hybrid deep learning model that combines ResNeXt, a Bidirectional Gated Recurrent Unit (Bi-GRU), and Transformer-based classification to identify artificial or spoofed voices with high accuracy. In the proposed method, input audio samples are converted into log-mel spectrograms to capture their time-frequency characteristics. A ResNeXt encoder extracts deep spectral features through grouped residual convolutions, a Bi-GRU layer models the bidirectional temporal dependencies of the speech, and a Transformer unit with multi-head self-attention identifies the temporal segments most informative for classification. The model is trained and evaluated on the ASVspoof 2019 (Logical and Physical Access) and ASVspoof 2021 (LA, PA, and Deepfake) datasets, which cover a variety of spoofing attacks including text-to-speech synthesis, voice conversion, and replay audio. Experimental results show that FakeVoiceGuard is both accurate and robust, reducing the Equal Error Rate (EER) by a significant margin relative to conventional CNN and RNN baselines. The combination of hierarchical feature extraction, bidirectional temporal learning, and contextual attention allows the system to generalize well to unseen spoofing methods. This hybrid architecture therefore represents a step forward in deepfake voice detection, contributing to the safety and trustworthiness of audio communication.},
        keywords = {Deepfake Audio Detection, ResNeXt, BiGRU, Transformer, ASVspoof, Audio Forensics, Spoofing Attack Detection, Speech Security},
        month = {March},
        }

Cite This Article

R, R., C, V., & B, T. S. (2026). FakeVoiceGuard: A Hybrid ResNeXt–BiGRU–Transformer Framework for Robust Deepfake Audio Detection Using ASVspoof Datasets. International Journal of Innovative Research in Technology (IJIRT), 12(10), 2165–2172.
