Deepfake Forensics Using Ensemble of Convolutional Neural Networks and Vision Transformers

  • Unique Paper ID: 176212
  • PageNo: 5937-5943
  • Abstract:
  • Deepfake technology has advanced rapidly due to the development of generative models and artificial intelligence, making it possible to create highly realistic but fake videos and images. This presents serious risks to digital integrity, misinformation, and privacy. Conventional deepfake detection techniques, which frequently depend on hand-crafted features or single-model architectures, have shown limited robustness to the evolving synthetic media landscape. In this paper, we propose a novel deepfake forensic framework built on an ensemble architecture that combines the complementary strengths of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs). Whereas ViTs excel at modeling global dependencies and contextual anomalies in visual content, CNNs are best at capturing local, texture-based inconsistencies. To improve detection accuracy and generalizability across a variety of datasets and manipulation techniques, the proposed ensemble combines the predictive outputs of several CNN and ViT models. By demonstrating the potential of hybrid deep learning architectures for tackling the escalating problem of deepfake detection, this study opens the door to more dependable and robust multimedia forensics systems.
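The abstract describes combining the predictive outputs of several CNN and ViT models. The paper does not specify the fusion rule, but a common choice is soft voting: averaging each model's class probabilities. A minimal sketch, assuming two binary classifiers (real vs. fake) that emit per-sample logits; the model names and weights here are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(model_logits, weights=None):
    """Soft-voting ensemble: average per-model class probabilities.

    model_logits: list of (n_samples, n_classes) logit arrays,
    one per model (e.g. a CNN head and a ViT head).
    weights: optional per-model weights; defaults to a uniform average.
    """
    probs = np.stack([softmax(l) for l in model_logits])  # (n_models, n, c)
    if weights is None:
        weights = np.full(len(model_logits), 1.0 / len(model_logits))
    avg = np.tensordot(np.asarray(weights), probs, axes=1)  # (n, c)
    return avg.argmax(axis=-1), avg

# Hypothetical logits for 2 samples; classes: [real, fake]
cnn_logits = np.array([[2.0, -1.0], [-0.5, 1.5]])
vit_logits = np.array([[1.0, 0.0], [-1.0, 2.0]])
labels, probs = ensemble_predict([cnn_logits, vit_logits])
print(labels)  # → [0 1]
```

Weighted variants (e.g. giving the ViT more weight on high-resolution inputs) follow by passing a non-uniform `weights` vector; the averaged probabilities remain a valid distribution either way.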

Copyright & License

Copyright © 2026. The authors retain the copyright of this article. This is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{176212,
        author = {Jainam Joshi and Dr. Nilesh Parihar},
        title = {Deepfake Forensics Using Ensemble of Convolutional Neural Networks and Vision Transformers},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {11},
        number = {11},
        pages = {5937-5943},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=176212},
        abstract = {Deepfake technology has advanced rapidly due to the development of generative models and artificial intelligence, making it possible to create highly realistic but fake videos and images. This presents serious risks to digital integrity, misinformation, and privacy. Conventional deepfake detection techniques, which frequently depend on hand-crafted features or single-model architectures, have shown limited robustness to the evolving synthetic media landscape. In this paper, we propose a novel deepfake forensic framework built on an ensemble architecture that combines the complementary strengths of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs). Whereas ViTs excel at modeling global dependencies and contextual anomalies in visual content, CNNs are best at capturing local, texture-based inconsistencies. To improve detection accuracy and generalizability across a variety of datasets and manipulation techniques, the proposed ensemble combines the predictive outputs of several CNN and ViT models. By demonstrating the potential of hybrid deep learning architectures for tackling the escalating problem of deepfake detection, this study opens the door to more dependable and robust multimedia forensics systems.},
        keywords = {Deepfake Detection, Multimedia Forensics, Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Ensemble Learning, Synthetic Media, AI Security, Transfer Learning},
        month = {April},
        }

Cite This Article

Joshi, J., & Parihar, N. (2025). Deepfake Forensics Using Ensemble of Convolutional Neural Networks and Vision Transformers. International Journal of Innovative Research in Technology (IJIRT), 11(11), 5937–5943.
