DeepFake Image Detection using CvT Model

  • Unique Paper ID: 183822
  • PageNo: 3174-3180
  • Abstract: Deepfakes have become a serious challenge to society's trust in digital media because they can carry manipulative content. In this work, we apply Microsoft's Convolutional Vision Transformer (CvT) model to deepfake detection using the recently introduced DF40 dataset. Unlike traditional CNN-based methods, CvT combines the local feature extraction strength of convolutions with the global reasoning capabilities of transformers, allowing it to capture both fine-grained artifacts and broader semantic inconsistencies. We train the CvT model end-to-end on DF40 and evaluate its performance without relying on additional ensembles or handcrafted features. The proposed approach achieves an accuracy of 86.26%, demonstrating that CvT can serve as a strong baseline for deepfake detection. Our results highlight the potential of transformer-based vision architectures for building scalable, accurate, and adaptable deepfake forensics systems.

Copyright & License

Copyright © 2026. The authors retain the copyright of this article. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{183822,
        author = {Vivek Prajapati and Ashwin I Mehta},
        title = {DeepFake Image Detection using CvT Model},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {12},
        number = {3},
        pages = {3174--3180},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=183822},
        abstract = {Deepfakes have become a serious challenge to society's trust in digital media because they can carry manipulative content. In this work, we apply Microsoft's Convolutional Vision Transformer (CvT) model to deepfake detection using the recently introduced DF40 dataset. Unlike traditional CNN-based methods, CvT combines the local feature extraction strength of convolutions with the global reasoning capabilities of transformers, allowing it to capture both fine-grained artifacts and broader semantic inconsistencies. We train the CvT model end-to-end on DF40 and evaluate its performance without relying on additional ensembles or handcrafted features. The proposed approach achieves an accuracy of 86.26\%, demonstrating that CvT can serve as a strong baseline for deepfake detection. Our results highlight the potential of transformer-based vision architectures for building scalable, accurate, and adaptable deepfake forensics systems.},
        keywords = {},
        month = {August},
}

Cite This Article

Prajapati, V., & Mehta, A. I. (2025). DeepFake Image Detection using CvT Model. International Journal of Innovative Research in Technology (IJIRT), 12(3), 3174–3180.
