SignAI: Indian Sign Language Recognition Using VideoMAE-Based Vision Transformers

  • Unique Paper ID: 202972
  • Volume: 12
  • Issue: 12
  • PageNo: 8293-8300
  • Abstract: Bridging communication between the deaf and hearing communities demands reliable automated sign language interpretation. Widespread unfamiliarity with sign language among the general population creates substantial accessibility challenges. This paper presents SignAI, a video-based deep learning system for recognizing Indian Sign Language (ISL). The architecture centers on VideoMAE, a vision transformer pre-trained through masked autoencoding on video data, which jointly encodes hand shape and motion cues from gesture clips. Unlike prior methods that depend on CNNs or LSTM-based recurrent networks, SignAI treats each gesture as a continuous temporal sequence. The pipeline splits raw video into 16-frame clips, feeds them to a fine-tuned VideoMAE backbone, and maps the extracted representations to one of 101 ISL categories. End-to-end deployment pairs a React-based upload interface with a FastAPI inference server that returns both text and synthesized speech. Experiments on the ISL-CSLTR benchmark show a Top-1 validation accuracy of approximately 74%, with well-balanced precision, recall, and F1-score, confirming the suitability of transformer-based sequence models for capturing the fine-grained spatiotemporal structure of ISL gestures.
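
The abstract states that raw video is reduced to 16-frame clips but does not say how the frames are chosen. The sketch below shows one common option, uniform sampling across the whole gesture, purely as an illustration. The function name `sample_clip_indices` and the segment-centre strategy are assumptions, not the authors' published method.

```python
from typing import List


def sample_clip_indices(num_frames: int, clip_len: int = 16) -> List[int]:
    """Return clip_len frame indices spread evenly over a video.

    Hypothetical helper: the paper only says clips have 16 frames;
    uniform segment-centre sampling is an assumption made here.
    """
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    # Split the video into clip_len equal segments and take the frame
    # at the centre of each one. Videos shorter than clip_len simply
    # repeat frames, so the output length is always clip_len.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
        for i in range(clip_len)
    ]


if __name__ == "__main__":
    # A 32-frame gesture video yields every second frame.
    print(sample_clip_indices(32))
```

The selected frames would then be decoded, resized, and stacked into the tensor that the fine-tuned backbone consumes. That part depends on the model library and is left out here.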

Copyright & License

Copyright © 2026. Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{202972,
        author = {Chinmayee B and Meghana G K and Dhanusha Patel M and Suman Kumar Mahto},
        title = {SignAI: Indian Sign Language Recognition Using VideoMAE-Based Vision Transformers},
        journal = {International Journal of Innovative Research in Technology},
        year = {2026},
        volume = {12},
        number = {12},
        pages = {8293--8300},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=202972},
        abstract = {Bridging communication between the deaf and hearing communities demands reliable automated sign language interpretation. Widespread unfamiliarity with sign language among the general population creates substantial accessibility challenges. This paper presents SignAI, a video-based deep learning system for recognizing Indian Sign Language (ISL). The architecture centers on VideoMAE, a vision transformer pre-trained through masked autoencoding on video data, which jointly encodes hand shape and motion cues from gesture clips. Unlike prior methods that depend on CNNs or LSTM-based recurrent networks, SignAI treats each gesture as a continuous temporal sequence. The pipeline splits raw video into 16-frame clips, feeds them to a fine-tuned VideoMAE backbone, and maps the extracted representations to one of 101 ISL categories. End-to-end deployment pairs a React-based upload interface with a FastAPI inference server that returns both text and synthesized speech. Experiments on the ISL-CSLTR benchmark show a Top-1 validation accuracy of approximately 74\%, with well-balanced precision, recall, and F1-score, confirming the suitability of transformer-based sequence models for capturing the fine-grained spatiotemporal structure of ISL gestures.},
        keywords = {Indian Sign Language (ISL), Gesture Recognition, VideoMAE, Vision Transformer, Deep Learning, Spatiotemporal Analysis, Sequence-Based Classification, Computer Vision},
        month = {May},
        }

Cite This Article

B, C., K, M. G., M, D. P., & Mahto, S. K. (2026). SignAI: Indian Sign Language Recognition Using VideoMAE-Based Vision Transformers. International Journal of Innovative Research in Technology (IJIRT), 12(12), 8293–8300.
