MLOps for AI: Tracking, Synthesizing, and Monitoring Models

  • Unique Paper ID: 179691
  • PageNo: 8765–8771
  • Abstract: In recent years, Transformer-based architectures have revolutionized the field of video understanding by enabling models to capture rich spatiotemporal dependencies. This review provides a comprehensive examination of frame-level video analysis using Transformer-based feature extraction techniques. We trace the evolution from early video transformers like TimeSformer and ViViT to more advanced and efficient variants such as VideoMAE, Swin Transformer, and Uniformer. The review presents a comparative study of these models in terms of accuracy, computational efficiency, and real-world applicability, using benchmark datasets like 50Salads and Something-Something V2. We also propose a theoretical framework, highlight domain-specific use cases, and outline key future directions including efficient attention mechanisms, self-supervised learning, and explainable AI. The synthesis offered here serves as both a state-of-the-art summary and a research roadmap for building robust, scalable, and interpretable Transformer-based video systems.
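
Illustrative Sketch: Frame-Level Feature Extraction

The abstract surveys frame-level feature extraction with Transformer backbones such as VideoMAE. The snippet below is a minimal sketch of that idea, not code from the article: the Hugging Face Transformers library, the public "MCG-NJU/videomae-base" checkpoint, the dummy input clip, and the pooling step are all illustrative assumptions.

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

# Illustrative assumption: a public VideoMAE checkpoint stands in for
# whatever backbone a given study actually uses.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
model.eval()

# Dummy clip: 16 RGB frames of 224x224 (replace with real decoded frames).
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
          for _ in range(16)]

inputs = processor(frames, return_tensors="pt")  # pixel_values: (1, 16, 3, 224, 224)
with torch.no_grad():
    outputs = model(**inputs)

# The base model tokenizes the clip into 2-frame tubelets over a 14x14
# spatial grid: 8 * 14 * 14 = 1568 tokens, each with hidden size 768.
tokens = outputs.last_hidden_state                         # (1, 1568, 768)
per_step = tokens.reshape(1, 8, 14 * 14, -1).mean(dim=2)   # (1, 8, 768)

Mean-pooling over the 14x14 spatial grid yields one descriptor per 2-frame tubelet; these can be repeated or interpolated along the time axis when strictly per-frame features are required, e.g. for temporal segmentation on datasets like 50Salads.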

Copyright & License

Copyright © 2026. The authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{179691,
  author   = {Manish Tripathi},
  title    = {MLOps for AI: Tracking, Synthesizing, and Monitoring Models},
  journal  = {International Journal of Innovative Research in Technology},
  year     = {2025},
  month    = {May},
  volume   = {11},
  number   = {12},
  pages    = {8765--8771},
  issn     = {2349-6002},
  url      = {https://ijirt.org/article?manuscript=179691},
  abstract = {In recent years, Transformer-based architectures have revolutionized the field of video understanding by enabling models to capture rich spatiotemporal dependencies. This review provides a comprehensive examination of frame-level video analysis using Transformer-based feature extraction techniques. We trace the evolution from early video transformers like TimeSformer and ViViT to more advanced and efficient variants such as VideoMAE, Swin Transformer, and Uniformer. The review presents a comparative study of these models in terms of accuracy, computational efficiency, and real-world applicability, using benchmark datasets like 50Salads and Something-Something V2. We also propose a theoretical framework, highlight domain-specific use cases, and outline key future directions including efficient attention mechanisms, self-supervised learning, and explainable AI. The synthesis offered here serves as both a state-of-the-art summary and a research roadmap for building robust, scalable, and interpretable Transformer-based video systems.},
  keywords = {Transformer; Frame-Level Video Analysis; VideoMAE; TimeSformer; Temporal Segmentation; Spatiotemporal Attention; Vision Transformers; Action Recognition; Self-Supervised Learning; Multimodal Video Understanding}
}

Cite This Article

Tripathi, M. (2025). MLOps for AI: Tracking, Synthesizing, and Monitoring Models. International Journal of Innovative Research in Technology (IJIRT), 11(12), 8765–8771.
