Computer Vision-Based Automated Image Caption Generation

  • Unique Paper ID: 188252
  • Volume: 12
  • Issue: 11
  • PageNo: 2703-2710
  • Abstract: The task of automatically generating natural-language descriptions for images is a core challenge at the intersection of computer vision and natural language processing. This paper presents a deep learning-based framework for automated image caption generation, designed to accurately describe the content of an image in a coherent and grammatically correct sentence. Our system employs a hybrid architecture combining a Convolutional Neural Network (CNN) as an encoder to extract visual features from an image and a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network, as a decoder to generate the corresponding caption. The model is trained and evaluated on a large-scale dataset of images with human-annotated captions, demonstrating its ability to produce descriptive and contextually relevant text. The architecture's performance is measured using standard metrics such as BLEU and METEOR, showing promising results in generating high-quality captions that capture the nuances of the visual content.
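The BLEU metric mentioned in the abstract scores a generated caption against a human reference by combining clipped n-gram precisions with a brevity penalty. The sketch below is a minimal, self-contained illustration of that standard definition, not the paper's evaluation code; the function and variable names (`bleu`, `ngrams`, the example sentences) are illustrative only, and published results typically use a library implementation with smoothing and multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n), scaled by a brevity penalty."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # a zero precision drives the geometric mean to 0
        log_prec_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_prec_sum / max_n)

cand = "a dog runs on the grass".split()
ref = "a dog runs on the green grass".split()
print(round(bleu(cand, ref), 3))  # → 0.673
```

Note that an exact match scores 1.0, any candidate sharing no unigrams with the reference scores 0.0, and the brevity penalty keeps a trivially short but precise caption from scoring highly.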

Copyright & License

Copyright © 2026. The authors retain the copyright of this article. This article is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{188252,
        author = {Vijaykumar Yadav and Anand Maha and Lokesh Rathod and Ralph Gonsalves},
        title = {Computer Vision-Based Automated Image Caption Generation},
        journal = {International Journal of Innovative Research in Technology},
        year = {2026},
        volume = {12},
        number = {11},
        pages = {2703--2710},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=188252},
        doi = {10.64643/IJIRTV12I11-188252-459},
        abstract = {The task of automatically generating natural-language descriptions for images is a core challenge at the intersection of computer vision and natural language processing. This paper presents a deep learning-based framework for automated image caption generation, designed to accurately describe the content of an image in a coherent and grammatically correct sentence. Our system employs a hybrid architecture combining a Convolutional Neural Network (CNN) as an encoder to extract visual features from an image and a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network, as a decoder to generate the corresponding caption. The model is trained and evaluated on a large-scale dataset of images with human-annotated captions, demonstrating its ability to produce descriptive and contextually relevant text. The architecture's performance is measured using standard metrics such as BLEU and METEOR, showing promising results in generating high-quality captions that capture the nuances of the visual content.},
        keywords = {Image Captioning, Computer Vision, Deep Learning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Natural Language Processing (NLP), Multimodal Learning},
        month = {April},
        }

Cite This Article

Yadav, V., Maha, A., Rathod, L., & Gonsalves, R. (2026). Computer Vision-Based Automated Image Caption Generation. International Journal of Innovative Research in Technology (IJIRT), 12(11), 2703-2710. https://doi.org/10.64643/IJIRTV12I11-188252-459
