Zero-Shot Image Classification Using CLIP For Interactive Kiosk Assistance

  • Unique Paper ID: 185330
  • PageNo: 883-889
  • Abstract: Interactive kiosks have become essential in public spaces such as malls, museums, zoos, sanctuaries, and educational centers, providing visitors with access to required information [1]. Most modern kiosks are equipped with touchscreen interfaces, QR code scanning [2], and voice recognition to improve user engagement. This paper presents the design of an image-classifying kiosk with which visitors can interact through their mobile phones to identify animals and converse to gain more knowledge. For image classification, traditional methods use convolutional neural network (CNN) models such as ResNet, EfficientNet, and VGG, which extract hierarchical features from images to classify them into predefined categories [3][4][5]. One major limitation of these models is that they require descriptive labeled datasets and extensive pre-training, both of which are expensive and time-consuming [6]. In recent years, zero-shot learning has emerged as a groundbreaking approach in image classification, enabling models to recognize objects they have never encountered during training [7]. Our proposed system uses the attribute-based zero-shot capabilities of OpenAI's Contrastive Language-Image Pretraining (CLIP), a vision-language model, on the Animals with Attributes 2 (AWA2) dataset [8]. It effectively identifies both seen and unseen categories without explicit training on every possible class, by maximizing the cosine similarity between the image embedding and the correct class-attribute embedding [9]. The kiosk also offers a chat feature, powered by the LLaMA 3.3 70B Versatile model, that dynamically generates and answers context-sensitive questions about the classified animal [10]. Our proposed application thus allows visitors to experience zoos and sanctuaries not only for curiosity and recreation but also for animal education, interpretation, and research [11].
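
Zero-Shot Classification Sketch

As a minimal illustration of the zero-shot mechanism the abstract describes, the Python sketch below classifies an image by cosine similarity between CLIP embeddings. It assumes OpenAI's open-source clip package and PyTorch; the file name, the four class names, and the "a photo of a ..." prompt template are placeholders for illustration, whereas the paper pairs images with AWA2 class-attribute prompts, which this sketch does not reproduce.

# Minimal sketch, assuming OpenAI's open-source "clip" package
# (pip install git+https://github.com/openai/CLIP.git) and PyTorch.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# The paper's keywords mention both the ViT-B/32 and ViT-B/16 encoders.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder class names; the paper uses the 50 AWA2 animal classes
# with attribute-based prompts rather than this generic template.
class_names = ["zebra", "giraffe", "polar bear", "otter"]
text_tokens = clip.tokenize(
    [f"a photo of a {name}" for name in class_names]
).to(device)

# "visitor_photo.jpg" is a hypothetical file name for a kiosk upload.
image = preprocess(Image.open("visitor_photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text_tokens)

# Cosine similarity reduces to a dot product after L2 normalization;
# the predicted class maximizes this similarity, as in the abstract.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).squeeze(0)

print("Predicted class:", class_names[similarity.argmax().item()])

In the full system described in the abstract, the top prediction would then seed the LLaMA 3.3 70B chat prompt; that step is omitted here.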

Copyright & License

Copyright © 2026. Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{185330,
        author = {Shruthiga K and Samyukta K and Sowndarya S},
        title = {Zero-Shot Image Classification Using CLIP For Interactive Kiosk Assistance},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {12},
        number = {5},
        pages = {883-889},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=185330},
        abstract = {Interactive kiosks have become essential in public spaces such as malls, museums, zoos, sanctuaries, and educational centers, providing visitors with access to required information [1]. Most modern kiosks are equipped with touchscreen interfaces, QR code scanning [2], and voice recognition to improve user engagement. This paper presents the design of an image-classifying kiosk with which visitors can interact through their mobile phones to identify animals and converse to gain more knowledge. For image classification, traditional methods use convolutional neural network (CNN) models such as ResNet, EfficientNet, and VGG, which extract hierarchical features from images to classify them into predefined categories [3][4][5]. One major limitation of these models is that they require descriptive labeled datasets and extensive pre-training, both of which are expensive and time-consuming [6]. In recent years, zero-shot learning has emerged as a groundbreaking approach in image classification, enabling models to recognize objects they have never encountered during training [7]. Our proposed system uses the attribute-based zero-shot capabilities of OpenAI's Contrastive Language-Image Pretraining (CLIP), a vision-language model, on the Animals with Attributes 2 (AWA2) dataset [8]. It effectively identifies both seen and unseen categories without explicit training on every possible class, by maximizing the cosine similarity between the image embedding and the correct class-attribute embedding [9]. The kiosk also offers a chat feature, powered by the LLaMA 3.3 70B Versatile model, that dynamically generates and answers context-sensitive questions about the classified animal [10]. Our proposed application thus allows visitors to experience zoos and sanctuaries not only for curiosity and recreation but also for animal education, interpretation, and research [11].},
        keywords = {Zero-shot learning, Contrastive Language-Image Pretraining (CLIP), ViT-B/32 image encoder, ViT-B/16 image encoder, LLaMA 3.3 70B model, Animals with Attributes 2 (AWA2) dataset, Kiosk systems.},
        month = {October},
        }

Cite This Article

Shruthiga, K., Samyukta, K., & Sowndarya, S. (2025). Zero-Shot Image Classification Using CLIP For Interactive Kiosk Assistance. International Journal of Innovative Research in Technology (IJIRT), 12(5), 883–889.
