AI/ML-Based Multilingual Document Translation System: Nepali and Sinhalese to English

  • Unique Paper ID: 197010
  • Volume: 12
  • Issue: 11
  • PageNo: 4940-4946
  • Abstract: Language barriers significantly limit access to regional literary works and historical texts, especially those written in low-resource languages such as Nepali and Sinhalese. Most of these works exist only as scanned or hard copies, which makes manual translation slow and impractical. In this paper, we introduce an automated translation solution that combines optical character recognition (OCR) and Neural Machine Translation (NMT) to translate Nepali and Sinhalese text into English. The solution uses Tesseract OCR for character recognition and multilingual transformer models such as NLLB and MarianMT for translation. The pipeline consists of image preprocessing, OCR text extraction, text normalisation, language identification, and translation. In our experiments, the OCR step achieves accuracies of 88% for Nepali and 85% for Sinhalese documents, and the system obtains an average BLEU score of 0.72. The resulting tool is robust and efficient, requires minimal manual effort, and can run offline for security reasons.
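
The text normalisation and language identification steps named in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalise` and `identify_language` are hypothetical, identification is done purely by Unicode script (which cannot tell Nepali apart from other Devanagari-script languages such as Hindi), and the mapping to NLLB language codes is an assumption about how the detected language might be handed to the translation model.

```python
import unicodedata
from collections import Counter

# Unicode blocks of the two source scripts.
SCRIPT_RANGES = {
    "nepali": (0x0900, 0x097F),     # Devanagari block
    "sinhalese": (0x0D80, 0x0DFF),  # Sinhala block
}

# Assumption: FLORES-200 style codes as used by NLLB models.
NLLB_CODES = {"nepali": "npi_Deva", "sinhalese": "sin_Sinh"}


def normalise(text):
    """Apply Unicode NFC and collapse the stray whitespace OCR tends to leave."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def identify_language(text):
    """Return the language whose script covers the most characters, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for lang, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[lang] += 1
    if not counts:
        return None  # no Devanagari or Sinhala characters found
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    raw = "  සිංහල \n භාෂාව "
    clean = normalise(raw)
    lang = identify_language(clean)
    print(clean, lang, NLLB_CODES.get(lang))
```

In a full pipeline, the cleaned text and the selected source-language code would then be passed to the translation model; that step is omitted here because it depends on the specific model and library the system uses.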

Copyright & License

Copyright © 2026 Authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{197010,
        author = {Sandeep Kulkarni and Vaishnavi Salunkhe and Yati Kumari and Arpita Yadav},
        title = {AI/ML-Based Multilingual Document Translation System: Nepali and Sinhalese to English},
        journal = {International Journal of Innovative Research in Technology},
        year = {2026},
        volume = {12},
        number = {11},
        pages = {4940--4946},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=197010},
        abstract = {Language barriers significantly limit access to regional literary works and historical texts, especially those written in low-resource languages such as Nepali and Sinhalese. Most of these works exist only as scanned or hard copies, which makes manual translation slow and impractical. In this paper, we introduce an automated translation solution that combines optical character recognition (OCR) and Neural Machine Translation (NMT) to translate Nepali and Sinhalese text into English. The solution uses Tesseract OCR for character recognition and multilingual transformer models such as NLLB and MarianMT for translation. The pipeline consists of image preprocessing, OCR text extraction, text normalisation, language identification, and translation. In our experiments, the OCR step achieves accuracies of 88\% for Nepali and 85\% for Sinhalese documents, and the system obtains an average BLEU score of 0.72. The resulting tool is robust and efficient, requires minimal manual effort, and can run offline for security reasons.},
        keywords = {Artificial Intelligence; OCR; Neural Machine Translation; Multilingual NLP; Low-resource Languages; Transformer Models},
        month = {April}
}

Cite This Article

Kulkarni, S., Salunkhe, V., Kumari, Y., & Yadav, A. (2026). AI/ML-Based Multilingual Document Translation System: Nepali and Sinhalese to English. International Journal of Innovative Research in Technology (IJIRT), 12(11), 4940–4946.
