A Multi-Engine OCR Framework for Accurate Text Extraction from Scanned and Printed Images Using Preprocessing Enhancements

  • Unique Paper ID: 184238
  • Volume: 12
  • Issue: 4
  • PageNo: 696-702
  • Abstract: In the fields of digitization, document archiving, and automated information retrieval, the extraction of textual data from scanned printed documents remains an essential requirement. Traditional OCR systems often struggle with low-quality or degraded images, where poor resolution, background noise, and uneven illumination significantly reduce recognition accuracy. This project addresses these challenges by adopting a multi-engine OCR strategy that integrates Tesseract, EasyOCR, and Keras-OCR to improve reliability and robustness. Scanned inputs are transformed into high-contrast formats using a specialized preprocessing pipeline that includes adaptive thresholding, noise reduction, grayscale conversion, and binarization to better distinguish text from background. In contrast to conventional pipelines, this method standardizes documents into white text on a black background, making character boundaries more visible. Each image is processed independently by the three OCR engines, and the outputs are compared on recognition quality, processing time, and fidelity of the extracted text. By leveraging the complementary strengths of traditional and deep learning–based OCR systems, this hybrid framework offers a more adaptable and accurate solution for diverse document types and varying scan qualities, contributing to improved digitization workflows and reliable text recognition.
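
The per-engine comparison described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name its quality metric, so character error rate (edit distance divided by reference length) is an assumption, and `compare_engines`, `character_error_rate`, and `edit_distance` are hypothetical helper names. Each engine is passed in as a plain callable that maps an image to a string, standing in for whatever wrappers around Tesseract, EasyOCR, and Keras-OCR a real pipeline would use.

```python
import time
from typing import Any, Callable, Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance normalized by reference length (assumed metric)."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return edit_distance(predicted, reference) / len(reference)


def compare_engines(
    image: Any,
    engines: Dict[str, Callable[[Any], str]],
    reference: str,
) -> Dict[str, Dict[str, Any]]:
    """Run every engine on the same image; record text, wall time, and CER."""
    results = {}
    for name, recognize in engines.items():
        start = time.perf_counter()
        text = recognize(image)
        elapsed = time.perf_counter() - start
        results[name] = {
            "text": text,
            "seconds": elapsed,
            "cer": character_error_rate(text, reference),
        }
    return results
```

Keeping the engines behind a common callable interface means the same preprocessed image and the same ground-truth text are used for every engine, so differences in the reported numbers come from the engines rather than from the harness.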

Copyright & License

Copyright © 2025. The authors retain the copyright of this article. This article is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{184238,
        author = {Santhosh SG and Chandana BN},
        title = {A Multi-Engine OCR Framework for Accurate Text Extraction from Scanned and Printed Images Using Preprocessing Enhancements},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {12},
        number = {4},
        pages = {696--702},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=184238},
        abstract = {In the fields of digitization, document archiving, and automated information retrieval, the extraction of textual data from scanned printed documents remains an essential requirement. Traditional OCR systems often struggle with low-quality or degraded images, where poor resolution, background noise, and uneven illumination significantly reduce recognition accuracy. This project addresses these challenges by adopting a multi-engine OCR strategy that integrates Tesseract, EasyOCR, and Keras-OCR to improve reliability and robustness. Scanned inputs are transformed into high-contrast formats using a specialized preprocessing pipeline that includes adaptive thresholding, noise reduction, grayscale conversion, and binarization to better distinguish text from background. In contrast to conventional pipelines, this method standardizes documents into white text on a black background, making character boundaries more visible. Each image is processed independently by the three OCR engines, and the outputs are compared on recognition quality, processing time, and fidelity of the extracted text. By leveraging the complementary strengths of traditional and deep learning–based OCR systems, this hybrid framework offers a more adaptable and accurate solution for diverse document types and varying scan qualities, contributing to improved digitization workflows and reliable text recognition.},
        keywords = {EasyOCR, Keras-OCR, Optical Character Recognition (OCR), Tesseract, Thresholding},
        month = {September},
        }
