Automated ID and Certificate Data Extraction using Optical Character Recognition

  • Unique Paper ID: 182460
  • Page No: 2164-2170
  • Abstract: Optical Character Recognition (OCR) technology is essential for extracting text from scanned documents, images, and PDFs. Traditional OCR methods struggle with structured data extraction due to format variations and noise. This project presents an OCR-based Data Extraction System that combines Regular Expressions (Regex) and Machine Learning (ML) to enhance accuracy and reliability. Using Tesseract OCR, the system converts scanned text into a machine-readable format, followed by text cleaning to ensure structured output. Regex identifies key attributes, while an ML model predicts missing data when regex fails. Extracted data is structured in JSON and exported to Excel for integration and analysis. Error handling ensures smooth execution, making the system effective for applications in education, banking, and government. By combining rule-based and ML approaches, this solution improves efficiency, scalability, and accuracy in automated document processing.
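
The regex-first, fallback-second flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names, the patterns in `FIELD_PATTERNS`, and the `fallback` callable (standing in for the ML model) are all assumptions, since the abstract does not specify them. The sample string stands in for OCR output, which in practice might come from a call such as `pytesseract.image_to_string`.

```python
import json
import re

# Hypothetical field patterns; the abstract does not list the actual regexes.
FIELD_PATTERNS = {
    "name": re.compile(r"Name\s*[:\-]\s*([A-Za-z .]+)"),
    "id_number": re.compile(r"ID\s*(?:No\.?|Number)?\s*[:\-]\s*([A-Z0-9]+)"),
    "date_of_birth": re.compile(
        r"(?:DOB|Date of Birth)\s*[:\-]\s*(\d{2}[/-]\d{2}[/-]\d{4})"
    ),
}


def clean_text(raw):
    """Collapse runs of spaces/tabs and drop empty lines from OCR output."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_fields(text, fallback=None):
    """Try each regex; if it fails, ask the fallback (assumed ML model) instead.

    `fallback` is a hypothetical callable taking (field_name, text).
    Without a fallback, unmatched fields are recorded as None.
    """
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
        elif fallback is not None:
            record[field] = fallback(field, text)
        else:
            record[field] = None
    return record


# Stand-in for text returned by the OCR step.
sample = "Name :  Jane Doe\n\nID No: AB12345\nIssued by   Example University\n"
record = extract_fields(clean_text(sample))
print(json.dumps(record, indent=2))  # structured JSON output
```

Keeping the fallback as a plain callable means the rule-based path can be tested on its own, and a trained model can be plugged in later without changing the extraction loop. The Excel export step mentioned in the abstract is omitted here to keep the sketch dependency-free.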

Copyright & License

Copyright © 2026. Authors retain the copyright of this article. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{182460,
        author = {Gudala Bhavana and G. Tarshith and G. Vandana and D. Chandra Lekha},
        title = {Automated ID and Certificate Data Extraction using Optical Character Recognition},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {12},
        number = {2},
        pages = {2164--2170},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=182460},
        abstract = {Optical Character Recognition (OCR) technology is essential for extracting text from scanned documents, images, and PDFs. Traditional OCR methods struggle with structured data extraction due to format variations and noise. This project presents an OCR-based Data Extraction System that combines Regular Expressions (Regex) and Machine Learning (ML) to enhance accuracy and reliability. Using Tesseract OCR, the system converts scanned text into a machine-readable format, followed by text cleaning to ensure structured output. Regex identifies key attributes, while an ML model predicts missing data when regex fails. Extracted data is structured in JSON and exported to Excel for integration and analysis. Error handling ensures smooth execution, making the system effective for applications in education, banking, and government. By combining rule-based and ML approaches, this solution improves efficiency, scalability, and accuracy in automated document processing.},
        keywords = {Optical Character Recognition (OCR), Data Extraction, Machine Learning (ML), Regular Expressions (Regex), Tesseract OCR, Document Processing, Structured Data Extraction, Text Recognition, Automation, PDF and Image Processing, Pattern Matching, JSON Data Structuring, Error Handling, AI-based Text Extraction, Information Retrieval},
        month = {July},
        }

Cite This Article

Bhavana, G., Tarshith, G., Vandana, G., & Lekha, D. C. (2025). Automated ID and Certificate Data Extraction using Optical Character Recognition. International Journal of Innovative Research in Technology (IJIRT), 12(2), 2164–2170.
