
🧠 Intelligent OCR System for Scanned PDF Documents

Master's Thesis Software Development Project
Line of work: Computational Perception & Machine Learning
Author: Sergio Jiménez
Institution: UNIR - Universidad Internacional de La Rioja

Date: 2025


📘 Overview

This project develops an intelligent system for text extraction from scanned PDF documents, combining computer vision techniques and modern OCR models based on deep learning.
The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) when dealing with low-quality, skewed, or noisy scanned documents, particularly in Spanish.


🎯 Objectives

  • Develop a modular OCR pipeline that processes scanned PDFs end-to-end.
  • Compare classical OCR tools with state-of-the-art deep learning approaches (EasyOCR, TrOCR, CRNN).
  • Evaluate performance using Character Error Rate (CER) and Word Error Rate (WER).
  • Provide a CLI-based demonstration tool and analysis module for automated evaluation.
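
The two evaluation metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's evaluation module: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and it assumes the usual definitions (Levenshtein distance divided by the length of the reference, counted in characters for CER and in whitespace-separated words for WER).

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("niño", "nino")` counts one substituted character out of four, and `wer("el niño come", "el nino come")` counts one wrong word out of three. Both values can exceed 1.0 when the hypothesis is much longer than the reference. Any text normalization (case, accents, whitespace) applied before scoring is a separate design choice and is not shown here.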

🧩 System Architecture

TODO
