🧠 Intelligent OCR System for Scanned PDF Documents
Master’s Thesis – Software Development Project
Research line: Computational Perception & Machine Learning
Author: Sergio Jiménez
Institution: [UNIR - Universidad Internacional de La Rioja](https://www.unir.net/ingenieria/master-inteligencia-artificial/)
Date: 2025
📘 Overview
This project develops an intelligent system for text extraction from scanned PDF documents, combining computer vision techniques and modern OCR models based on deep learning.
The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) when dealing with low-quality, skewed, or noisy scanned documents, particularly in Spanish.
🎯 Objectives
- Develop a modular OCR pipeline that processes scanned PDFs end-to-end.
- Compare classical OCR tools with state-of-the-art deep learning approaches (EasyOCR, TrOCR, CRNN).
- Evaluate performance using Character Error Rate (CER) and Word Error Rate (WER).
- Provide a CLI-based demonstration tool and analysis module for automated evaluation.
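As an illustration of the evaluation metrics named above, here is a minimal sketch of CER and WER computed from the Levenshtein edit distance. The function names (`edit_distance`, `cer`, `wer`) are hypothetical and are not taken from this repository. The sketch also assumes whitespace tokenization for words and applies no text normalization (case, accents, punctuation), which the project's actual evaluation may handle differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty sequence costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    reference = "texto de prueba"
    hypothesis = "texto de prueva"  # simulated OCR output with one wrong character
    print(f"CER: {cer(reference, hypothesis):.4f}")
    print(f"WER: {wer(reference, hypothesis):.4f}")
```

In this example a single substituted character out of 15 gives a CER of 1/15, while one wrong word out of three gives a WER of 1/3. Both metrics can exceed 1.0 when the hypothesis contains many insertions, since they are normalized by the reference length only.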
🧩 System Architecture
TODO