🧠 Intelligent OCR System for Scanned PDF Documents

Master’s Thesis – Software Development Project
Line of work: Computational Perception & Machine Learning
Author: Sergio Jiménez
Institution: [UNIR - Universidad Internacional de La Rioja](https://www.unir.net/ingenieria/master-inteligencia-artificial/)
Date: 2025

📘 Overview

This project develops a text extraction system for scanned PDF documents that combines computer vision techniques with modern deep learning OCR models.
The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) on low-quality, skewed, or noisy scans, particularly for documents in Spanish.

🎯 Objectives

Develop a modular OCR pipeline that processes scanned PDFs end-to-end.
Compare classical OCR tools with state-of-the-art deep learning approaches (EasyOCR, TrOCR, CRNN).
Evaluate performance using Character Error Rate (CER) and Word Error Rate (WER).
Provide a CLI-based demonstration tool and analysis module for automated evaluation.
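
The two evaluation metrics named above can be sketched as follows. This is a minimal illustration, not the project's actual evaluation module: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and the sketch assumes both metrics are defined as Levenshtein distance divided by the length of the reference (in characters for CER, in whitespace-separated words for WER).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: the OCR output drops the accent in "camión".
    ground_truth = "el camión rojo"
    ocr_output = "el camion rojo"
    print(f"CER: {cer(ground_truth, ocr_output):.3f}")
    print(f"WER: {wer(ground_truth, ocr_output):.3f}")
```

Because both rates are normalized by the reference length, they can exceed 1.0 when the OCR output contains many spurious insertions. For Spanish text, whether accents, case, and punctuation are normalized before scoring changes the results noticeably, so that choice should be fixed and reported alongside the numbers.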

🧩 System Architecture

TODO

