# 🧠 Intelligent OCR System for Scanned PDF Documents

**Master’s Thesis – Software Development Project**

**Line of work:** Computer Perception & Machine Learning

**Author:** Sergio Jiménez

**Institution:** [UNIR - Universidad Internacional de La Rioja](https://www.unir.net/ingenieria/master-inteligencia-artificial/)

**Date:** 2025

---

## 📘 Overview

This project develops an **intelligent system for text extraction from scanned PDF documents**, combining **computer vision techniques** with **modern OCR models based on deep learning**.

The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) on **low-quality, skewed, or noisy scanned documents**, particularly those written in **Spanish**.

---

## 🎯 Objectives

- Develop a **modular OCR pipeline** that processes scanned PDFs end-to-end.
- Compare classical OCR tools with **state-of-the-art deep learning approaches** (EasyOCR, TrOCR, CRNN).
- Evaluate performance using **Character Error Rate (CER)** and **Word Error Rate (WER)**.
- Provide a **CLI-based demonstration tool** and an analysis module for automated evaluation.

---

## 🧩 System Architecture

TODO
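
---

## 📏 Evaluation Metrics (illustrative sketch)

The objectives name CER and WER as the evaluation metrics. Both are commonly defined as the Levenshtein (edit) distance between the OCR output and the ground-truth text, divided by the length of the ground truth, counted in characters for CER and in whitespace-separated words for WER. The following is a minimal sketch under that assumption, not the project's actual evaluation module; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and no text normalization (case, accents, punctuation) is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "total de la factura"
    ocr_output = "total de la faotura"  # one misread character
    print(f"CER = {cer(truth, ocr_output):.4f}")
    print(f"WER = {wer(truth, ocr_output):.4f}")
```

In this example a single misread character produces one character edit out of 19 reference characters, but one wrong word out of four, so WER penalizes the same error much more heavily than CER. Note that either metric can exceed 1.0 when the hypothesis is much longer than the reference.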