# 🧠 Intelligent OCR System for Scanned PDF Documents
**Master's Thesis Software Development Project**
**Line of work:** Computational Perception & Machine Learning
**Author:** Sergio Jiménez
**Institution:** [UNIR - Universidad Internacional de La Rioja](https://www.unir.net/ingenieria/master-inteligencia-artificial/)
**Date:** 2025
---
## 📘 Overview
This project develops an **intelligent system for text extraction from scanned PDF documents**, combining **computer vision techniques** and **modern OCR models based on deep learning**.
The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) when dealing with **low-quality, skewed, or noisy scanned documents**, particularly in **Spanish**.
---
## 🎯 Objectives
- Develop a **modular OCR pipeline** that processes scanned PDFs end-to-end.
- Compare classical OCR tools with **state-of-the-art deep learning approaches** (EasyOCR, TrOCR, CRNN).
- Evaluate performance using **Character Error Rate (CER)** and **Word Error Rate (WER)**.
- Provide a **CLI-based demonstration tool** and analysis module for automated evaluation.
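
As an illustration of the evaluation metrics named above, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length (in characters for CER, in whitespace-separated words for WER). The function names `edit_distance`, `cer`, and `wer` are hypothetical and not taken from this project's code; the actual analysis module may normalize text or tokenize differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning the first i reference items into nothing costs i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: one wrong character out of four.
    print(cer("casa", "cosa"))  # 0.25
    # One misrecognized word out of three.
    print(wer("el documento escaneado", "el docurnento escaneado"))
```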
---
## 🧩 System Architecture
TODO