diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..b585786
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+~$*.docx
diff --git a/README.md b/README.md
index b5d0e0f..6bf17d6 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,30 @@
-# MastersThesis
+# 🧠 Intelligent OCR System for Scanned PDF Documents
+
+**Master’s Thesis – Software Development Project**
+**Line of work:** Computational Perception & Machine Learning
+**Author:** Sergio Jiménez
+**Institution:** [UNIR - Universidad Internacional de La Rioja](https://www.unir.net/ingenieria/master-inteligencia-artificial/)
+**Date:** 2025
+
+---
+
+## 📘 Overview
+
+This project develops an **intelligent system for text extraction from scanned PDF documents**, combining **computer vision techniques** and **modern OCR models based on deep learning**.
+The goal is to overcome the limitations of traditional OCR tools (e.g., Tesseract) when dealing with **low-quality, skewed, or noisy scanned documents**, particularly in **Spanish**.
+
+---
+
+## 🎯 Objectives
+
+- Develop a **modular OCR pipeline** that processes scanned PDFs end-to-end.
+- Compare classical OCR tools with **state-of-the-art deep learning approaches** (EasyOCR, TrOCR, CRNN).
+- Evaluate performance using **Character Error Rate (CER)** and **Word Error Rate (WER)**.
+- Provide a **CLI-based demonstration tool** and analysis module for automated evaluation.
+
+---
+
+## 🧩 System Architecture
+
+TODO
diff --git a/thesis_report.docx b/thesis_report.docx
new file mode 100644
index 0000000..16a9ac5
Binary files /dev/null and b/thesis_report.docx differ
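
The README's Objectives section names Character Error Rate (CER) and Word Error Rate (WER) as the evaluation metrics, but the diff does not include an implementation. The following is a minimal sketch of how these metrics are commonly computed, assuming the usual definition: Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth (in characters for CER, in whitespace-separated words for WER). The function names `edit_distance`, `cer`, and `wer` are hypothetical and do not come from the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("documento", "documenta")` counts one substituted character out of nine, and `wer("el documento escaneado", "el documento escaneada")` counts one wrong word out of three. Both rates can exceed 1.0 when the OCR output is much longer than the reference, and whether to normalize case, accents, or whitespace before scoring is a design choice the README leaves open.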