Markdown chapters.

2025-12-10 16:06:47 +01:00
parent 6ea2b4b6c2
commit a8198f0906
10 changed files with 1547 additions and 36 deletions
--- a/claude.md
+++ b/claude.md
@@ -0,0 +1,159 @@
+# Claude Code Context - Master's Thesis OCR Project
+
+## Project Overview
+
+This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.
+
+**Author:** Sergio Jiménez Jiménez
+**University:** UNIR (Universidad Internacional de La Rioja)
+**Year:** 2025
+
+## Key Context
+
+### Why Hyperparameter Optimization Instead of Fine-tuning
+
+Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
+- Fine-tuning deep learning models without GPU is prohibitively slow
+- Inference time is ~69 seconds/page on CPU
+- Hyperparameter optimization proved to be an effective alternative, achieving an 80.9% reduction in CER
+
+### Main Results
+
+| Model | CER | Character Accuracy |
+|-------|-----|-------------------|
+| PaddleOCR Baseline | 7.78% | 92.22% |
+| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |
+
+**Goal achieved:** the target was CER < 2%, and the optimized configuration reaches 1.49%.
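+
+As a quick consistency check, the relative CER reduction can be recomputed from the rounded values in the table. This is only a sketch: the small gap to the reported 80.9% is presumably due to rounding of the table values, which is an assumption here and should be confirmed against the results CSV.
+
+$$
+\text{CER reduction} = \frac{\text{CER}_{\text{baseline}} - \text{CER}_{\text{optimized}}}{\text{CER}_{\text{baseline}}} = \frac{7.78 - 1.49}{7.78} \approx 0.808
+$$
+
+The character accuracy column is the complement of CER, for example $100\% - 1.49\% = 98.51\%$.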
+
+### Optimal Configuration Found
+
+```python
+config_optimizada = {
+    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
+    "use_doc_orientation_classify": False,
+    "use_doc_unwarping": False,
+    "text_det_thresh": 0.4690,
+    "text_det_box_thresh": 0.5412,
+    "text_det_unclip_ratio": 0.0,
+    "text_rec_score_thresh": 0.6350,
+}
+```
+
+### Key Findings
+
+1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
+2. `text_det_thresh` has a correlation of -0.52 with CER; values below 0.1 cause catastrophic failures
+3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs
+
+## Repository Structure
+
+```
+MastersThesis/
+├── docs/                    # Thesis chapters in Markdown
+│   ├── 00_resumen.md        # Abstract (Spanish + English)
+│   ├── 01_introduccion.md   # Introduction
+│   ├── 02_contexto_estado_arte.md  # Context and State of the Art
+│   ├── 03_objetivos_metodologia.md # Objectives and Methodology
+│   ├── 04_comparativa_soluciones.md # OCR Comparative Study
+│   ├── 05_optimizacion_hiperparametros.md # Ray Tune Optimization
+│   ├── 06_resultados_discusion.md  # Results and Discussion
+│   └── 07_conclusiones_trabajo_futuro.md # Conclusions
+├── src/
+│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
+│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
+│   ├── dataset_manager.py                        # ImageTextDataset class
+│   ├── prepare_dataset.ipynb                     # Dataset preparation
+│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
+├── results/                 # Benchmark results CSVs
+├── instructions/            # UNIR PDF document used as dataset
+├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
+└── README.md
+```
+
+## Important Data Files
+
+### Results CSV Files
+- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics
+- `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv` - Per-page OCR benchmark results
+
+### Key Notebooks
+- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
+- `src/prepare_dataset.ipynb` - PDF to image/text conversion
+- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison
+
+## Technical Stack
+
+| Component | Version |
+|-----------|---------|
+| Python | 3.11.9 |
+| PaddlePaddle | 3.2.2 |
+| PaddleOCR | 3.3.2 |
+| Ray | 2.52.1 |
+| Optuna | 4.6.0 |
+
+## Pending Work
+
+### Priority Tasks
+1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
+2. **Expand dataset** - Current dataset has only 24 pages
+3. **Complete unified thesis document** - Merge docs/ chapters into final UNIR format
+4. **Create presentation slides** - For thesis defense
+
+### Optional Extensions
+- Explore the `text_det_unclip_ratio` parameter (fixed at 0.0 in these experiments)
+- Compare with actual fine-tuning (if GPU access is obtained)
+- Multi-objective optimization (CER + WER + inference time)
+
+## Guidelines for Claude
+
+### When Working on This Project
+
+1. **Be rigorous with data**: Only cite numbers from actual CSV files and notebook outputs. Do not fabricate comparison data.
+
+2. **Reference sources**: When discussing results, reference the specific files:
+   - Ray Tune results: `src/raytune_paddle_subproc_results_20251207_192320.csv`
+   - Benchmark results: `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv`
+
+3. **Key files to read first**:
+   - This file (`claude.md`) for context
+   - `README.md` for current project state
+   - Relevant `docs/` chapter for specific topics
+
+4. **Language**: Thesis documentation is written in Spanish (a thesis requirement); code comments are written in English.
+
+5. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training or real-time processing should acknowledge this limitation.
+
+### Common Tasks
+
+- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
+- **Updating documentation**: Edit files in `docs/`
+- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
+- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI
+
+## Experiment Details
+
+### Ray Tune Configuration
+```python
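+from ray import tune
+from ray.tune.search.optuna import OptunaSearch
+
+# trainable_paddle_ocr is defined elsewhere in the project (see the main
+# notebook); the search space (param_space) is not shown in this excerpt.
+tuner = tune.Tuner(
+    trainable_paddle_ocr,
+    tune_config=tune.TuneConfig(
+        metric="CER",               # objective reported by each trial
+        mode="min",                 # lower CER is better
+        search_alg=OptunaSearch(),  # Optuna-driven search
+        num_samples=64,             # total number of trials
+        max_concurrent_trials=2,    # trials evaluated in parallel
+    ),
+)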
+```
+
+### Dataset
+- Source: UNIR TFE instructions PDF
+- Pages: 24
+- Resolution: 300 DPI
+- Ground truth: Extracted via PyMuPDF
+
+### Metrics
+- CER (Character Error Rate) - Primary metric
+- WER (Word Error Rate) - Secondary metric
+- Both are calculated using the `jiwer` library
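+
+For reference, the usual edit-distance definitions of both metrics are given below. These are the general textbook definitions, not formulas copied from the project notebooks. $S$, $D$, and $I$ are the substitutions, deletions, and insertions in the minimal alignment between the ground truth and the OCR output, and $N$ is the length of the ground truth.
+
+$$
+\text{CER} = \frac{S_c + D_c + I_c}{N_c}, \qquad \text{WER} = \frac{S_w + D_w + I_w}{N_w}
+$$
+
+The subscripts $c$ and $w$ denote counts at character level and word level, respectively. Because insertions are counted, either metric can exceed 100% on badly degraded output.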