# Claude Code Context - Master's Thesis OCR Project

## Project Overview

This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.

**Author:** Sergio Jiménez Jiménez
**University:** UNIR (Universidad Internacional de La Rioja)
**Year:** 2025

## Key Context

### Why Hyperparameter Optimization Instead of Fine-tuning

Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

- Fine-tuning deep learning models without a GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving an 80.9% reduction in CER

### Main Results

| Model | CER | Character Accuracy |
|-------|-----|--------------------|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |

**Goal achieved:** the target was CER < 2%; the result is 1.49%.

### Optimal Configuration Found

```python
config_optimizada = {
    "textline_orientation": True,   # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
```
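For quick reference, below is a minimal sketch of how this configuration could be applied with the PaddleOCR 3.x Python API. The constructor keyword names (notably `use_textline_orientation` as the API-side name for the `textline_orientation` flag above) and the `predict()` call follow the PaddleOCR 3.x quickstart pattern and are assumptions, not code copied from this repository; verify against `src/paddle_ocr_tuning.py` before relying on it.

```python
# Hedged sketch: applying the tuned configuration with PaddleOCR 3.x.
# The mapping of "textline_orientation" to use_textline_orientation is
# an assumption; the image path is hypothetical.
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_textline_orientation=True,        # most impactful parameter
    use_doc_orientation_classify=False,   # document-level correction disabled
    use_doc_unwarping=False,
    text_det_thresh=0.4690,
    text_det_box_thresh=0.5412,
    text_det_unclip_ratio=0.0,
    text_rec_score_thresh=0.6350,
)

# Run inference on a single page image and print the recognized text.
results = ocr.predict("page_001.png")
for res in results:
    res.print()
```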
### Key Findings

1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
2. `text_det_thresh` has a -0.52 correlation with CER; values < 0.1 cause catastrophic failures
3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs

## Repository Structure

```
MastersThesis/
├── docs/                                   # Thesis chapters in Markdown
│   ├── 00_resumen.md                       # Abstract (Spanish + English)
│   ├── 01_introduccion.md                  # Introduction
│   ├── 02_contexto_estado_arte.md          # Context and State of the Art
│   ├── 03_objetivos_metodologia.md         # Objectives and Methodology
│   ├── 04_comparativa_soluciones.md        # OCR Comparative Study
│   ├── 05_optimizacion_hiperparametros.md  # Ray Tune Optimization
│   ├── 06_resultados_discusion.md          # Results and Discussion
│   └── 07_conclusiones_trabajo_futuro.md   # Conclusions
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb              # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                                 # CLI evaluation script
│   ├── dataset_manager.py                                   # ImageTextDataset class
│   ├── prepare_dataset.ipynb                                # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                                # Benchmark results CSVs
├── instructions/                           # UNIR PDF document used as dataset
├── ocr_benchmark_notebook.ipynb            # Initial OCR benchmark
└── README.md
```

## Important Data Files

### Results CSV Files

- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics
- `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv` - Per-page OCR benchmark results

### Key Notebooks

- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison

## Technical Stack

| Component | Version |
|-----------|---------|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

## Pending Work

### Priority Tasks

1. **Validate on other document types** - Test the optimal config on invoices, forms, and contracts
2. **Expand dataset** - The current dataset has only 24 pages
3. **Complete unified thesis document** - Merge `docs/` chapters into the final UNIR format
4. **Create presentation slides** - For the thesis defense

### Optional Extensions

- Explore the `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access is obtained)
- Multi-objective optimization (CER + WER + inference time)

## Guidelines for Claude

### When Working on This Project

1. **Be rigorous with data**: Only cite numbers from actual CSV files and notebook outputs. Do not fabricate comparison data.
2. **Reference sources**: When discussing results, reference the specific files:
   - Ray Tune results: `src/raytune_paddle_subproc_results_20251207_192320.csv`
   - Benchmark results: `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv`
3. **Key files to read first**:
   - This file (`claude.md`) for context
   - `README.md` for the current project state
   - The relevant `docs/` chapter for specific topics
4. **Language**: Documentation is in Spanish (thesis requirement); code comments are in English.
5. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training or real-time processing should acknowledge this limitation.

### Common Tasks

- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
- **Updating documentation**: Edit files in `docs/`
- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as a template
- **Running evaluations**: Use the `src/paddle_ocr_tuning.py` CLI

## Experiment Details

### Ray Tune Configuration

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

tuner = tune.Tuner(
    trainable_paddle_ocr,                 # trainable defined in the notebook
    tune_config=tune.TuneConfig(
        metric="CER",
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,
        max_concurrent_trials=2,
    ),
)
```

### Dataset

- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF

### Metrics

- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using the `jiwer` library
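As a pointer for future edits, here is a minimal sketch of computing both metrics with `jiwer` (`jiwer.cer` and `jiwer.wer`). The helper name and example strings are illustrative only; the project's actual metric computation lives in the Ray Tune notebook and `src/paddle_ocr_tuning.py`.

```python
# Hedged sketch: per-page CER/WER with jiwer. evaluate_page() is a
# hypothetical helper name, not a function from this repository.
import jiwer


def evaluate_page(ground_truth: str, ocr_output: str) -> dict:
    """Return CER and WER for a single page (reference vs. hypothesis)."""
    return {
        "CER": jiwer.cer(ground_truth, ocr_output),
        "WER": jiwer.wer(ground_truth, ocr_output),
    }


# Example usage with illustrative strings (one character substitution).
metrics = evaluate_page(
    ground_truth="Trabajo Fin de Máster en Inteligencia Artificial",
    ocr_output="Trabajo Fin de Master en Inteligencia Artificial",
)
print(metrics)  # e.g. {'CER': 0.0208..., 'WER': 0.1428...}
```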