Claude Code Context - Master's Thesis OCR Project
Project Overview
This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.
- Author: Sergio Jiménez Jiménez
- University: UNIR (Universidad Internacional de La Rioja)
- Year: 2025
Key Context
Why Hyperparameter Optimization Instead of Fine-tuning
Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction
Main Results
| Model | CER | Character Accuracy |
|---|---|---|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | 1.49% | 98.51% |
Goal achieved: the target was CER < 2%, and the optimized configuration reaches 1.49%.
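The figures in the table are linked by simple arithmetic. The sketch below recomputes character accuracy as 100 minus CER, and the relative CER reduction from the two CER values in the table. The function names are illustrative and are not part of the project code.

```python
def character_accuracy(cer_percent: float) -> float:
    """Character accuracy (%) as the complement of CER (%)."""
    return 100.0 - cer_percent


def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction (%) of a metric with respect to its baseline."""
    return (baseline - optimized) / baseline * 100.0


baseline_cer = 7.78   # PaddleOCR Baseline (rounded value from the table)
optimized_cer = 1.49  # PaddleOCR-HyperAdjust (rounded value from the table)

print(f"Accuracy: {character_accuracy(optimized_cer):.2f}%")
print(f"CER reduction: {relative_reduction(baseline_cer, optimized_cer):.2f}%")
```

With the rounded table values the reduction comes out at roughly 80.8%, which is consistent with the reported 80.9% once rounding of the two inputs is taken into account.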
Optimal Configuration Found
config_optimizada = {
    "textline_orientation": True,  # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
Key Findings
- textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
- text_det_thresh has a -0.52 correlation with CER; values < 0.1 cause catastrophic failures
- Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
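These findings can be turned into a quick sanity check for any candidate configuration before it is evaluated. The following is a minimal sketch: `check_config` is a hypothetical helper, not something that exists in the repository, and it only encodes the three findings listed above.

```python
def check_config(config: dict) -> list[str]:
    """Return warnings for a candidate OCR configuration (hypothetical helper)."""
    warnings = []
    # Finding 1: textline orientation was the most impactful parameter.
    if not config.get("textline_orientation", False):
        warnings.append("textline_orientation is disabled")
    # Finding 2: very low detection thresholds caused catastrophic failures.
    if config.get("text_det_thresh", 1.0) < 0.1:
        warnings.append("text_det_thresh is below 0.1")
    # Finding 3: document correction modules are unnecessary for digital PDFs.
    for key in ("use_doc_orientation_classify", "use_doc_unwarping"):
        if config.get(key, False):
            warnings.append(f"{key} is enabled but unnecessary for digital PDFs")
    return warnings


config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

print(check_config(config_optimizada))
```

The optimal configuration produces no warnings, while a configuration such as `{"text_det_thresh": 0.05}` would be flagged twice (orientation disabled, threshold too low).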
Repository Structure
MastersThesis/
├── docs/ # Thesis chapters in Markdown
│ ├── 00_resumen.md # Abstract (Spanish + English)
│ ├── 01_introduccion.md # Introduction
│ ├── 02_contexto_estado_arte.md # Context and State of Art
│ ├── 03_objetivos_metodologia.md # Objectives and Methodology
│ ├── 04_comparativa_soluciones.md # OCR Comparative Study
│ ├── 05_optimizacion_hiperparametros.md # Ray Tune Optimization
│ ├── 06_resultados_discusion.md # Results and Discussion
│ └── 07_conclusiones_trabajo_futuro.md # Conclusions
├── src/
│ ├── paddle_ocr_fine_tune_unir_raytune.ipynb # Main experiment (64 trials)
│ ├── paddle_ocr_tuning.py # CLI evaluation script
│ ├── dataset_manager.py # ImageTextDataset class
│ ├── prepare_dataset.ipynb # Dataset preparation
│ └── raytune_paddle_subproc_results_20251207_192320.csv # 64 trial results
├── results/ # Benchmark results CSVs
├── instructions/ # UNIR PDF document used as dataset
├── ocr_benchmark_notebook.ipynb # Initial OCR benchmark
└── README.md
Important Data Files
Results CSV Files
- src/raytune_paddle_subproc_results_20251207_192320.csv: 64 Ray Tune trials with configs and metrics
- results/ai_ocr_benchmark_finetune_results_20251206_113206.csv: Per-page OCR benchmark results
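Since every number cited in the thesis should come from these files, a small reader for the trial results can be handy. The sketch below uses only the standard library and assumes the CSV has a column named `CER` (the metric name used in the Ray Tune configuration); the actual column names should be checked against the file. `best_trial` is a hypothetical helper.

```python
import csv
import io


def best_trial(csv_file, metric: str = "CER") -> dict:
    """Return the row with the lowest value of `metric` from an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    return min(rows, key=lambda row: float(row[metric]))


# Made-up rows purely to exercise the function; these are NOT project results.
sample = io.StringIO("trial_id,CER\na,0.30\nb,0.10\nc,0.20\n")
print(best_trial(sample)["trial_id"])
```

For the real data, the same function would be called on a file opened with `open("src/raytune_paddle_subproc_results_20251207_192320.csv", newline="")`.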
Key Notebooks
- src/paddle_ocr_fine_tune_unir_raytune.ipynb: Main Ray Tune experiment
- src/prepare_dataset.ipynb: PDF to image/text conversion
- ocr_benchmark_notebook.ipynb: EasyOCR vs PaddleOCR vs DocTR comparison
Technical Stack
| Component | Version |
|---|---|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |
Pending Work
Priority Tasks
- Validate on other document types: Test optimal config on invoices, forms, contracts
- Expand dataset: Current dataset has only 24 pages
- Complete unified thesis document: Merge docs/ chapters into final UNIR format
- Create presentation slides: For thesis defense
Optional Extensions
- Explore the text_det_unclip_ratio parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access is obtained)
- Multi-objective optimization (CER + WER + inference time)
Guidelines for Claude
When Working on This Project
- Be rigorous with data: Only cite numbers from actual CSV files and notebook outputs. Do not fabricate comparison data.
- Reference sources: When discussing results, reference the specific files:
  - Ray Tune results: src/raytune_paddle_subproc_results_20251207_192320.csv
  - Benchmark results: results/ai_ocr_benchmark_finetune_results_20251206_113206.csv
- Key files to read first:
  - This file (claude.md) for context
  - README.md for current project state
  - Relevant docs/ chapter for specific topics
- Language: Documentation is in Spanish (thesis requirement), code comments in English.
- Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training or real-time processing should acknowledge this limitation.
Common Tasks
- Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
- Updating documentation: Edit files in docs/
- Dataset expansion: Use src/prepare_dataset.ipynb as template
- Running evaluations: Use the src/paddle_ocr_tuning.py CLI
Experiment Details
Ray Tune Configuration
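from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is the project's trainable function (not shown here)
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",              # minimize Character Error Rate
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,            # 64 trials in total
        max_concurrent_trials=2,
    ),
)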
Dataset
- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF
Metrics
- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using the jiwer library
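To make the two metrics concrete, here is a pure-Python sketch of how they are defined: the edit (Levenshtein) distance between reference and OCR output, divided by the reference length, counted in characters for CER and in words for WER. This is only an illustration of the definitions; the project computes them with jiwer, whose results may differ depending on the text normalization applied. All function names below are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated words; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "optimización de hiperparámetros"
hypothesis = "optimizacion de hiperparámetros"  # one accent lost by the OCR
print(f"CER: {cer(reference, hypothesis):.4f}")
print(f"WER: {wer(reference, hypothesis):.4f}")
```

A single wrong character barely moves CER but invalidates a whole word for WER, which is one reason CER is the more fine-grained choice for a primary metric.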