MastersThesis/claude.md
2025-12-10 16:06:47 +01:00

Claude Code Context - Masters Thesis OCR Project

Project Overview

This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.

Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025

Key Context

Why Hyperparameter Optimization Instead of Fine-tuning

Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

  • Fine-tuning deep learning models without GPU is prohibitively slow
  • Inference time is ~69 seconds/page on CPU
  • Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction

Main Results

| Model | CER | Character Accuracy |
|---|---|---|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | 1.49% | 98.51% |

Goal achieved: the target was CER < 2%, and the optimized configuration reaches 1.49%.
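
The figures above are tied together by simple arithmetic: character accuracy is 100 minus CER, and the relative reduction compares the two CER values. A minimal sketch using only the rounded values from the results table; the function names are illustrative and do not come from the repository:

```python
def character_accuracy(cer_percent: float) -> float:
    """Character accuracy in percent, defined as 100 - CER."""
    return 100.0 - cer_percent


def relative_reduction(baseline_cer: float, optimized_cer: float) -> float:
    """Relative CER reduction in percent, measured against the baseline."""
    return (baseline_cer - optimized_cer) / baseline_cer * 100.0


baseline_cer = 7.78   # PaddleOCR Baseline, rounded value from the table
optimized_cer = 1.49  # PaddleOCR-HyperAdjust, rounded value from the table

print(f"Accuracy (optimized): {character_accuracy(optimized_cer):.2f}%")
print(f"Relative CER reduction: {relative_reduction(baseline_cer, optimized_cer):.2f}%")
```

With the rounded table values the reduction comes out between 80.8% and 80.9%; the reported 80.9% presumably derives from the unrounded values in the results CSV, which remains the authoritative source.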

Optimal Configuration Found

config_optimizada = {
    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

Key Findings

  1. textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
  2. text_det_thresh has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
  3. Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
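
The three findings can be turned into a quick sanity check to run before launching a trial. This is a sketch only: `check_config` is a hypothetical helper, not part of the repository, and its rules simply restate the findings listed above.

```python
def check_config(config: dict) -> list[str]:
    """Return warnings for settings that contradict the key findings."""
    warnings = []

    # Finding 1: textline_orientation=True was the most impactful parameter.
    if not config.get("textline_orientation", False):
        warnings.append("textline_orientation is disabled")

    # Finding 2: text_det_thresh values below 0.1 caused catastrophic failures.
    thresh = config.get("text_det_thresh")
    if thresh is not None and thresh < 0.1:
        warnings.append(f"text_det_thresh={thresh} is below 0.1")

    # Finding 3: document correction modules are unnecessary for digital PDFs.
    for key in ("use_doc_orientation_classify", "use_doc_unwarping"):
        if config.get(key, False):
            warnings.append(f"{key} is enabled but not needed for digital PDFs")

    return warnings


config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

print(check_config(config_optimizada))  # the optimal config raises no warnings
```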

Repository Structure

MastersThesis/
├── docs/                    # Thesis chapters in Markdown
│   ├── 00_resumen.md        # Abstract (Spanish + English)
│   ├── 01_introduccion.md   # Introduction
│   ├── 02_contexto_estado_arte.md  # Context and State of Art
│   ├── 03_objetivos_metodologia.md # Objectives and Methodology
│   ├── 04_comparativa_soluciones.md # OCR Comparative Study
│   ├── 05_optimizacion_hiperparametros.md # Ray Tune Optimization
│   ├── 06_resultados_discusion.md  # Results and Discussion
│   └── 07_conclusiones_trabajo_futuro.md # Conclusions
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
│   ├── dataset_manager.py                        # ImageTextDataset class
│   ├── prepare_dataset.ipynb                     # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                 # Benchmark results CSVs
├── instructions/            # UNIR PDF document used as dataset
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md

Important Data Files

Results CSV Files

  • src/raytune_paddle_subproc_results_20251207_192320.csv - 64 Ray Tune trials with configs and metrics
  • results/ai_ocr_benchmark_finetune_results_20251206_113206.csv - Per-page OCR benchmark results

Key Notebooks

  • src/paddle_ocr_fine_tune_unir_raytune.ipynb - Main Ray Tune experiment
  • src/prepare_dataset.ipynb - PDF to image/text conversion
  • ocr_benchmark_notebook.ipynb - EasyOCR vs PaddleOCR vs DocTR comparison

Technical Stack

| Component | Version |
|---|---|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

Pending Work

Priority Tasks

  1. Validate on other document types - Test optimal config on invoices, forms, contracts
  2. Expand dataset - Current dataset has only 24 pages
  3. Complete unified thesis document - Merge docs/ chapters into final UNIR format
  4. Create presentation slides - For thesis defense

Optional Extensions

  • Explore text_det_unclip_ratio parameter (was fixed at 0.0)
  • Compare with actual fine-tuning (if GPU access obtained)
  • Multi-objective optimization (CER + WER + inference time)

Guidelines for Claude

When Working on This Project

  1. Be rigorous with data: Only cite numbers from actual CSV files and notebook outputs. Do not fabricate comparison data.

  2. Reference sources: When discussing results, reference the specific files:

    • Ray Tune results: src/raytune_paddle_subproc_results_20251207_192320.csv
    • Benchmark results: results/ai_ocr_benchmark_finetune_results_20251206_113206.csv
  3. Key files to read first:

    • This file (claude.md) for context
    • README.md for current project state
    • Relevant docs/ chapter for specific topics
  4. Language: Documentation is in Spanish (thesis requirement), code comments in English.

  5. Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training or real-time processing should acknowledge this limitation.

Common Tasks

  • Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
  • Updating documentation: Edit files in docs/
  • Dataset expansion: Use src/prepare_dataset.ipynb as template
  • Running evaluations: Use src/paddle_ocr_tuning.py CLI

Experiment Details

Ray Tune Configuration

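from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is defined elsewhere in the project (not shown here)
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",              # optimize Character Error Rate
        mode="min",                # lower CER is better
        search_alg=OptunaSearch(),
        num_samples=64,            # 64 trials in total
        max_concurrent_trials=2,
    ),
)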

Dataset

  • Source: UNIR TFE instructions PDF
  • Pages: 24
  • Resolution: 300 DPI
  • Ground truth: Extracted via PyMuPDF
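
A dataset with these properties could be produced roughly as follows. This is a sketch under assumptions, not the code in src/prepare_dataset.ipynb: the function name `export_pages` and the output file naming are hypothetical, and PyMuPDF is imported inside the function so the module still loads where the library is not installed.

```python
from pathlib import Path


def export_pages(pdf_path: str, out_dir: str, dpi: int = 300) -> int:
    """Render each PDF page to PNG and save its embedded text as ground truth."""
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            stem = f"page_{index:04d}"  # hypothetical naming scheme
            # Rasterize the page at the requested resolution.
            page.get_pixmap(dpi=dpi).save(str(out / f"{stem}.png"))
            # The text layer of a digital PDF serves as the reference text.
            (out / f"{stem}.txt").write_text(page.get_text(), encoding="utf-8")
        return doc.page_count
```

Called on the instructions PDF, this should return 24 and leave one image and one text file per page in the output directory.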

Metrics

  • CER (Character Error Rate) - Primary metric
  • WER (Word Error Rate) - Secondary metric
  • Calculated using jiwer library
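
For reference, both metrics are edit distances normalized by the length of the ground truth: characters for CER, words for WER. The project computes them with jiwer; the pure-Python sketch below only illustrates the definition, assumes a non-empty reference, and applies no text normalization, so its values may differ from a jiwer pipeline that does.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abcd", "abxd"))  # one substitution over four characters
print(wer("el trabajo fin de master", "el trabajo de master"))  # one deleted word over five
```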