# Claude Code Context - Masters Thesis OCR Project

## Project Overview

This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.

**Author:** Sergio Jiménez Jiménez
**University:** UNIR (Universidad Internacional de La Rioja)
**Year:** 2025

## Key Context

### Why Hyperparameter Optimization Instead of Fine-tuning

Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
- Fine-tuning deep learning models without a GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving an 80.9% CER reduction

### Main Results

| Model | CER | Character Accuracy |
|-------|-----|-------------------|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |

**Goal achieved:** the optimized CER of 1.49% is below the 2% target.
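
As a quick check of how the headline figures relate, the sketch below recomputes the relative CER reduction and the character accuracy from the rounded table values. The helper name `relative_reduction` is hypothetical and not part of the project code. With the rounded inputs the reduction comes out at roughly 80.8%; the reported 80.9% was presumably computed from unrounded CER values.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline


# Rounded values copied from the Main Results table (in percent).
baseline_cer = 7.78   # PaddleOCR Baseline
optimized_cer = 1.49  # PaddleOCR-HyperAdjust

reduction = relative_reduction(baseline_cer, optimized_cer)
print(f"Relative CER reduction: {reduction:.1%}")

# Character accuracy is reported as the complement of CER.
print(f"Character accuracy: {100 - optimized_cer:.2f}%")
```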

### Optimal Configuration Found

```python
config_optimizada = {
    "textline_orientation": True,  # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
```

### Key Findings

1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
2. `text_det_thresh` has a -0.52 correlation with CER; values < 0.1 cause catastrophic failures
3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs
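
To make the correlation in finding 2 concrete, here is a minimal sketch of a Pearson correlation coefficient in plain Python. It assumes the reported -0.52 is Pearson's r, which this file does not state. The function name `pearson_r` and the toy numbers are illustrative only and are not taken from the trial results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only; NOT taken from the thesis results.
toy_thresh = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_cer = [0.09, 0.06, 0.05, 0.03, 0.02]

r = pearson_r(toy_thresh, toy_cer)
print(f"r = {r:.2f}")  # negative: higher threshold goes with lower CER here
```

A negative coefficient means that, across trials, higher `text_det_thresh` values tend to coincide with lower CER; it says nothing on its own about the catastrophic region below 0.1.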

## Repository Structure

```
MastersThesis/
├── docs/  # Thesis chapters in Markdown
│   ├── 00_resumen.md  # Abstract (Spanish + English)
│   ├── 01_introduccion.md  # Introduction
│   ├── 02_contexto_estado_arte.md  # Context and State of Art
│   ├── 03_objetivos_metodologia.md  # Objectives and Methodology
│   ├── 04_comparativa_soluciones.md  # OCR Comparative Study
│   ├── 05_optimizacion_hiperparametros.md  # Ray Tune Optimization
│   ├── 06_resultados_discusion.md  # Results and Discussion
│   └── 07_conclusiones_trabajo_futuro.md  # Conclusions
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py  # CLI evaluation script
│   ├── dataset_manager.py  # ImageTextDataset class
│   ├── prepare_dataset.ipynb  # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/  # Benchmark results CSVs
├── instructions/  # UNIR PDF document used as dataset
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md
```

## Important Data Files

### Results CSV Files
- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics
- `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv` - Per-page OCR benchmark results
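
A minimal sketch of how the best trial could be pulled from a results CSV using only the standard library. The column names `CER`, `trial_id` and `text_det_thresh` are assumptions (the actual headers of the results files are not listed here), and the in-memory sample uses made-up values so the snippet runs without the repository.

```python
import csv
import io


def best_trial(rows, metric="CER"):
    """Return the row with the lowest metric value.

    Rows where the metric is missing or not numeric are skipped.
    """
    scored = []
    for row in rows:
        try:
            scored.append((float(row[metric]), row))
        except (KeyError, TypeError, ValueError):
            continue
    if not scored:
        raise ValueError(f"no row has a numeric '{metric}' value")
    return min(scored, key=lambda pair: pair[0])[1]


# In-memory stand-in for a results file; values are made up.
sample_csv = io.StringIO(
    "trial_id,CER,text_det_thresh\n"
    "a,0.05,0.30\n"
    "b,0.02,0.45\n"
    "c,,0.05\n"  # failed trial with no metric recorded
)

best = best_trial(csv.DictReader(sample_csv))
print(best["trial_id"], best["CER"])
```

For a real file, the same function can be fed `csv.DictReader(f)` from a file opened with `open(path, newline="", encoding="utf-8")`, after checking the header row for the actual metric column name.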

### Key Notebooks
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison

## Technical Stack

| Component | Version |
|-----------|---------|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

## Pending Work

### Priority Tasks
1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Expand dataset** - Current dataset has only 24 pages
3. **Complete unified thesis document** - Merge `docs/` chapters into final UNIR format
4. **Create presentation slides** - For thesis defense

### Optional Extensions
- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access is obtained)
- Multi-objective optimization (CER + WER + inference time)

## Guidelines for Claude

### When Working on This Project

1. **Be rigorous with data**: Only cite numbers from actual CSV files and notebook outputs. Do not fabricate comparison data.

2. **Reference sources**: When discussing results, reference the specific files:
   - Ray Tune results: `src/raytune_paddle_subproc_results_20251207_192320.csv`
   - Benchmark results: `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv`

3. **Key files to read first**:
   - This file (`claude.md`) for context
   - `README.md` for current project state
   - Relevant `docs/` chapter for specific topics

4. **Language**: Documentation is in Spanish (thesis requirement); code comments are in English.

5. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training or real-time processing should acknowledge this limitation.

### Common Tasks

- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
- **Updating documentation**: Edit files in `docs/`
- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI

## Experiment Details

### Ray Tune Configuration
```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr and the search space are defined elsewhere (not shown here)
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,
        max_concurrent_trials=2,
    ),
)
```

### Dataset
- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF

### Metrics
- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using the `jiwer` library
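
To show what these two metrics measure, the sketch below computes CER and WER from a plain edit distance. It is a dependency-free illustration, not the `jiwer` implementation used in the project; `jiwer` may apply its own text normalization, so its values can differ. The function names and the sample strings are hypothetical.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `ref` into sequence `hyp`."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # delete r from ref
                current[j - 1] + 1,          # insert h into ref
                previous[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example: two character substitutions, affecting two of four words.
reference = "trabajo fin de máster"
hypothesis = "trabaja fin de master"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.2f}")
```

The example also shows why CER is the more forgiving metric for OCR: a single wrong character counts as one error out of many characters for CER, but makes the whole word wrong for WER.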