Documentation review. (#5)

2026-01-20 14:33:46 +00:00
parent c7ed7b2b9c
commit 9ee2490097
56 changed files with 2182 additions and 945 deletions
--- a/claude.md
+++ b/claude.md
@@ -12,39 +12,41 @@ This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence

 ### Why Hyperparameter Optimization Instead of Fine-tuning

-Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction
+The project chose **hyperparameter optimization** over fine-tuning because:
+- Fine-tuning requires extensive labeled datasets specific to the domain
+- Hyperparameter tuning can improve pretrained models without retraining
+- GPU acceleration (RTX 3060) enables efficient exploration of the hyperparameter space

-### Main Results
+### Main Results (GPU - Jan 2026)

 | Model | CER | Character Accuracy |
 |-------|-----|-------------------|
-| PaddleOCR Baseline | 7.78% | 92.22% |
-| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |
+| PaddleOCR Baseline | 8.85% | 91.15% |
+| PaddleOCR-HyperAdjust (full dataset) | **7.72%** | **92.28%** |
+| PaddleOCR-HyperAdjust (best trial) | **0.79%** | **99.21%** |

-**Goal achieved:** CER < 2% (target was < 2%, result is 1.49%)
+**Goal status:** CER < 2% achieved only in the best trial (0.79%). On the full dataset, CER falls from 8.85% to 7.72%, a 12.8% relative reduction.

-### Optimal Configuration Found
+### Optimal Configuration Found (GPU)

 ```python
 config_optimizada = {
-    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
-    "use_doc_orientation_classify": False,
+    "textline_orientation": True,           # CRITICAL for complex layouts
+    "use_doc_orientation_classify": True,   # Improves document orientation
    "use_doc_unwarping": False,
-    "text_det_thresh": 0.4690,
-    "text_det_box_thresh": 0.5412,
+    "text_det_thresh": 0.0462,              # -0.52 correlation with CER
+    "text_det_box_thresh": 0.4862,
    "text_det_unclip_ratio": 0.0,
-    "text_rec_score_thresh": 0.6350,
+    "text_rec_score_thresh": 0.5658,
 }
 ```

 ### Key Findings

-1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
-2. `text_det_thresh` has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
-3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs
+1. `textline_orientation=True` is critical for documents with mixed layouts
+2. `use_doc_orientation_classify=True` is enabled in the best GPU configuration, where it improves document orientation detection
+3. `text_det_thresh` has a -0.52 correlation with CER; values < 0.01 cause catastrophic failures
+4. `use_doc_unwarping=False` is optimal for digital PDFs (unwarping adds unnecessary processing)
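As a quick sanity check on the figures in the results table, the relative CER reduction and the character accuracy can be recomputed from the reported CER values. This is only a minimal sketch: the helper names `relative_cer_reduction` and `character_accuracy` are hypothetical and are not part of the repository.

```python
def relative_cer_reduction(baseline_cer: float, tuned_cer: float) -> float:
    """Relative CER reduction in percent. Both inputs are CER values in percent."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - tuned_cer) / baseline_cer * 100.0


def character_accuracy(cer: float) -> float:
    """Character accuracy in percent, defined here as 100 minus CER."""
    return 100.0 - cer


# CER values (in percent) copied from the GPU results table.
baseline = 8.85
full_dataset = 7.72
best_trial = 0.79

print(f"{relative_cer_reduction(baseline, full_dataset):.1f}%")  # 12.8%
print(f"{relative_cer_reduction(baseline, best_trial):.1f}%")    # 91.1%
print(f"{character_accuracy(full_dataset):.2f}%")                # 92.28%
```

The 12.8% figure is a relative reduction. In absolute terms the full-dataset CER drops by 1.13 percentage points.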

 ## Repository Structure

@@ -99,13 +101,18 @@ The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ fil

 ## Important Data Files

-### Results CSV Files
- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)
+### Results CSV Files (GPU - PRIMARY)
+- `src/results/raytune_paddle_results_20260119_122609.csv` - 64 Ray Tune trials, PaddleOCR on GPU (PRIMARY)
+- `src/results/raytune_easyocr_results_20260119_120204.csv` - 64 Ray Tune trials, EasyOCR on GPU
+- `src/results/raytune_doctr_results_20260119_121445.csv` - 64 Ray Tune trials, DocTR on GPU

-### Key Notebooks
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison
+### Results CSV Files (CPU - time reference only)
+- `src/raytune_paddle_subproc_results_20251207_192320.csv` - CPU run, kept only for timing comparison (69.4 s/page on CPU vs 0.84 s/page on GPU)
+
+### Key Scripts
+- `src/run_tuning.py` - Main Ray Tune optimization script
+- `src/raytune/raytune_ocr.py` - Ray Tune utilities and search spaces
+- `src/paddle_ocr/paddle_ocr_tuning_rest.py` - PaddleOCR REST API

 ## Technical Stack

@@ -128,13 +135,13 @@ The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ fil

 ### Priority Tasks
 1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
-2. **Expand dataset** - Current dataset has only 24 pages
+2. **Use a larger tuning subset** - The current 5-page subset caused overfitting; 15-20 pages are recommended
 3. **Create presentation slides** - For thesis defense
 4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting

 ### Optional Extensions
 - Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
+- Compare with actual fine-tuning
 - Multi-objective optimization (CER + WER + inference time)

 ## Thesis Document Generation