Documentation review. (#5)
All checks were successful
build_docker / essential (push) Successful in 0s
build_docker / build_paddle_ocr (push) Successful in 5m28s
build_docker / build_paddle_ocr_gpu (push) Successful in 21m16s
build_docker / build_easyocr (push) Successful in 15m52s
build_docker / build_easyocr_gpu (push) Successful in 18m22s
build_docker / build_doctr (push) Successful in 19m3s
build_docker / build_raytune (push) Successful in 3m34s
build_docker / build_doctr_gpu (push) Successful in 13m56s

This commit was merged in pull request #5.
2026-01-20 14:33:46 +00:00
committed by Sergio Jimenez Jimenez
parent c7ed7b2b9c
commit 9ee2490097
56 changed files with 2182 additions and 945 deletions


@@ -12,39 +12,41 @@ This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence
### Why Hyperparameter Optimization Instead of Fine-tuning
Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction
The project chose **hyperparameter optimization** over fine-tuning because:
- Fine-tuning requires extensive labeled datasets specific to the domain
- Hyperparameter tuning can improve pretrained models without retraining
- GPU acceleration (RTX 3060) enables efficient exploration of hyperparameter space
### Main Results
### Main Results (GPU - Jan 2026)
| Model | CER | Character Accuracy |
|-------|-----|-------------------|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |
| PaddleOCR Baseline | 8.85% | 91.15% |
| PaddleOCR-HyperAdjust (full dataset) | **7.72%** | **92.28%** |
| PaddleOCR-HyperAdjust (best trial) | **0.79%** | **99.21%** |
**Goal achieved:** CER < 2% (target was < 2%, result is 1.49%)
**Goal status:** CER < 2% achieved in the best trial (0.79%). On the full dataset, CER drops from 8.85% to 7.72%, a 12.8% relative reduction.
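The improvement figure can be checked directly from the table values. The sketch below assumes the percentage is the CER reduction relative to the baseline; the function name `relative_cer_reduction` is illustrative and is not taken from the repository.

```python
def relative_cer_reduction(baseline_cer: float, tuned_cer: float) -> float:
    """Return the CER reduction as a percentage of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - tuned_cer) / baseline_cer * 100.0


# CER values (in percent) from the GPU results table.
BASELINE = 8.85
FULL_DATASET = 7.72
BEST_TRIAL = 0.79
TARGET = 2.0  # goal: CER < 2%

print(round(relative_cer_reduction(BASELINE, FULL_DATASET), 1))  # 12.8
print(round(relative_cer_reduction(BASELINE, BEST_TRIAL), 1))    # 91.1
print(BEST_TRIAL < TARGET, FULL_DATASET < TARGET)                # True False
```

Under this reading, the best trial meets the < 2% target while the full-dataset result does not.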
### Optimal Configuration Found
### Optimal Configuration Found (GPU)
```python
config_optimizada = {
"textline_orientation": True, # CRITICAL - reduces CER ~70%
"use_doc_orientation_classify": False,
"textline_orientation": True, # CRITICAL for complex layouts
"use_doc_orientation_classify": True, # Improves document orientation
"use_doc_unwarping": False,
"text_det_thresh": 0.4690,
"text_det_box_thresh": 0.5412,
"text_det_thresh": 0.0462, # -0.52 correlation with CER
"text_det_box_thresh": 0.4862,
"text_det_unclip_ratio": 0.0,
"text_rec_score_thresh": 0.6350,
"text_rec_score_thresh": 0.5658,
}
```
### Key Findings
1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
2. `text_det_thresh` has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs
1. `textline_orientation=True` is critical for documents with mixed layouts
2. `use_doc_orientation_classify=True` improves document orientation detection in GPU config
3. `text_det_thresh` has -0.52 correlation with CER; values < 0.01 cause catastrophic failures
4. `use_doc_unwarping=False` is optimal for digital PDFs (unnecessary processing)
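A correlation such as the -0.52 reported for `text_det_thresh` can be computed from per-trial values with a plain Pearson coefficient. This is a minimal sketch, assuming Pearson is the statistic used (the source does not say); the sample numbers are made up for illustration and are not taken from the results CSV.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, NOT real trial data.
thresholds = [0.05, 0.10, 0.20, 0.40, 0.60]
cers = [0.09, 0.05, 0.06, 0.03, 0.04]

r = pearson(thresholds, cers)
print(f"r = {r:.2f}")  # negative: higher threshold goes with lower CER here
```

A negative coefficient only describes a linear trend across trials; it does not by itself locate the threshold below which failures occur.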
## Repository Structure
@@ -99,13 +101,18 @@ The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ fil
## Important Data Files
### Results CSV Files
- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)
### Results CSV Files (GPU - PRIMARY)
- `src/results/raytune_paddle_results_20260119_122609.csv` - 64 Ray Tune trials PaddleOCR GPU (PRIMARY)
- `src/results/raytune_easyocr_results_20260119_120204.csv` - 64 Ray Tune trials EasyOCR GPU
- `src/results/raytune_doctr_results_20260119_121445.csv` - 64 Ray Tune trials DocTR GPU
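To pick the best trial out of one of these result files, something like the following could be used. It is a sketch only: the column names `trial_id`, `text_det_thresh`, and `cer` are hypothetical and may not match the real CSV headers, and the inline sample stands in for the actual file.

```python
import csv
import io
import math


def best_trial(rows, metric="cer"):
    """Return the row with the lowest numeric `metric`, skipping unusable rows."""
    scored = []
    for row in rows:
        try:
            value = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, empty cell, or non-numeric text
        if math.isnan(value):
            continue
        scored.append((value, row))
    if not scored:
        raise ValueError(f"no usable '{metric}' values found")
    return min(scored, key=lambda pair: pair[0])[1]


# Hypothetical sample with assumed column names; not real trial data.
SAMPLE_CSV = """trial_id,text_det_thresh,cer
t1,0.30,0.051
t2,0.05,0.012
t3,0.45,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
winner = best_trial(rows)
print(winner["trial_id"], winner["cer"])  # t2 0.012
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")` and `metric` set to whatever the CER column is actually called.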
### Key Notebooks
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison
### Results CSV Files (CPU - time reference only)
- `src/raytune_paddle_subproc_results_20251207_192320.csv` - CPU execution for time comparison (69.4s/page vs 0.84s/page GPU)
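The CPU and GPU per-page times quoted above translate into a speedup factor as follows; the helper name `speedup` is illustrative.

```python
def speedup(cpu_seconds_per_page: float, gpu_seconds_per_page: float) -> float:
    """Return how many times faster the GPU run is than the CPU run."""
    if gpu_seconds_per_page <= 0:
        raise ValueError("gpu_seconds_per_page must be positive")
    return cpu_seconds_per_page / gpu_seconds_per_page


# Per-page inference times quoted for the CPU and GPU runs.
print(f"{speedup(69.4, 0.84):.1f}x")  # 82.6x
```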
### Key Scripts
- `src/run_tuning.py` - Main Ray Tune optimization script
- `src/raytune/raytune_ocr.py` - Ray Tune utilities and search spaces
- `src/paddle_ocr/paddle_ocr_tuning_rest.py` - PaddleOCR REST API
## Technical Stack
@@ -128,13 +135,13 @@ The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ fil
### Priority Tasks
1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Expand dataset** - Current dataset has only 24 pages
2. **Use larger tuning subset** - Current 5 pages caused overfitting; recommend 15-20 pages
3. **Create presentation slides** - For thesis defense
4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting
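For the larger tuning subset in task 2, one option is a seeded random split into tuning pages and held-out pages, so that overfitting shows up as a gap between the two scores. This is a sketch under assumptions: the 24-page total is the dataset size given in the earlier wording of that task, 16 is an arbitrary pick within the recommended 15-20 range, and `split_pages` is a hypothetical helper, not code from the repository.

```python
import random


def split_pages(n_pages: int, n_tuning: int, seed: int = 0):
    """Split page indices into a tuning subset and a held-out subset."""
    if not 0 < n_tuning < n_pages:
        raise ValueError("n_tuning must be between 1 and n_pages - 1")
    pages = list(range(n_pages))
    random.Random(seed).shuffle(pages)  # seeded so the split is reproducible
    return sorted(pages[:n_tuning]), sorted(pages[n_tuning:])


# Assumed sizes: 24 pages in total, 16 used for tuning, 8 held out.
tuning_pages, holdout_pages = split_pages(n_pages=24, n_tuning=16, seed=42)
print(len(tuning_pages), len(holdout_pages))  # 16 8
```

Tuning would then run only on `tuning_pages`, with the final CER reported on `holdout_pages`.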
### Optional Extensions
- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
- Compare with actual fine-tuning
- Multi-objective optimization (CER + WER + inference time)
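One simple way to approach the multi-objective extension is weighted-sum scalarization, which folds CER, WER, and inference time into a single score to minimize. The sketch below is an assumption about how that might look, not the project's method; the weights and the `time_budget` normalization are hypothetical placeholders that would need tuning.

```python
def combined_objective(
    cer: float,
    wer: float,
    seconds_per_page: float,
    *,
    w_cer: float = 0.6,
    w_wer: float = 0.3,
    w_time: float = 0.1,
    time_budget: float = 1.0,
) -> float:
    """Weighted sum of CER, WER, and normalized time; lower is better.

    cer and wer are fractions in [0, 1]. Time is divided by time_budget
    and capped at 1.0 so a very slow trial cannot dominate the score.
    """
    if min(cer, wer, seconds_per_page) < 0:
        raise ValueError("metrics must be non-negative")
    if time_budget <= 0:
        raise ValueError("time_budget must be positive")
    time_term = min(seconds_per_page / time_budget, 1.0)
    return w_cer * cer + w_wer * wer + w_time * time_term


# Illustrative inputs only: two trials that differ in CER alone.
accurate = combined_objective(cer=0.01, wer=0.05, seconds_per_page=0.8)
sloppy = combined_objective(cer=0.08, wer=0.05, seconds_per_page=0.8)
print(accurate < sloppy)  # True
```

A weighted sum can only find trade-offs on the convex part of the Pareto front, so a Pareto-based search may be preferable if the three metrics conflict strongly.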
## Thesis Document Generation