raytune as docker
Some checks failed
build_docker / essential (pull_request) Successful in 1s
build_docker / build_cpu (pull_request) Successful in 4m14s
build_docker / build_easyocr (pull_request) Successful in 12m19s
build_docker / build_easyocr_gpu (pull_request) Successful in 14m2s
build_docker / build_doctr (pull_request) Successful in 12m24s
build_docker / build_doctr_gpu (pull_request) Successful in 13m10s
build_docker / build_raytune (pull_request) Successful in 1m50s
build_docker / build_gpu (pull_request) Has been cancelled

This commit is contained in:
2026-01-19 16:32:45 +01:00
parent d67cbd4677
commit 94b25f9752
20 changed files with 7214 additions and 112 deletions


@@ -25,6 +25,7 @@ jobs:
image_easyocr_gpu: seryus.ddns.net/unir/easyocr-gpu
image_doctr: seryus.ddns.net/unir/doctr-cpu
image_doctr_gpu: seryus.ddns.net/unir/doctr-gpu
image_raytune: seryus.ddns.net/unir/raytune
steps:
- name: Output version info
run: |
@@ -205,3 +206,32 @@ jobs:
tags: |
${{ needs.essential.outputs.image_doctr_gpu }}:${{ needs.essential.outputs.Version }}
${{ needs.essential.outputs.image_doctr_gpu }}:latest
# Ray Tune OCR image (amd64 only)
build_raytune:
runs-on: ubuntu-latest
needs: essential
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Gitea Registry
uses: docker/login-action@v3
with:
registry: ${{ needs.essential.outputs.repo }}
username: username
password: ${{ secrets.CI_READWRITE }}
- name: Build and push Ray Tune image
uses: docker/build-push-action@v5
with:
context: src/raytune
file: src/raytune/Dockerfile
platforms: linux/amd64
push: true
tags: |
${{ needs.essential.outputs.image_raytune }}:${{ needs.essential.outputs.Version }}
${{ needs.essential.outputs.image_raytune }}:latest


@@ -18,11 +18,15 @@ Optimizar el rendimiento de PaddleOCR para documentos académicos en español me
## Main Results
**Table.** *Comparison of OCR metrics between the baseline and optimized configurations.*
| Model | CER | Character Accuracy | WER | Word Accuracy |
|--------|-----|---------------------|-----|-------------------|
| PaddleOCR (Baseline) | 7.78% | 92.22% | 14.94% | 85.06% |
| **PaddleOCR-HyperAdjust** | **1.49%** | **98.51%** | **7.62%** | **92.38%** |
*Source: Own elaboration.*
**Improvement achieved:** CER reduction of **80.9%**
### Optimal Configuration Found
@@ -56,6 +60,8 @@ PDF (académico UNIR)
### Optimization Experiment
**Table.** *Configuration parameters of the Ray Tune experiment.*
| Parameter | Value |
|-----------|-------|
| Number of trials | 64 |
@@ -64,6 +70,8 @@ PDF (académico UNIR)
| Concurrent trials | 2 |
| Total time | ~6 hours (CPU) |
*Source: Own elaboration.*
---
## Repository Structure
@@ -143,16 +151,20 @@ Se realizó una validación adicional con aceleración GPU para evaluar la viabi
## Requirements
**Table.** *Main project dependencies and versions used.*
| Component | Version |
|------------|---------|
| Python | 3.11.9 |
| Python | 3.12.3 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |
| Optuna | 4.7.0 |
| jiwer | (for CER/WER metrics) |
| PyMuPDF | (for PDF conversion) |
*Source: Own elaboration.*
---
## Usage
@@ -262,11 +274,15 @@ python3 apply_content.py
### Input and Output Files
**Table.** *Generation scripts with their input and output files.*
| Script | Input | Output |
|--------|---------|--------|
| `generate_mermaid_figures.py` | `docs/*.md` (```mermaid``` blocks) | `thesis_output/figures/figura_*.png`, `figures_manifest.json` |
| `apply_content.py` | `instructions/plantilla_individual.htm`, `docs/*.md`, `thesis_output/figures/*.png` | `thesis_output/plantilla_individual.htm` |
*Source: Own elaboration.*
### Automatically Generated Content
- **30 tables** in APA format (Table X. *Title* + Source: ...)


@@ -6,7 +6,8 @@ import os
from bs4 import BeautifulSoup, NavigableString
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
TEMPLATE = os.path.join(BASE_DIR, 'thesis_output/plantilla_individual.htm')
TEMPLATE_INPUT = os.path.join(BASE_DIR, 'instructions/plantilla_individual.htm')
TEMPLATE_OUTPUT = os.path.join(BASE_DIR, 'thesis_output/plantilla_individual.htm')
DOCS_DIR = os.path.join(BASE_DIR, 'docs')
# Global counters for tables and figures
@@ -365,7 +366,7 @@ def main():
global table_counter, figure_counter
print("Reading template...")
html_content = read_file(TEMPLATE)
html_content = read_file(TEMPLATE_INPUT)
soup = BeautifulSoup(html_content, 'html.parser')
print("Reading docs content...")
@@ -595,9 +596,9 @@ def main():
print("Saving modified template...")
output_html = str(soup)
write_file(TEMPLATE, output_html)
write_file(TEMPLATE_OUTPUT, output_html)
print(f"✓ Done! Modified: {TEMPLATE}")
print(f"✓ Done! Modified: {TEMPLATE_OUTPUT}")
print("\nTo convert to DOCX:")
print("1. Open the .htm file in Microsoft Word")
print("2. Replace [Insertar diagrama Mermaid aquí] placeholders with actual diagrams")


@@ -18,6 +18,8 @@ El procesamiento de documentos en español presenta particularidades que complic
Table 1 summarizes the main linguistic challenges of OCR in Spanish:
**Table 1.** *Spanish-specific linguistic challenges for OCR.*
| Challenge | Description | Impact on OCR |
|---------|-------------|----------------|
| Special characters | ñ, á, é, í, ó, ú, ü, ¿, ¡ | Confusion with similar characters (n/ñ, a/á) |
@@ -25,7 +27,7 @@ La Tabla 1 resume los principales desafíos lingüísticos del OCR en español:
| Abbreviations | Dr., Sra., Ud., etc. | Internal periods confuse segmentation |
| Proper names | Accents in surnames (García, Martínez) | Databases without Unicode support |
*Table 1. Spanish-specific linguistic challenges for OCR. Source: Own elaboration.*
*Source: Own elaboration.*
Beyond the linguistic aspects, academic and administrative documents in Spanish have typographic characteristics that complicate recognition: font variation across headings, body text, and footnotes; tables with borders and cells; institutional logos; watermarks; and graphic elements such as signatures or stamps. These elements introduce noise that can propagate to downstream applications such as named-entity extraction or semantic analysis.
@@ -37,6 +39,8 @@ La adaptación de modelos preentrenados a dominios específicos típicamente req
Table 2 illustrates the typical requirements of different OCR improvement strategies:
**Table 2.** *Comparison of OCR model improvement strategies.*
| Strategy | Data required | Hardware | Time | Expertise |
|------------|------------------|----------|--------|-----------|
| Full fine-tuning | >10,000 labeled images | GPU (≥16GB VRAM) | Days-Weeks | High |
@@ -44,7 +48,7 @@ La Tabla 2 ilustra los requisitos típicos para diferentes estrategias de mejora
| Transfer learning | >500 labeled images | GPU (≥8GB VRAM) | Hours | Medium |
| **Hyperparameter optimization** | **<100 validation images** | **CPU is sufficient** | **Hours** | **Low-Medium** |
*Table 2. Comparison of OCR model improvement strategies. Source: Own elaboration.*
*Source: Own elaboration.*
### The opportunity: optimization without fine-tuning
@@ -88,6 +92,8 @@ Una solución técnicamente superior pero impracticable tiene valor limitado. Es
This work focuses specifically on:
**Table 3.** *Delimitation of the scope of the work.*
| Aspect | In scope | Out of scope |
|---------|-------------------|-------------------|
| **Document type** | Digital academic documents (PDF) | Scanned, handwritten documents |
@@ -96,7 +102,7 @@ Este trabajo se centra específicamente en:
| **Improvement method** | Hyperparameter optimization | Fine-tuning, data augmentation |
| **Hardware** | CPU execution | GPU acceleration |
*Table 3. Delimitation of the scope of the work. Source: Own elaboration.*
*Source: Own elaboration.*
### Relevance and beneficiaries


@@ -8,6 +8,8 @@ Este capítulo establece los objetivos del trabajo siguiendo la metodología SMA
### SMART Justification of the General Objective
**Table 4.** *SMART justification of the general objective.*
| Criterion | Fulfillment |
|----------|--------------|
| **Specific (S)** | It clearly defines what is to be achieved: optimizing PaddleOCR through hyperparameter tuning for documents in Spanish |
@@ -16,6 +18,8 @@ Este capítulo establece los objetivos del trabajo siguiendo la metodología SMA
| **Relevant (R)** | The impact is demonstrable: it improves text extraction from academic documents without additional infrastructure costs |
| **Time-bound (T)** | The timeframe is one semester, corresponding to the Master's thesis (TFM) |
*Source: Own elaboration.*
## Specific objectives
### OE1: Compare open-source OCR solutions
@@ -115,12 +119,16 @@ class ImageTextDataset:
#### Evaluated Models
**Table 5.** *OCR models evaluated in the initial benchmark.*
| Model | Version | Configuration |
|--------|---------|---------------|
| EasyOCR | - | Languages: ['es', 'en'] |
| PaddleOCR | PP-OCRv5 | server_det + server_rec models |
| DocTR | - | db_resnet50 + sar_resnet31 |
*Source: Own elaboration.*
#### Evaluation Metrics
The `jiwer` library was used to compute:
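For reference, CER and WER are edit-distance ratios; a minimal pure-Python sketch of roughly what `jiwer` computes (the real code uses the library's own functions):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference, prediction):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, prediction) / len(reference)

def wer(reference, prediction):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)
```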
@@ -139,6 +147,8 @@ def evaluate_text(reference, prediction):
#### Selected Hyperparameters
**Table 6.** *Hyperparameters selected for optimization.*
| Parameter | Type | Range/Values | Description |
|-----------|------|---------------|-------------|
| `use_doc_orientation_classify` | Boolean | [True, False] | Document orientation classification |
@@ -149,6 +159,8 @@ def evaluate_text(reference, prediction):
| `text_det_unclip_ratio` | Fixed | 0.0 | Expansion coefficient (fixed) |
| `text_rec_score_thresh` | Continuous | [0.0, 0.7] | Recognition confidence threshold |
*Source: Own elaboration.*
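The ranges above define the sampling space; a dependency-free sketch of how one trial configuration could be drawn (rows hidden by the diff are omitted, and the actual experiment declares these with Ray Tune's sampling primitives):

```python
import random

# Sketch of the search space from Table 6; list = categorical,
# tuple = continuous uniform range, scalar = fixed value.
SEARCH_SPACE = {
    "use_doc_orientation_classify": [True, False],
    "text_rec_score_thresh": (0.0, 0.7),
    "text_det_unclip_ratio": 0.0,
}

def sample_config(space, rng=random):
    """Draws one trial configuration from the space."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, list):      # categorical choice
            cfg[name] = rng.choice(spec)
        elif isinstance(spec, tuple):   # uniform over [lo, hi]
            lo, hi = spec
            cfg[name] = rng.uniform(lo, hi)
        else:                           # fixed
            cfg[name] = spec
    return cfg
```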
#### Ray Tune Configuration
```python
@@ -235,23 +247,31 @@ Y retorna métricas en formato JSON:
#### Hardware
**Table 7.** *Hardware specifications of the development environment.*
| Component | Specification |
|------------|----------------|
| CPU | Intel Core (model to be specified) |
| RAM | 16 GB |
| GPU | Not available (CPU execution) |
| CPU | AMD Ryzen 7 5800H |
| RAM | 16 GB DDR4 |
| GPU | NVIDIA RTX 3060 Laptop (5.66 GB VRAM) |
| Storage | SSD |
*Source: Own elaboration.*
#### Software
**Table 8.** *Software versions used.*
| Component | Version |
|------------|---------|
| Operating System | Windows 10/11 |
| Python | 3.11.9 |
| Operating System | Ubuntu 24.04.3 LTS |
| Python | 3.12.3 |
| PaddleOCR | 3.3.2 |
| PaddlePaddle | 3.2.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |
| Optuna | 4.7.0 |
*Source: Own elaboration.*
### Methodological Limitations


@@ -34,6 +34,11 @@ Se seleccionaron tres soluciones OCR de código abierto representativas del esta
*Source: Own elaboration.*
**Docker images available in the project registry:**
- PaddleOCR: `seryus.ddns.net/unir/paddle-ocr-gpu`, `seryus.ddns.net/unir/paddle-ocr-cpu`
- EasyOCR: `seryus.ddns.net/unir/easyocr-gpu`
- DocTR: `seryus.ddns.net/unir/doctr-gpu`
### Success Criteria
The criteria established to evaluate the solutions were:
@@ -322,7 +327,7 @@ Esta sección ha presentado:
### Introduction
This section describes the PaddleOCR hyperparameter optimization process using Ray Tune with the Optuna search algorithm. The experiments were implemented in the notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`, and the results were stored in `src/raytune_paddle_subproc_results_20251207_192320.csv`.
This section describes the PaddleOCR hyperparameter optimization process using Ray Tune with the Optuna search algorithm. The experiments were implemented in [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py) with the utility library [`src/raytune_ocr.py`](https://github.com/seryus/MastersThesis/blob/main/src/raytune_ocr.py), and the results were stored in [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results).
Hyperparameter optimization is an alternative to traditional fine-tuning that does not require:
- Access to a dedicated GPU
@@ -339,17 +344,17 @@ El experimento se ejecutó en el siguiente entorno:
| Component | Version/Specification |
|------------|------------------------|
| Operating system | Windows 10/11 |
| Python | 3.11.9 |
| Operating system | Ubuntu 24.04.3 LTS |
| Python | 3.12.3 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |
| CPU | Intel Core (multi-core) |
| RAM | 16 GB |
| GPU | Not available (CPU execution) |
| Optuna | 4.7.0 |
| CPU | AMD Ryzen 7 5800H |
| RAM | 16 GB DDR4 |
| GPU | NVIDIA RTX 3060 Laptop (5.66 GB VRAM) |
*Source: outputs of the notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`.*
*Source: execution environment configuration. Results in `src/results/` generated by `src/run_tuning.py`.*
#### Execution Architecture
@@ -613,7 +618,7 @@ Configuración óptima:
| text_det_unclip_ratio | 0.0 | 1.5 | -1.5 (fixed) |
| text_rec_score_thresh | **0.6350** | 0.5 | +0.135 |
*Source: notebook analysis.*
*Source: analysis of [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results) generated by [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py).*
#### Correlation Analysis
@@ -628,7 +633,7 @@ Se calculó la correlación de Pearson entre los parámetros continuos y las mé
| `text_rec_score_thresh` | -0.161 | Weak negative correlation |
| `text_det_unclip_ratio` | NaN | Zero variance (fixed value) |
*Source: notebook analysis.*
*Source: analysis of [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results) generated by [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py).*
**Table 24.** *Correlation of parameters with WER.*
@@ -638,7 +643,7 @@ Se calculó la correlación de Pearson entre los parámetros continuos y las mé
| `text_det_box_thresh` | +0.227 | Weak positive correlation |
| `text_rec_score_thresh` | -0.173 | Weak negative correlation |
*Source: notebook analysis.*
*Source: analysis of [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results) generated by [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py).*
**Key finding**: The `text_det_thresh` parameter shows the strongest correlation (-0.52 with both metrics), indicating that higher values of this threshold tend to reduce the error. This threshold controls which pixels are considered "text" in the detector's probability map.
@@ -653,7 +658,7 @@ El parámetro booleano `textline_orientation` demostró tener el mayor impacto e
| True | 3.76% | 7.12% | 12.73% | 32 |
| False | 12.40% | 14.93% | 21.71% | 32 |
*Source: notebook analysis.*
*Source: analysis of [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results) generated by [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py).*
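The per-group means in the table come from splitting the 64 trials by the boolean flag; a small sketch with hypothetical rows:

```python
from statistics import mean

# Hypothetical per-trial rows; the real 64 trials live under src/results/
rows = [
    {"textline_orientation": True,  "CER": 0.0376},
    {"textline_orientation": True,  "CER": 0.0712},
    {"textline_orientation": False, "CER": 0.1240},
    {"textline_orientation": False, "CER": 0.1493},
]

def mean_cer_by_flag(rows, flag="textline_orientation"):
    """Mean CER per value of a boolean hyperparameter."""
    groups = {}
    for row in rows:
        groups.setdefault(row[flag], []).append(row["CER"])
    return {value: mean(cers) for value, cers in groups.items()}

by_flag = mean_cer_by_flag(rows)
```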
**Interpretation:**
@@ -741,7 +746,7 @@ optimized_config = {
| PaddleOCR (Baseline) | 7.78% | 92.22% | 14.94% | 85.06% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** | **7.62%** | **92.38%** |
*Source: final run in the notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`.*
*Source: final validation. Code in [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py), results in [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results).*
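The reported improvement is the relative reduction with respect to the baseline; a small helper illustrating the computation:

```python
def relative_reduction(baseline, optimized):
    """Relative error reduction as a percentage of the baseline."""
    return (baseline - optimized) / baseline * 100

# CER 7.78% -> 1.49% gives ~80.8-80.9%, depending on rounding of the inputs
cer_gain = relative_reduction(7.78, 1.49)
```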
#### Improvement Metrics
@@ -823,9 +828,9 @@ Esta sección ha presentado:
4. **Final improvement**: CER reduced from 7.78% to 1.49% (an 80.9% reduction)
**Data sources:**
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb`: experiment code
- `src/raytune_paddle_subproc_results_20251207_192320.csv`: results of the 64 trials
- `src/paddle_ocr_tuning.py`: evaluation script
- [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py): main optimization script
- [`src/raytune_ocr.py`](https://github.com/seryus/MastersThesis/blob/main/src/raytune_ocr.py): Ray Tune utility library
- [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results): trial results as CSV
## Discussion and analysis of results
@@ -1066,8 +1071,13 @@ Este capítulo ha presentado el desarrollo completo de la contribución:
**Main result**: The CER < 2% objective was achieved through hyperparameter optimization, without requiring fine-tuning or GPU resources.
**Data sources:**
- `src/raytune_paddle_subproc_results_20251207_192320.csv`: results of the 64 trials
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb`: main experiment notebook
- [`src/run_tuning.py`](https://github.com/seryus/MastersThesis/blob/main/src/run_tuning.py): main optimization script
- [`src/results/`](https://github.com/seryus/MastersThesis/tree/main/src/results): trial results as CSV
**Docker images:**
- `seryus.ddns.net/unir/paddle-ocr-gpu`: PaddleOCR with GPU support
- `seryus.ddns.net/unir/easyocr-gpu`: EasyOCR with GPU support
- `seryus.ddns.net/unir/doctr-gpu`: DocTR with GPU support
### Validation with GPU Acceleration


@@ -10,10 +10,14 @@ Este Trabajo Fin de Máster ha demostrado que es posible mejorar significativame
The main objective of the work was to achieve a CER below 2% on academic documents in Spanish. The results obtained confirm that this objective was met:
**Table 39.** *Fulfillment of the CER objective.*
| Metric | Target | Result |
|---------|----------|-----------|
| CER | < 2% | **1.49%** |
*Source: Own elaboration.*
### Specific Conclusions
**Regarding OE1 (comparison of OCR solutions)**:


@@ -48,6 +48,8 @@ MastersThesis/
### Development System
**Table A1.** *Development system specifications.*
| Component | Specification |
|------------|----------------|
| Operating System | Ubuntu 24.04.3 LTS |
@@ -56,20 +58,30 @@ MastersThesis/
| GPU | NVIDIA RTX 3060 Laptop (5.66 GB VRAM) |
| CUDA | 12.4 |
*Source: Own elaboration.*
### Dependencies
**Table A2.** *Project dependencies.*
| Component | Version |
|------------|---------|
| Python | 3.11 |
| Docker | 24+ |
| Python | 3.12.3 |
| Docker | 29.1.5 |
| NVIDIA Container Toolkit | Required for GPU |
| Ray | 2.52+ |
| Optuna | 4.6+ |
| Ray | 2.52.1 |
| Optuna | 4.7.0 |
*Source: Own elaboration.*
## A.4 Instructions for Running the OCR Services
### PaddleOCR (Port 8002)
**Docker images:**
- GPU: `seryus.ddns.net/unir/paddle-ocr-gpu`
- CPU: `seryus.ddns.net/unir/paddle-ocr-cpu`
```bash
cd src/paddle_ocr
@@ -82,6 +94,8 @@ docker compose -f docker-compose.cpu-registry.yml up -d
### DocTR (Port 8003)
**Docker image:** `seryus.ddns.net/unir/doctr-gpu`
```bash
cd src/doctr_service
@@ -91,6 +105,8 @@ docker compose up -d
### EasyOCR (Port 8002)
**Docker image:** `seryus.ddns.net/unir/easyocr-gpu`
```bash
cd src/easyocr_service
@@ -165,29 +181,37 @@ analyze_results(results, prefix='raytune_paddle', config_keys=PADDLE_OCR_CONFIG_
### Services and Ports
**Table A3.** *Docker services and ports.*
| Service | Port | Tuning Script |
|----------|--------|------------------|
| PaddleOCR | 8002 | `paddle_ocr_payload` |
| DocTR | 8003 | `doctr_payload` |
| EasyOCR | 8002 | `easyocr_payload` |
*Source: Own elaboration.*
## A.7 Performance Metrics
Detailed results of the evaluations and hyperparameter tuning are available in:
- [General Metrics](metrics/metrics.md) - comparison of the three services
- [PaddleOCR](metrics/metrics_paddle.md) - best accuracy (7.72% CER)
- [PaddleOCR](metrics/metrics_paddle.md) - best accuracy (7.76% CER baseline, **1.49% optimized**)
- [DocTR](metrics/metrics_doctr.md) - fastest (0.50 s/page)
- [EasyOCR](metrics/metrics_easyocr.md) - intermediate trade-off
### Results Summary
**Table A4.** *Benchmark results summary per service.*
| Service | Baseline CER | Tuned CER | Improvement |
|----------|----------|--------------|--------|
| **PaddleOCR** | 8.85% | **7.72%** | 12.8% |
| DocTR | 12.06% | 12.07% | 0% |
| EasyOCR | 11.23% | 11.14% | 0.8% |
*Source: Own elaboration.*
## A.8 License
The code is distributed under the MIT license.


@@ -1,74 +1,153 @@
# Running Notebooks in Background
## Quick: Check Ray Tune Progress
```bash
# Is papermill still running?
ps aux | grep papermill | grep -v grep
# View live log
tail -f papermill.log
# Find latest Ray Tune run and count completed trials
LATEST=$(ls -td ~/ray_results/trainable_* 2>/dev/null | head -1)
echo "Run: $LATEST"
COMPLETED=$(find "$LATEST" -name "result.json" -size +0 2>/dev/null | wc -l)
TOTAL=$(ls -d "$LATEST"/trainable_*/ 2>/dev/null | wc -l)
echo "Completed: $COMPLETED / $TOTAL"
# Check workers are healthy
for port in 8001 8002 8003; do
status=$(curl -s "localhost:$port/health" 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','down'))" 2>/dev/null || echo "down")
echo "Worker $port: $status"
done
# Show best result so far
if [ "$COMPLETED" -gt 0 ]; then
find "$LATEST" -name "result.json" -size +0 -exec cat {} \; 2>/dev/null | \
python3 -c "import sys,json; results=[json.loads(l) for l in sys.stdin if l.strip()]; best=min(results,key=lambda x:x.get('CER',999)); print(f'Best CER: {best[\"CER\"]:.4f}, WER: {best[\"WER\"]:.4f}')" 2>/dev/null
fi
```
---
## Option 1: Papermill (Recommended)
Runs notebooks directly without conversion.
```bash
pip install papermill
nohup papermill <notebook>.ipynb output.ipynb > papermill.log 2>&1 &
```
Monitor:
```bash
tail -f papermill.log
```
## Option 2: Convert to Python Script
```bash
jupyter nbconvert --to script <notebook>.ipynb
nohup python <notebook>.py > output.log 2>&1 &
```
**Note:** `%pip install` magic commands must be removed manually before running the converted `.py` script.
## Important Notes
- Ray Tune notebooks require the OCR service running first (Docker)
- For Ray workers, imports must be inside trainable functions
## Example: Ray Tune PaddleOCR
```bash
# 1. Start OCR service
cd src/paddle_ocr && docker compose up -d ocr-cpu
# 2. Run notebook with papermill
cd src
nohup papermill paddle_ocr_raytune_rest.ipynb output_raytune.ipynb > papermill.log 2>&1 &
# 3. Monitor
tail -f papermill.log
```
# OCR Hyperparameter Tuning with Ray Tune
This directory contains the Docker setup for running automated hyperparameter optimization on OCR services using Ray Tune with Optuna.
## Prerequisites
- Docker with NVIDIA GPU support (`nvidia-container-toolkit`)
- NVIDIA GPU with CUDA support
## Quick Start
```bash
cd src
# Start PaddleOCR service and run tuning (images pulled from registry)
docker compose -f docker-compose.tuning.paddle.yml up -d paddle-ocr-gpu
docker compose -f docker-compose.tuning.paddle.yml run raytune --service paddle --samples 64
```
## Available Services
| Service | Port | Compose File |
|---------|------|--------------|
| PaddleOCR | 8002 | `docker-compose.tuning.paddle.yml` |
| DocTR | 8003 | `docker-compose.tuning.doctr.yml` |
| EasyOCR | 8002 | `docker-compose.tuning.easyocr.yml` |
**Note:** PaddleOCR and EasyOCR both use port 8002. Run them separately.
## Usage Examples
### PaddleOCR Tuning
```bash
# Start service
docker compose -f docker-compose.tuning.paddle.yml up -d paddle-ocr-gpu
# Wait for the health check to pass (verify with:)
curl http://localhost:8002/health
# Run tuning (64 samples)
docker compose -f docker-compose.tuning.paddle.yml run raytune --service paddle --samples 64
# Stop service
docker compose -f docker-compose.tuning.paddle.yml down
```
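The "wait for the health check" step can be automated rather than polled by hand; a minimal Python sketch (the injectable `probe` argument is hypothetical, added so the loop can be tested without a live service):

```python
import time
import urllib.request
import urllib.error

def wait_for_health(url, probe=None, retries=10, delay=5):
    """Polls a /health endpoint until it answers; returns the attempt index."""
    if probe is None:
        def probe(u):
            try:
                urllib.request.urlopen(u, timeout=5)
                return True
            except (urllib.error.URLError, OSError):
                return False
    for attempt in range(retries):
        if probe(url):
            return attempt
        time.sleep(delay)
    raise TimeoutError(f"{url} not healthy after {retries} attempts")
```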
### DocTR Tuning
```bash
docker compose -f docker-compose.tuning.doctr.yml up -d doctr-gpu
curl http://localhost:8003/health
docker compose -f docker-compose.tuning.doctr.yml run raytune --service doctr --samples 64
docker compose -f docker-compose.tuning.doctr.yml down
```
### EasyOCR Tuning
```bash
docker compose -f docker-compose.tuning.easyocr.yml up -d easyocr-gpu
curl http://localhost:8002/health
docker compose -f docker-compose.tuning.easyocr.yml run raytune --service easyocr --samples 64
docker compose -f docker-compose.tuning.easyocr.yml down
```
### Run Multiple Services (PaddleOCR + DocTR)
```bash
# Start both services
docker compose -f docker-compose.tuning.yml up -d paddle-ocr-gpu doctr-gpu
# Run tuning for each
docker compose -f docker-compose.tuning.yml run raytune --service paddle --samples 64
docker compose -f docker-compose.tuning.yml run raytune --service doctr --samples 64
# Stop all
docker compose -f docker-compose.tuning.yml down
```
## Command Line Options
```bash
docker compose -f <compose-file> run raytune --service <service> --samples <n>
```
| Option | Description | Default |
|--------|-------------|---------|
| `--service` | OCR service: `paddle`, `doctr`, `easyocr` | Required |
| `--samples` | Number of hyperparameter trials | 64 |
## Output
Results are saved to `src/results/` as CSV files:
- `raytune_paddle_results_<timestamp>.csv`
- `raytune_doctr_results_<timestamp>.csv`
- `raytune_easyocr_results_<timestamp>.csv`
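The naming pattern suggests the CSV names are built from the service and a timestamp; a hypothetical sketch inferred from the examples above:

```python
from datetime import datetime

def results_filename(service):
    """Builds the timestamped CSV name (pattern inferred, not the actual code)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"raytune_{service}_results_{stamp}.csv"
```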
## Directory Structure
```
src/
├── docker-compose.tuning.yml # All services (PaddleOCR + DocTR)
├── docker-compose.tuning.paddle.yml # PaddleOCR only
├── docker-compose.tuning.doctr.yml # DocTR only
├── docker-compose.tuning.easyocr.yml # EasyOCR only
├── raytune/
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── raytune_ocr.py
│ └── run_tuning.py
├── dataset/ # Input images and ground truth
├── results/ # Output CSV files
└── debugset/ # Debug output
```
## Docker Images
All images are pre-built and pulled from the registry:
- `seryus.ddns.net/unir/raytune:latest` - Ray Tune tuning service
- `seryus.ddns.net/unir/paddle-ocr-gpu:latest` - PaddleOCR GPU
- `seryus.ddns.net/unir/doctr-gpu:latest` - DocTR GPU
- `seryus.ddns.net/unir/easyocr-gpu:latest` - EasyOCR GPU
### Build locally (development)
```bash
# Build raytune image locally
docker build -t seryus.ddns.net/unir/raytune:latest ./raytune
```
## Troubleshooting
### Service not ready
Wait for the health check to pass before running tuning:
```bash
# Check service health
curl http://localhost:8002/health
# Expected: {"status": "ok", "model_loaded": true, ...}
```
### GPU not detected
Ensure `nvidia-container-toolkit` is installed:
```bash
nvidia-smi # Should show your GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base nvidia-smi
```
### Port already in use
Stop any running OCR services:
```bash
docker compose -f docker-compose.tuning.paddle.yml down
docker compose -f docker-compose.tuning.easyocr.yml down
```


@@ -0,0 +1,50 @@
# docker-compose.tuning.doctr.yml - Ray Tune with DocTR GPU
# Usage:
# docker compose -f docker-compose.tuning.doctr.yml up -d doctr-gpu
# docker compose -f docker-compose.tuning.doctr.yml run raytune --service doctr --samples 64
# docker compose -f docker-compose.tuning.doctr.yml down
services:
raytune:
image: seryus.ddns.net/unir/raytune:latest
command: ["--service", "doctr", "--host", "doctr-gpu", "--port", "8000", "--samples", "64"]
volumes:
- ./results:/app/results:rw
environment:
- PYTHONUNBUFFERED=1
depends_on:
doctr-gpu:
condition: service_healthy
doctr-gpu:
image: seryus.ddns.net/unir/doctr-gpu:latest
container_name: doctr-gpu-tuning
ports:
- "8003:8000"
volumes:
- ./dataset:/app/dataset:ro
- ./debugset:/app/debugset:rw
- doctr-cache:/root/.cache/doctr
environment:
- PYTHONUNBUFFERED=1
- CUDA_VISIBLE_DEVICES=0
- DOCTR_DET_ARCH=db_resnet50
- DOCTR_RECO_ARCH=crnn_vgg16_bn
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 180s
volumes:
doctr-cache:
name: doctr-model-cache


@@ -0,0 +1,51 @@
# docker-compose.tuning.easyocr.yml - Ray Tune with EasyOCR GPU
# Usage:
# docker compose -f docker-compose.tuning.easyocr.yml up -d easyocr-gpu
# docker compose -f docker-compose.tuning.easyocr.yml run raytune --service easyocr --samples 64
# docker compose -f docker-compose.tuning.easyocr.yml down
#
# Note: EasyOCR uses port 8002 (same as PaddleOCR). Cannot run simultaneously.
services:
raytune:
image: seryus.ddns.net/unir/raytune:latest
command: ["--service", "easyocr", "--host", "easyocr-gpu", "--port", "8000", "--samples", "64"]
volumes:
- ./results:/app/results:rw
environment:
- PYTHONUNBUFFERED=1
depends_on:
easyocr-gpu:
condition: service_healthy
easyocr-gpu:
image: seryus.ddns.net/unir/easyocr-gpu:latest
container_name: easyocr-gpu-tuning
ports:
- "8002:8000"
volumes:
- ./dataset:/app/dataset:ro
- ./debugset:/app/debugset:rw
- easyocr-cache:/root/.EasyOCR
environment:
- PYTHONUNBUFFERED=1
- CUDA_VISIBLE_DEVICES=0
- EASYOCR_LANGUAGES=es,en
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 120s
volumes:
easyocr-cache:
name: easyocr-model-cache


@@ -0,0 +1,50 @@
# docker-compose.tuning.paddle.yml - Ray Tune with PaddleOCR GPU
# Usage:
# docker compose -f docker-compose.tuning.paddle.yml up -d paddle-ocr-gpu
# docker compose -f docker-compose.tuning.paddle.yml run raytune --service paddle --samples 64
# docker compose -f docker-compose.tuning.paddle.yml down
services:
raytune:
image: seryus.ddns.net/unir/raytune:latest
command: ["--service", "paddle", "--host", "paddle-ocr-gpu", "--port", "8000", "--samples", "64"]
volumes:
- ./results:/app/results:rw
environment:
- PYTHONUNBUFFERED=1
depends_on:
paddle-ocr-gpu:
condition: service_healthy
paddle-ocr-gpu:
image: seryus.ddns.net/unir/paddle-ocr-gpu:latest
container_name: paddle-ocr-gpu-tuning
ports:
- "8002:8000"
volumes:
- ./dataset:/app/dataset:ro
- ./debugset:/app/debugset:rw
- paddlex-cache:/root/.paddlex
environment:
- PYTHONUNBUFFERED=1
- CUDA_VISIBLE_DEVICES=0
- PADDLE_DET_MODEL=PP-OCRv5_mobile_det
- PADDLE_REC_MODEL=PP-OCRv5_mobile_rec
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
paddlex-cache:
name: paddlex-model-cache


@@ -0,0 +1,82 @@
# docker-compose.tuning.yml - Ray Tune with all OCR services (PaddleOCR + DocTR)
# Usage:
# docker compose -f docker-compose.tuning.yml up -d paddle-ocr-gpu doctr-gpu
# docker compose -f docker-compose.tuning.yml run raytune --service paddle --samples 64
# docker compose -f docker-compose.tuning.yml run raytune --service doctr --samples 64
# docker compose -f docker-compose.tuning.yml down
#
# Note: EasyOCR uses port 8002 (same as PaddleOCR). Use docker-compose.tuning.easyocr.yml separately.
services:
  raytune:
    image: seryus.ddns.net/unir/raytune:latest
    network_mode: host
    shm_size: '5gb'
    volumes:
      - ./results:/app/results:rw
    environment:
      - PYTHONUNBUFFERED=1

  paddle-ocr-gpu:
    image: seryus.ddns.net/unir/paddle-ocr-gpu:latest
    container_name: paddle-ocr-gpu-tuning
    ports:
      - "8002:8000"
    volumes:
      - ./dataset:/app/dataset:ro
      - ./debugset:/app/debugset:rw
      - paddlex-cache:/root/.paddlex
    environment:
      - PYTHONUNBUFFERED=1
      - CUDA_VISIBLE_DEVICES=0
      - PADDLE_DET_MODEL=PP-OCRv5_mobile_det
      - PADDLE_REC_MODEL=PP-OCRv5_mobile_rec
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  doctr-gpu:
    image: seryus.ddns.net/unir/doctr-gpu:latest
    container_name: doctr-gpu-tuning
    ports:
      - "8003:8000"
    volumes:
      - ./dataset:/app/dataset:ro
      - ./debugset:/app/debugset:rw
      - doctr-cache:/root/.cache/doctr
    environment:
      - PYTHONUNBUFFERED=1
      - CUDA_VISIBLE_DEVICES=0
      - DOCTR_DET_ARCH=db_resnet50
      - DOCTR_RECO_ARCH=crnn_vgg16_bn
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

volumes:
  paddlex-cache:
    name: paddlex-model-cache
  doctr-cache:
    name: doctr-model-cache

src/raytune/Dockerfile Normal file

@@ -0,0 +1,18 @@
FROM python:3.12-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application files
COPY raytune_ocr.py .
COPY run_tuning.py .
# Create results directory
RUN mkdir -p /app/results
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["python", "run_tuning.py"]

src/raytune/README.md Normal file

@@ -0,0 +1,131 @@
# Ray Tune OCR Hyperparameter Optimization
Docker-based hyperparameter tuning for OCR services using Ray Tune with Optuna search.
## Structure
```
raytune/
├── Dockerfile # Python 3.12-slim with Ray Tune + Optuna
├── requirements.txt # Dependencies
├── raytune_ocr.py # Shared utilities and search spaces
├── run_tuning.py # CLI entry point
└── README.md
```
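The bridge between the tuner and each OCR service is a per-service payload function in `raytune_ocr.py`: it maps a Ray Tune trial config onto the JSON body of the worker's `/evaluate` endpoint, falling back to the service defaults for any parameter the trial does not set. A minimal sketch of the pattern, trimmed to a few parameters (the key names and the fixed pages 5-10 tuning window follow `paddle_ocr_payload`):

```python
from typing import Dict

def paddle_payload_sketch(config: Dict) -> Dict:
    """Map a trial config onto the PaddleOCR /evaluate payload,
    using the service defaults for any unset key."""
    return {
        "pdf_folder": "/app/dataset",
        "text_det_thresh": config.get("text_det_thresh", 0.0),
        "text_det_box_thresh": config.get("text_det_box_thresh", 0.0),
        "text_rec_score_thresh": config.get("text_rec_score_thresh", 0.0),
        "start_page": 5,   # fixed tuning window: pages 5-10 of the first doc
        "end_page": 10,
        "save_output": False,
    }

payload = paddle_payload_sketch({"text_det_thresh": 0.3})
```

`create_trainable()` takes such a function plus a list of worker ports and wraps the POST/report cycle around it, so adding a new OCR backend only requires a new payload function and search space.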
## Quick Start
```bash
cd src
# Build the raytune image
docker compose -f docker-compose.tuning.paddle.yml build raytune
# Or pull from registry
docker pull seryus.ddns.net/unir/raytune:latest
```
## Usage
### PaddleOCR Tuning
```bash
# Start PaddleOCR service
docker compose -f docker-compose.tuning.paddle.yml up -d paddle-ocr-gpu
# Wait for health check, then run tuning
docker compose -f docker-compose.tuning.paddle.yml run raytune --service paddle --samples 64
# Stop when done
docker compose -f docker-compose.tuning.paddle.yml down
```
### DocTR Tuning
```bash
docker compose -f docker-compose.tuning.doctr.yml up -d doctr-gpu
docker compose -f docker-compose.tuning.doctr.yml run raytune --service doctr --samples 64
docker compose -f docker-compose.tuning.doctr.yml down
```
### EasyOCR Tuning
```bash
# Note: EasyOCR uses port 8002 (same as PaddleOCR). Cannot run simultaneously.
docker compose -f docker-compose.tuning.easyocr.yml up -d easyocr-gpu
docker compose -f docker-compose.tuning.easyocr.yml run raytune --service easyocr --samples 64
docker compose -f docker-compose.tuning.easyocr.yml down
```
## CLI Options
```
python run_tuning.py --service {paddle,doctr,easyocr} --samples N
```
| Option    | Description                     | Default   |
|-----------|---------------------------------|-----------|
| --service | OCR service to tune (required)  | -         |
| --host    | OCR service host                | localhost |
| --port    | OCR service port                | 8000      |
| --samples | Number of hyperparameter trials | 64        |
## Search Spaces
### PaddleOCR
- `use_doc_orientation_classify`: [True, False]
- `use_doc_unwarping`: [True, False]
- `textline_orientation`: [True, False]
- `text_det_thresh`: uniform(0.0, 0.7)
- `text_det_box_thresh`: uniform(0.0, 0.7)
- `text_det_unclip_ratio`: fixed at 0.0 (`tune.choice([0.0])`)
- `text_rec_score_thresh`: uniform(0.0, 0.7)
### DocTR
- `assume_straight_pages`: [True, False]
- `straighten_pages`: [True, False]
- `preserve_aspect_ratio`: [True, False]
- `symmetric_pad`: [True, False]
- `disable_page_orientation`: [True, False]
- `disable_crop_orientation`: [True, False]
- `resolve_lines`: [True, False]
- `resolve_blocks`: [True, False]
- `paragraph_break`: uniform(0.01, 0.1)
### EasyOCR
- `text_threshold`: uniform(0.3, 0.9)
- `low_text`: uniform(0.2, 0.6)
- `link_threshold`: uniform(0.2, 0.6)
- `slope_ths`: uniform(0.0, 0.3)
- `ycenter_ths`: uniform(0.3, 1.0)
- `height_ths`: uniform(0.3, 1.0)
- `width_ths`: uniform(0.3, 1.0)
- `add_margin`: uniform(0.0, 0.3)
- `contrast_ths`: uniform(0.05, 0.3)
- `adjust_contrast`: uniform(0.3, 0.8)
- `decoder`: ["greedy", "beamsearch"]
- `beamWidth`: [3, 5, 7, 10]
- `min_size`: [5, 10, 15, 20]
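These spaces are plain dicts of `tune.uniform`/`tune.choice` samplers (see `raytune_ocr.py`). For eyeballing a space before launching trials, the sampling can be mimicked with the standard library alone; this is a rough stand-in for inspection only, not the Ray Tune API:

```python
import random

# Stand-in for tune.uniform / tune.choice, for quick inspection only.
def sample(space: dict) -> dict:
    out = {}
    for key, (kind, args) in space.items():
        if kind == "uniform":
            out[key] = random.uniform(*args)   # continuous range
        else:  # "choice"
            out[key] = random.choice(args)     # categorical
    return out

# Subset of the PaddleOCR space above, in (kind, args) form.
paddle_space = {
    "use_doc_unwarping": ("choice", [True, False]),
    "text_det_thresh": ("uniform", (0.0, 0.7)),
    "text_rec_score_thresh": ("uniform", (0.0, 0.7)),
}

cfg = sample(paddle_space)
```

Each trial receives one such `cfg` dict; in the real run, Optuna (not uniform random) proposes the values.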
## Output
Results are saved to `src/results/` as CSV files:
- `raytune_paddle_results_YYYYMMDD_HHMMSS.csv`
- `raytune_doctr_results_YYYYMMDD_HHMMSS.csv`
- `raytune_easyocr_results_YYYYMMDD_HHMMSS.csv`
Each row contains:
- Configuration parameters (prefixed with `config/`)
- Metrics: CER, WER, TIME, PAGES, TIME_PER_PAGE
- Worker URL used for the trial
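Once a CSV is on disk, picking the best trial is a one-liner with pandas. A sketch with synthetic rows (column names follow the format above; real files carry many more `config/` columns):

```python
import pandas as pd

# Synthetic stand-in for a raytune_*_results_*.csv file.
df = pd.DataFrame({
    "config/text_det_thresh": [0.10, 0.45, 0.30],
    "CER": [0.021, 0.035, 0.018],
    "WER": [0.060, 0.090, 0.055],
    "TIME_PER_PAGE": [1.2, 1.1, 1.3],
})

best = df.loc[df["CER"].idxmin()]  # same selection analyze_results() uses
print(f"best CER={best['CER']:.3f} at thresh={best['config/text_det_thresh']:.2f}")
# prints: best CER=0.018 at thresh=0.30
```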
## Network Mode
The raytune container uses `network_mode: host` to access OCR services on localhost ports:
- PaddleOCR: port 8002
- DocTR: port 8003
- EasyOCR: port 8002 (conflicts with PaddleOCR)
## Dependencies
- ray[tune]==2.52.1
- optuna==4.7.0
- requests>=2.28.0
- pandas>=2.0.0

src/raytune/raytune_ocr.py Normal file

@@ -0,0 +1,371 @@
# raytune_ocr.py
# Shared Ray Tune utilities for OCR hyperparameter optimization
#
# Usage:
# from raytune_ocr import check_workers, create_trainable, run_tuner, analyze_results
#
# Environment variables:
# OCR_HOST: Host for OCR services (default: localhost)
import os
import time
from datetime import datetime
from typing import Any, Callable, Dict, List

import pandas as pd
import requests

import ray
from ray import tune
from ray.tune.search.optuna import OptunaSearch


def check_workers(
    ports: List[int],
    service_name: str = "OCR",
    timeout: int = 180,
    interval: int = 5,
) -> List[str]:
    """
    Wait for workers to be fully ready (model + dataset loaded) and return healthy URLs.

    Args:
        ports: List of port numbers to check
        service_name: Name for error messages
        timeout: Max seconds to wait for each worker
        interval: Seconds between retries

    Returns:
        List of healthy worker URLs

    Raises:
        RuntimeError if no healthy workers found after timeout
    """
    host = os.environ.get("OCR_HOST", "localhost")
    worker_urls = [f"http://{host}:{port}" for port in ports]
    healthy_workers = []
    for url in worker_urls:
        print(f"Waiting for {url}...")
        start = time.time()
        while time.time() - start < timeout:
            try:
                health = requests.get(f"{url}/health", timeout=10).json()
                model_ok = health.get('model_loaded', False)
                dataset_ok = health.get('dataset_loaded', False)
                if health.get('status') == 'ok' and model_ok and dataset_ok:
                    gpu = health.get('gpu_name', 'CPU')
                    print(f"{url}: ready ({gpu})")
                    healthy_workers.append(url)
                    break
                elapsed = int(time.time() - start)
                print(f"  [{elapsed}s] model={model_ok} dataset={dataset_ok}")
            except requests.exceptions.RequestException:
                elapsed = int(time.time() - start)
                print(f"  [{elapsed}s] not reachable")
            time.sleep(interval)
        else:
            print(f"{url}: timeout after {timeout}s")
    if not healthy_workers:
        raise RuntimeError(
            f"No healthy {service_name} workers found.\n"
            f"Checked ports: {ports}"
        )
    print(f"\n{len(healthy_workers)}/{len(worker_urls)} workers ready\n")
    return healthy_workers


def create_trainable(ports: List[int], payload_fn: Callable[[Dict], Dict]) -> Callable:
    """
    Factory to create a trainable function for Ray Tune.

    Args:
        ports: List of worker ports for load balancing
        payload_fn: Function that takes config dict and returns API payload dict

    Returns:
        Trainable function for Ray Tune

    Note:
        Ray Tune 2.x API: tune.report(metrics_dict) - pass dict directly, NOT kwargs.
        See: https://docs.ray.io/en/latest/tune/api/doc/ray.tune.report.html
    """
    def trainable(config):
        # Imports stay inside the function because Ray serializes it to workers.
        import os
        import random

        import requests
        from ray.tune import report  # Ray 2.x: report(dict), not report(**kwargs)

        host = os.environ.get("OCR_HOST", "localhost")
        api_url = f"http://{host}:{random.choice(ports)}"
        payload = payload_fn(config)
        try:
            response = requests.post(f"{api_url}/evaluate", json=payload, timeout=None)
            response.raise_for_status()
            metrics = response.json()
            metrics["worker"] = api_url
            report(metrics)  # Ray 2.x API: pass dict directly
        except Exception as e:
            report({  # Ray 2.x API: pass dict directly
                "CER": 1.0,
                "WER": 1.0,
                "TIME": 0.0,
                "PAGES": 0,
                "TIME_PER_PAGE": 0,
                "worker": api_url,
                "ERROR": str(e)[:500],
            })
    return trainable


def run_tuner(
    trainable: Callable,
    search_space: Dict[str, Any],
    num_samples: int = 64,
    num_workers: int = 1,
    metric: str = "CER",
    mode: str = "min",
) -> tune.ResultGrid:
    """
    Initialize Ray and run hyperparameter tuning.

    Args:
        trainable: Trainable function from create_trainable()
        search_space: Dict of parameter names to tune.* search spaces
        num_samples: Number of trials to run
        num_workers: Max concurrent trials
        metric: Metric to optimize
        mode: "min" or "max"

    Returns:
        Ray Tune ResultGrid
    """
    ray.init(
        ignore_reinit_error=True,
        include_dashboard=False,
        configure_logging=False,
        _metrics_export_port=0,  # Disable metrics export to avoid connection warnings
    )
    print(f"Ray Tune ready (version: {ray.__version__})")
    tuner = tune.Tuner(
        trainable,
        tune_config=tune.TuneConfig(
            metric=metric,
            mode=mode,
            search_alg=OptunaSearch(),
            num_samples=num_samples,
            max_concurrent_trials=num_workers,
        ),
        param_space=search_space,
    )
    return tuner.fit()


def analyze_results(
    results: tune.ResultGrid,
    output_folder: str = "results",
    prefix: str = "raytune",
    config_keys: List[str] = None,
) -> pd.DataFrame:
    """
    Analyze and save tuning results.

    Args:
        results: Ray Tune ResultGrid
        output_folder: Directory to save CSV
        prefix: Filename prefix
        config_keys: List of config keys to show in best result (without 'config/' prefix)

    Returns:
        Results DataFrame
    """
    os.makedirs(output_folder, exist_ok=True)
    df = results.get_dataframe()

    # Save to CSV
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{prefix}_results_{timestamp}.csv"
    filepath = os.path.join(output_folder, filename)
    df.to_csv(filepath, index=False)
    print(f"Results saved: {filepath}")

    # Best configuration
    best = df.loc[df["CER"].idxmin()]
    print(f"\nBest CER: {best['CER']:.6f}")
    print(f"Best WER: {best['WER']:.6f}")
    if config_keys:
        print("\nOptimal Configuration:")
        for key in config_keys:
            col = f"config/{key}"
            if col in best:
                val = best[col]
                if isinstance(val, float):
                    print(f"  {key}: {val:.4f}")
                else:
                    print(f"  {key}: {val}")
    return df


def correlation_analysis(df: pd.DataFrame, param_keys: List[str]) -> None:
    """
    Print correlation of numeric parameters with CER/WER.

    Args:
        df: Results DataFrame
        param_keys: List of config keys (without 'config/' prefix)
    """
    param_cols = [f"config/{k}" for k in param_keys if f"config/{k}" in df.columns]
    numeric_cols = [c for c in param_cols if df[c].dtype in ('float64', 'int64')]
    if not numeric_cols:
        print("No numeric parameters for correlation analysis")
        return
    corr_cer = df[numeric_cols + ["CER"]].corr()["CER"].sort_values(ascending=False)
    corr_wer = df[numeric_cols + ["WER"]].corr()["WER"].sort_values(ascending=False)
    print("Correlation with CER:")
    print(corr_cer)
    print("\nCorrelation with WER:")
    print(corr_wer)


# =============================================================================
# OCR-specific payload functions
# =============================================================================
def paddle_ocr_payload(config: Dict) -> Dict:
    """Create payload for PaddleOCR API. Uses pages 5-10 (first doc) for tuning."""
    return {
        "pdf_folder": "/app/dataset",
        "use_doc_orientation_classify": config.get("use_doc_orientation_classify", False),
        "use_doc_unwarping": config.get("use_doc_unwarping", False),
        "textline_orientation": config.get("textline_orientation", True),
        "text_det_thresh": config.get("text_det_thresh", 0.0),
        "text_det_box_thresh": config.get("text_det_box_thresh", 0.0),
        "text_det_unclip_ratio": config.get("text_det_unclip_ratio", 1.5),
        "text_rec_score_thresh": config.get("text_rec_score_thresh", 0.0),
        "start_page": 5,
        "end_page": 10,
        "save_output": False,
    }


def doctr_payload(config: Dict) -> Dict:
    """Create payload for DocTR API. Uses pages 5-10 (first doc) for tuning."""
    return {
        "pdf_folder": "/app/dataset",
        "assume_straight_pages": config.get("assume_straight_pages", True),
        "straighten_pages": config.get("straighten_pages", False),
        "preserve_aspect_ratio": config.get("preserve_aspect_ratio", True),
        "symmetric_pad": config.get("symmetric_pad", True),
        "disable_page_orientation": config.get("disable_page_orientation", False),
        "disable_crop_orientation": config.get("disable_crop_orientation", False),
        "resolve_lines": config.get("resolve_lines", True),
        "resolve_blocks": config.get("resolve_blocks", False),
        "paragraph_break": config.get("paragraph_break", 0.035),
        "start_page": 5,
        "end_page": 10,
        "save_output": False,
    }


def easyocr_payload(config: Dict) -> Dict:
    """Create payload for EasyOCR API. Uses pages 5-10 (first doc) for tuning."""
    return {
        "pdf_folder": "/app/dataset",
        "text_threshold": config.get("text_threshold", 0.7),
        "low_text": config.get("low_text", 0.4),
        "link_threshold": config.get("link_threshold", 0.4),
        "slope_ths": config.get("slope_ths", 0.1),
        "ycenter_ths": config.get("ycenter_ths", 0.5),
        "height_ths": config.get("height_ths", 0.5),
        "width_ths": config.get("width_ths", 0.5),
        "add_margin": config.get("add_margin", 0.1),
        "contrast_ths": config.get("contrast_ths", 0.1),
        "adjust_contrast": config.get("adjust_contrast", 0.5),
        "decoder": config.get("decoder", "greedy"),
        "beamWidth": config.get("beamWidth", 5),
        "min_size": config.get("min_size", 10),
        "start_page": 5,
        "end_page": 10,
        "save_output": False,
    }


# =============================================================================
# Search spaces
# =============================================================================
PADDLE_OCR_SEARCH_SPACE = {
    "use_doc_orientation_classify": tune.choice([True, False]),
    "use_doc_unwarping": tune.choice([True, False]),
    "textline_orientation": tune.choice([True, False]),
    "text_det_thresh": tune.uniform(0.0, 0.7),
    "text_det_box_thresh": tune.uniform(0.0, 0.7),
    "text_det_unclip_ratio": tune.choice([0.0]),
    "text_rec_score_thresh": tune.uniform(0.0, 0.7),
}

DOCTR_SEARCH_SPACE = {
    "assume_straight_pages": tune.choice([True, False]),
    "straighten_pages": tune.choice([True, False]),
    "preserve_aspect_ratio": tune.choice([True, False]),
    "symmetric_pad": tune.choice([True, False]),
    "disable_page_orientation": tune.choice([True, False]),
    "disable_crop_orientation": tune.choice([True, False]),
    "resolve_lines": tune.choice([True, False]),
    "resolve_blocks": tune.choice([True, False]),
    "paragraph_break": tune.uniform(0.01, 0.1),
}

EASYOCR_SEARCH_SPACE = {
    "text_threshold": tune.uniform(0.3, 0.9),
    "low_text": tune.uniform(0.2, 0.6),
    "link_threshold": tune.uniform(0.2, 0.6),
    "slope_ths": tune.uniform(0.0, 0.3),
    "ycenter_ths": tune.uniform(0.3, 1.0),
    "height_ths": tune.uniform(0.3, 1.0),
    "width_ths": tune.uniform(0.3, 1.0),
    "add_margin": tune.uniform(0.0, 0.3),
    "contrast_ths": tune.uniform(0.05, 0.3),
    "adjust_contrast": tune.uniform(0.3, 0.8),
    "decoder": tune.choice(["greedy", "beamsearch"]),
    "beamWidth": tune.choice([3, 5, 7, 10]),
    "min_size": tune.choice([5, 10, 15, 20]),
}

# =============================================================================
# Config keys for results display
# =============================================================================
PADDLE_OCR_CONFIG_KEYS = [
    "use_doc_orientation_classify", "use_doc_unwarping", "textline_orientation",
    "text_det_thresh", "text_det_box_thresh", "text_det_unclip_ratio", "text_rec_score_thresh",
]

DOCTR_CONFIG_KEYS = [
    "assume_straight_pages", "straighten_pages", "preserve_aspect_ratio", "symmetric_pad",
    "disable_page_orientation", "disable_crop_orientation", "resolve_lines", "resolve_blocks",
    "paragraph_break",
]

EASYOCR_CONFIG_KEYS = [
    "text_threshold", "low_text", "link_threshold", "slope_ths", "ycenter_ths",
    "height_ths", "width_ths", "add_margin", "contrast_ths", "adjust_contrast",
    "decoder", "beamWidth", "min_size",
]

src/raytune/requirements.txt Normal file

@@ -0,0 +1,4 @@
ray[tune]==2.52.1
optuna==4.7.0
requests>=2.28.0
pandas>=2.0.0

src/raytune/run_tuning.py Normal file

@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""Run hyperparameter tuning for OCR services."""
import argparse
import os

from raytune_ocr import (
    check_workers, create_trainable, run_tuner, analyze_results,
    paddle_ocr_payload, doctr_payload, easyocr_payload,
    PADDLE_OCR_SEARCH_SPACE, DOCTR_SEARCH_SPACE, EASYOCR_SEARCH_SPACE,
    PADDLE_OCR_CONFIG_KEYS, DOCTR_CONFIG_KEYS, EASYOCR_CONFIG_KEYS,
)

SERVICES = {
    "paddle": {
        "payload_fn": paddle_ocr_payload,
        "search_space": PADDLE_OCR_SEARCH_SPACE,
        "config_keys": PADDLE_OCR_CONFIG_KEYS,
        "name": "PaddleOCR",
    },
    "doctr": {
        "payload_fn": doctr_payload,
        "search_space": DOCTR_SEARCH_SPACE,
        "config_keys": DOCTR_CONFIG_KEYS,
        "name": "DocTR",
    },
    "easyocr": {
        "payload_fn": easyocr_payload,
        "search_space": EASYOCR_SEARCH_SPACE,
        "config_keys": EASYOCR_CONFIG_KEYS,
        "name": "EasyOCR",
    },
}


def main():
    parser = argparse.ArgumentParser(description="Run OCR hyperparameter tuning")
    parser.add_argument("--service", choices=["paddle", "doctr", "easyocr"], required=True)
    parser.add_argument("--host", type=str, default="localhost", help="OCR service host")
    parser.add_argument("--port", type=int, default=8000, help="OCR service port")
    parser.add_argument("--samples", type=int, default=64, help="Number of samples")
    args = parser.parse_args()

    # Set environment variable for raytune_ocr module
    os.environ["OCR_HOST"] = args.host

    cfg = SERVICES[args.service]
    ports = [args.port]
    print(f"\n{'='*50}")
    print(f"Hyperparameter Tuning: {cfg['name']}")
    print(f"Host: {args.host}:{args.port}")
    print(f"Samples: {args.samples}")
    print(f"{'='*50}\n")

    # Check workers
    healthy = check_workers(ports, cfg["name"])

    # Create trainable and run tuning
    trainable = create_trainable(ports, cfg["payload_fn"])
    results = run_tuner(
        trainable=trainable,
        search_space=cfg["search_space"],
        num_samples=args.samples,
        num_workers=len(healthy),
    )

    # Analyze results
    analyze_results(
        results,
        output_folder="results",
        prefix=f"raytune_{args.service}",
        config_keys=cfg["config_keys"],
    )

    print(f"\n{'='*50}")
    print("Tuning complete!")
    print(f"{'='*50}")


if __name__ == "__main__":
    main()

File diff suppressed because it is too large