# Claude Code Context - Masters Thesis OCR Project

## Project Overview

This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.

**Author:** Sergio Jiménez Jiménez
**University:** UNIR (Universidad Internacional de La Rioja)
**Year:** 2025

## Key Context

### Why Hyperparameter Optimization Instead of Fine-tuning

The project chose **hyperparameter optimization** over fine-tuning because:
- Fine-tuning requires extensive labeled datasets specific to the domain
- Hyperparameter tuning can improve pretrained models without retraining
- GPU acceleration (RTX 3060) enables efficient exploration of hyperparameter space

### Main Results (GPU - Jan 2026)

| Model | CER | Character Accuracy |
|-------|-----|-------------------|
| PaddleOCR Baseline | 8.85% | 91.15% |
| PaddleOCR-HyperAdjust (full dataset) | **7.72%** | **92.28%** |
| PaddleOCR-HyperAdjust (best trial) | **0.79%** | **99.21%** |

**Goal status:** CER < 2% was achieved in the best trial (0.79%). On the full dataset, CER dropped from 8.85% to 7.72%, a 12.8% relative reduction.

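As a quick check of the figures in the results table, the following minimal sketch recomputes the relative CER reduction and the character accuracy. The helper name `relative_reduction` is an illustrative assumption, not a function from the repository; the three CER values are the ones reported in the table.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of `optimized` with respect to `baseline`, in percent."""
    return (baseline - optimized) / baseline * 100


# CER values (in %) as reported in the Main Results table.
baseline_cer = 8.85
full_dataset_cer = 7.72
best_trial_cer = 0.79

# Character accuracy is simply the complement of CER.
print(f"Full-dataset accuracy: {100 - full_dataset_cer:.2f}%")
print(f"Best-trial accuracy: {100 - best_trial_cer:.2f}%")
print(f"Full-dataset relative CER reduction: "
      f"{relative_reduction(baseline_cer, full_dataset_cer):.1f}%")
```

Only the full-dataset figure is compared against the baseline here. The best-trial CER comes from the tuning subset, so it is not directly comparable to a baseline measured on the full dataset.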
### Optimal Configuration Found (GPU)

```python
config_optimizada = {
    "textline_orientation": True,           # CRITICAL for complex layouts
    "use_doc_orientation_classify": True,   # Improves document orientation
    "use_doc_unwarping": False,
    "text_det_thresh": 0.0462,              # -0.52 correlation with CER
    "text_det_box_thresh": 0.4862,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.5658,
}
```

### Key Findings

1. `textline_orientation=True` is critical for documents with mixed layouts
2. `use_doc_orientation_classify=True` improves document orientation detection in GPU config
3. `text_det_thresh` has -0.52 correlation with CER; values < 0.01 cause catastrophic failures
4. `use_doc_unwarping=False` is optimal for digital PDFs (unnecessary processing)

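A correlation such as the one quoted for `text_det_thresh` can be recomputed from a trial-results CSV. The sketch below is a minimal illustration that assumes the CSV has one column per hyperparameter and a `CER` column. Those column names, the helper `param_cer_correlation`, and the toy rows are assumptions, not the layout or contents of the project's result files.

```python
import csv
import io
from statistics import correlation  # available from Python 3.10


def param_cer_correlation(csv_file, param: str, metric: str = "CER") -> float:
    """Pearson correlation between one hyperparameter column and the metric column."""
    rows = list(csv.DictReader(csv_file))
    xs = [float(row[param]) for row in rows]
    ys = [float(row[metric]) for row in rows]
    return correlation(xs, ys)


# Toy rows for illustration only; these are NOT values from the thesis experiments.
toy_csv = io.StringIO(
    "text_det_thresh,CER\n"
    "0.05,0.30\n"
    "0.10,0.20\n"
    "0.20,0.15\n"
    "0.40,0.10\n"
)
r = param_cer_correlation(toy_csv, "text_det_thresh")
print(f"Pearson r = {r:.2f}")
```

With a real results file, the same function can be called on a handle opened with `open(path, newline="")`, after adjusting `param` and `metric` to the actual column headers.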
## Repository Structure

```
MastersThesis/
├── docs/                                   # Thesis chapters in Markdown (UNIR template structure)
│   ├── 00_resumen.md                       # Resumen + Abstract + Keywords
│   ├── 01_introduccion.md                  # 1. Introducción (1.1, 1.2, 1.3)
│   ├── 02_contexto_estado_arte.md          # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
│   ├── 03_objetivos_metodologia.md         # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
│   ├── 04_desarrollo_especifico.md         # 4. Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md   # 5. Conclusiones (5.1, 5.2)
│   ├── 06_referencias_bibliograficas.md    # Referencias bibliográficas (APA format)
│   └── 07_anexo_a.md                       # Anexo A: Código fuente y datos
├── thesis_output/                          # Generated thesis document
│   ├── plantilla_individual.htm            # Complete TFM (open in Word)
│   └── figures/                            # PNG figures from Mermaid diagrams
│       ├── figura_1.png ... figura_7.png
│       └── figures_manifest.json
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb                  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                                     # CLI evaluation script
│   ├── dataset_manager.py                                       # ImageTextDataset class
│   ├── prepare_dataset.ipynb                                    # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv       # 64 trial results
├── results/                                # Benchmark results CSVs
├── instructions/                           # UNIR instructions and template
│   ├── instrucciones.pdf                   # TFE writing guidelines
│   ├── plantilla_individual.pdf            # Word template (PDF version)
│   └── plantilla_individual.htm            # Word template (HTML version, source)
├── apply_content.py                        # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py             # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb            # Initial OCR benchmark
└── README.md
```

### docs/ to Template Mapping

The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
|-----------------|------------|-------|
| Resumen | `00_resumen.md` (Spanish part) | 150-300 words + Palabras clave |
| Abstract | `00_resumen.md` (English part) | 150-300 words + Keywords |
| 1. Introducción | `01_introduccion.md` | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | `02_contexto_estado_arte.md` | Subsections 2.1, 2.2, 2.3 + Mermaid diagrams |
| 3. Objetivos y metodología | `03_objetivos_metodologia.md` | Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams |
| 4. Desarrollo específico | `04_desarrollo_especifico.md` | Subsections 4.1, 4.2, 4.3 + Mermaid charts |
| 5. Conclusiones y trabajo futuro | `05_conclusiones_trabajo_futuro.md` | Subsections 5.1, 5.2 |
| Referencias bibliográficas | `06_referencias_bibliograficas.md` | APA, alphabetical |
| Anexo A | `07_anexo_a.md` | Repository URL + structure |

## Important Data Files

### Results CSV Files (GPU - PRIMARY)
- `src/results/raytune_paddle_results_20260119_122609.csv` - 64 Ray Tune trials PaddleOCR GPU (PRIMARY)
- `src/results/raytune_easyocr_results_20260119_120204.csv` - 64 Ray Tune trials EasyOCR GPU
- `src/results/raytune_doctr_results_20260119_121445.csv` - 64 Ray Tune trials DocTR GPU

### Results CSV Files (CPU - time reference only)
- `src/raytune_paddle_subproc_results_20251207_192320.csv` - CPU execution for time comparison (69.4s/page vs 0.84s/page GPU)

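The CPU and GPU timings quoted for the time comparison translate into a speedup factor as sketched below. The per-page timings are the two values given in the list of result files; the 24-page count is the dataset size stated elsewhere in this document, and the variable names are illustrative.

```python
# Per-page timings quoted for the CPU and GPU runs.
cpu_seconds_per_page = 69.4
gpu_seconds_per_page = 0.84
dataset_pages = 24  # dataset size stated in the Dataset section

speedup = cpu_seconds_per_page / gpu_seconds_per_page
cpu_total = cpu_seconds_per_page * dataset_pages
gpu_total = gpu_seconds_per_page * dataset_pages

print(f"GPU speedup: {speedup:.1f}x")
print(f"One pass over {dataset_pages} pages: CPU {cpu_total:.1f} s, GPU {gpu_total:.2f} s")
```

These totals are a projection from the per-page averages, not separately measured wall-clock times.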
### Key Scripts
- `src/run_tuning.py` - Main Ray Tune optimization script
- `src/raytune/raytune_ocr.py` - Ray Tune utilities and search spaces
- `src/paddle_ocr/paddle_ocr_tuning_rest.py` - PaddleOCR REST API

## Technical Stack

| Component | Version |
|-----------|---------|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

## Pending Work

### Completed Tasks
- [x] **Structure docs/ to match UNIR template** - All chapters now follow exact numbering (1.1, 1.2, etc.)
- [x] **Add Mermaid diagrams** - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
- [x] **Generate unified thesis document** - `apply_content.py` generates complete document from docs/
- [x] **Convert Mermaid to PNG** - `generate_mermaid_figures.py` generates figures automatically
- [x] **Proper template formatting** - Tables/figures use `Piedefoto-tabla` class, references use `MsoBibliography`

### Priority Tasks
1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Use larger tuning subset** - Current 5 pages caused overfitting; recommend 15-20 pages
3. **Create presentation slides** - For thesis defense
4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting

### Optional Extensions
- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning
- Multi-objective optimization (CER + WER + inference time)

## Thesis Document Generation

To regenerate the thesis document:

```bash
# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py

# 2. Apply docs/ content to UNIR template
python3 apply_content.py

# 3. Open in Word and finalize
#    - Open thesis_output/plantilla_individual.htm in Microsoft Word
#    - Press Ctrl+A then F9 to update all indices
#    - Save as .docx
```

**What `apply_content.py` does:**
- Replaces Resumen and Abstract with actual content + keywords
- Replaces all 5 chapters with content from docs/
- Replaces Referencias with APA-formatted bibliography
- Replaces Anexo with repository information
- Converts Mermaid diagrams to embedded PNG images
- Formats tables with `Piedefoto-tabla` captions and sources
- Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)

---

## UNIR TFE Document Guidelines

**CRITICAL:** The thesis MUST follow UNIR's official template (`instructions/plantilla_individual.pdf`) and guidelines (`instructions/instrucciones.pdf`).

### Work Type Classification

This thesis is a **hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones)**:
- Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
- Experimental pilot with Ray Tune hyperparameter optimization
- 64 trials executed, results analyzed statistically

### Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
|---------|-------------|-------|
| **Portada** | Title, Author, Type, Director, Date | Use template format exactly |
| **Resumen** | 150-300 words + 3-5 Palabras clave | Spanish summary |
| **Abstract** | 150-300 words + 3-5 Keywords | English summary |
| **Índice de contenidos** | Auto-generated | New page |
| **Índice de figuras** | Auto-generated | New page |
| **Índice de tablas** | Auto-generated | New page |
| **1. Introducción** | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| **2. Contexto y estado del arte** | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| **3. Objetivos concretos y metodología** | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| **4. Desarrollo específico** | Varies by work type (see below) | Main content |
| **5. Conclusiones y trabajo futuro** | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| **Referencias bibliográficas** | APA format, alphabetical, hanging indent | Variable |
| **Anexo A** | Código fuente y datos analizados | Repository URL |

**Total length:** 50-90 pages (excluding cover, resumen, abstract, indices, annexes)

### Chapter-Specific Requirements (from plantilla_individual.pdf)

#### 1. Introducción
The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

**1.1 Motivación:**
- Present the problem to solve
- Justify importance to educational/scientific community
- Answer: What problem? What are the causes? Why is it relevant?
- Must include references to prior research

**1.2 Planteamiento del trabajo:**
- Briefly state the problem/need detected
- Describe the proposal and purpose
- Answer: How to solve? What is proposed?

**1.3 Estructura del trabajo:**
- Briefly describe what each subsequent chapter contains

#### 2. Contexto y estado del arte
Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

**2.1 Contexto del problema:**
- Deep study of the application domain

**2.2 Estado del arte:**
- Antecedents, current studies, comparison of existing tools
- Must reference key authors in the field (justify exclusions)

**2.3 Conclusiones:**
- Summary linking research to the work to be done
- How findings affect the specific development

#### 3. Objetivos concretos y metodología de trabajo
Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

**3.1 Objetivo general:**
- Must be SMART (Doran, 1981)
- Focus on achieving an observable effect, not just "create a tool"
- Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

**3.2 Objetivos específicos:**
- Divide general objective into analyzable sub-objectives
- Must be SMART
- Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
- Typically ~5 objectives: 1-2 about state of art, 2-3 about development

**3.3 Metodología del trabajo:**
- Describe steps to achieve objectives
- Explain WHY each step
- What instruments will be used
- How results will be analyzed

#### 4. Desarrollo específico de la contribución
Structure depends on work type. Organize by methodology phases/activities.

**For Type 1 (Piloto experimental):**
- 4.1 Descripción detallada del experimento
  - Technologies used (with justification)
  - How pilot was organized
  - Participants (demographics)
  - Automatic evaluation techniques
  - How experiment proceeded
  - Monitoring/evaluation instruments
  - Statistical analysis types
- 4.2 Descripción de los resultados (objective, no interpretation)
  - Summary tables, result graphs, relevant data identification
- 4.3 Discusión
  - Relevance of results, explanations for anomalies, highlight key findings

**For Type 3 (Comparativa de soluciones):**
- 4.1 Planteamiento de la comparativa
  - Problem identification, alternative solutions to evaluate
  - Success criteria, measures to take
- 4.2 Desarrollo de la comparativa
  - All results and measurements obtained
  - Graphs, tables, data visualization
- 4.3 Discusión y análisis de resultados
  - Discussion of meaning, advantages/disadvantages of solutions

#### 5. Conclusiones y trabajo futuro

**5.1 Conclusiones:**
- Summary of problem, approach, and why solution is valid
- Summary of contributions
- **Relate contributions and results to objectives** - discuss degree of achievement

**5.2 Líneas de trabajo futuro:**
- Future work that would add value
- Justify how contribution can be used and in what fields

### SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

| Criterion | Requirement | Example from this thesis |
|-----------|-------------|-------------------------|
| **S**pecific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| **M**easurable | Quantifiable success metric | "CER < 2%" |
| **A**ttainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| **R**elevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| **T**ime-bound | Achievable in timeframe | "Un cuatrimestre" |

### Citation and Reference Rules

#### APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

**In-text citations:**
- Single author: (Du, 2020) or Du (2020)
- Two authors: (Du & Li, 2020)
- Three+ authors: (Du et al., 2020)

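The three parenthetical patterns listed for in-text citations can be expressed as a small formatter. This is a minimal sketch covering only those parenthetical forms; the function name `in_text_citation` is a hypothetical helper, not part of the repository scripts.

```python
def in_text_citation(surnames: list[str], year: int) -> str:
    """Build a parenthetical APA in-text citation from author surnames and a year."""
    if not surnames:
        raise ValueError("at least one author surname is required")
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = f"{surnames[0]} & {surnames[1]}"
    else:
        # Three or more authors collapse to the first surname plus "et al."
        authors = f"{surnames[0]} et al."
    return f"({authors}, {year})"


print(in_text_citation(["Du"], 2020))
print(in_text_citation(["Du", "Li"], 2020))
print(in_text_citation(["Du", "Li", "Guo"], 2020))
```

Narrative citations such as "Du (2020)" are left out on purpose, since the list only shows that form for a single author.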
**Reference list examples:**
```
# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
    for image-based sequence recognition. IEEE Transactions on Pattern
    Analysis and Machine Intelligence, 39(11), 2298-2304.
    https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
    A next-generation hyperparameter optimization framework. Proceedings
    of the 25th ACM SIGKDD, 2623-2631.
    https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
    lightweight OCR system. arXiv preprint arXiv:2009.09941.
    https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
    on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
    (2nd ed.). Lawrence Erlbaum Associates.
```

#### Reference Rules
- **NO Wikipedia citations**
- Include variety: books, conferences, journal articles (not just URLs)
- All cited references must appear in reference list
- All references in list must be cited in text
- Order alphabetically by first author's surname
- Include DOI or URL when available

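The two-way consistency rule (every citation is listed, every listed reference is cited) and the alphabetical-order rule lend themselves to a simple set-based check. The sketch below assumes references have already been reduced to "Surname Year" keys; the key format, both helper names, and the sample keys are illustrative assumptions.

```python
def reference_mismatches(cited: set[str], listed: set[str]) -> tuple[list[str], list[str]]:
    """Return (cited but missing from the list, listed but never cited)."""
    missing_from_list = sorted(cited - listed)
    never_cited = sorted(listed - cited)
    return missing_from_list, never_cited


def is_alphabetical(reference_keys: list[str]) -> bool:
    """True if the keys are already in case-insensitive alphabetical order."""
    return reference_keys == sorted(reference_keys, key=str.casefold)


# Illustrative keys only, not the actual thesis bibliography.
cited_keys = {"Akiba 2019", "Du 2020", "Shi 2016"}
listed_keys = ["Akiba 2019", "Cohen 1988", "Shi 2016"]

missing, unused = reference_mismatches(cited_keys, set(listed_keys))
print("Cited but not listed:", missing)
print("Listed but not cited:", unused)
print("Alphabetical:", is_alphabetical(listed_keys))
```

Extracting the keys from the Markdown chapters is out of scope for this sketch and would depend on how citations are written in `docs/`.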
### Document Formatting Rules

#### Page Setup
| Element | Specification |
|---------|--------------|
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |

#### Typography
| Element | Format |
|---------|--------|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |

#### Tables and Figures (from plantilla_individual.pdf)

**Table format example:**
```
Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.
```

**Figure format example:**
```
Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.
```

**Rules:**
- **Title position**: Above the table/figure
- **Numbering format**: "**Tabla 1.**" / "**Figura 1.**" (Calibri 12, bold)
- **Title text**: Calibri 12, italic (after the number)
- **Source**: Below, centered, format "Fuente: Author, Year."
- Can reduce font to 9pt for dense tables
- Can use landscape orientation for large tables
- Tables should have horizontal lines only (no vertical lines) per APA style

### Writing Style Rules

#### MUST DO:
- Each chapter starts with an introductory paragraph explaining its content
- Each paragraph has at least 3 sentences
- Verify originality (cite all sources)
- Check spelling with Word corrector
- Ensure logical flow between paragraphs
- Define concepts and include pertinent citations

#### MUST NOT DO:
- Two consecutive headings without text between them
- Superfluous phrases and repetition of ideas
- Short paragraphs (fewer than 3 sentences)
- Missing figure/table numbers or titles
- Broken index generation

### Annexes Requirements

**Anexo A - Código fuente y datos:**
- Include repository URL where code is hosted
- Student must be sole author and owner of repository
- No commits from other users
- Data used should also be in repository
- If confidential (company project), justify why not shared

### Final Submission

- **Drafts**: Submit in Word format
- **Final deposit**: Submit in PDF format
- Verify all indices generate correctly before final submission

---

## Guidelines for Claude

### CRITICAL: Academic Rigor Requirements

**This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.**

#### DO NOT:
- **NEVER fabricate data or statistics** - Every number must come from an actual file in this repository
- **NEVER invent comparison results** - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
- **NEVER assume or estimate values** - If a metric isn't in the CSV/notebook, don't include it
- **NEVER extrapolate beyond what the data shows** - 24 pages is a limited dataset, acknowledge this
- **NEVER claim results that weren't measured** - Only report what was actually computed

#### ALWAYS:
- **Read the source file first** before citing any result
- **Quote exact values** from CSV files (e.g., CER 0.011535 not "approximately 1%")
- **Reference the specific file and location** for every data point
- **Acknowledge limitations** explicitly (dataset size, CPU-only, single document type)
- **Distinguish between measured results and interpretations**

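One way to follow the "quote exact values" and "reference the specific location" rules is to read the figure straight from the CSV and keep it as the raw string, together with its row number. The sketch below assumes a `CER` column; that column name, the helper `min_metric`, and the toy rows are assumptions for illustration, not the layout or contents of the project's result files.

```python
import csv
import io


def min_metric(csv_file, metric: str = "CER") -> tuple[int, str]:
    """Return (1-based data row number, raw string value) of the smallest metric."""
    rows = list(csv.DictReader(csv_file))
    row_number, row = min(
        enumerate(rows, start=1),
        key=lambda item: float(item[1][metric]),
    )
    # Returning the raw string keeps every digit exactly as stored in the file.
    return row_number, row[metric]


# Toy rows for illustration only; these are NOT values from the thesis experiments.
toy_csv = io.StringIO(
    "trial_id,CER\n"
    "a,0.0525\n"
    "b,0.0210\n"
    "c,0.0480\n"
)
row_number, value = min_metric(toy_csv)
print(f"Minimum CER {value} found in data row {row_number}")
```

Comparing as floats but reporting the stored string avoids rounding or reformatting the value on its way into the thesis text.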
#### Data Sources (ONLY use these):
| Data Type | Source File |
|-----------|-------------|
| Ray Tune 64 trials | `src/raytune_paddle_subproc_results_20251207_192320.csv` |
| Experiment code | `src/paddle_ocr_fine_tune_unir_raytune.ipynb` |
| Final comparison | Output cells in the notebook (baseline vs optimized) |

#### Example of WRONG vs RIGHT:

**WRONG:** "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER"
(We don't have this comparison data in our results files)

**RIGHT:** "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in `paddle_ocr_fine_tune_unir_raytune.ipynb`)"

**WRONG:** "The optimization improved results by approximately 80%"

**RIGHT:** "From the 64 trials in `raytune_paddle_subproc_results_20251207_192320.csv`, minimum CER achieved was 1.15%"

### When Working on Documentation

1. **Read UNIR guidelines first**: Check `instructions/instrucciones.pdf` for structure requirements

2. **Follow chapter structure**: Each chapter has specific content requirements per UNIR guidelines

3. **References are UNIFIED**: All references go in `docs/06_referencias_bibliograficas.md`, NOT per-chapter

4. **Use APA format**: All citations must follow APA style

5. **Include "Fuentes de datos"**: Each chapter should list which repository files the data came from

6. **Language**: Documentation is in Spanish (thesis requirement), code comments in English

7. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation

8. **When in doubt, ask**: If the user requests data that doesn't exist, ask rather than inventing numbers

9. **DIAGRAMS MUST BE IN MERMAID FORMAT**: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:
   - Version control friendly (text-based)
   - Consistent styling across all chapters
   - Easy to edit and maintain
   - Renders properly in GitHub and most Markdown viewers

**Supported Mermaid diagram types:**
- `flowchart` / `graph` - For pipelines, workflows, architectures
- `xychart-beta` - For bar charts, comparisons
- `sequenceDiagram` - For process interactions
- `classDiagram` - For class structures
- `stateDiagram` - For state machines
- `pie` - For proportional data

**Example:**
```mermaid
flowchart LR
    A[Input] --> B[Process] --> C[Output]
```

### Common Tasks

- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
- **Updating documentation**: Edit files in `docs/`
- **Adding references**: Add to `docs/06_referencias_bibliograficas.md` (unified list)
- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI

---

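## Experiment Details

### Ray Tune Configuration
```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

# `trainable_paddle_ocr` is defined in the experiment code; the search space
# (normally passed as `param_space`) is not shown in this excerpt.
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",              # objective reported by each trial
        mode="min",                # lower CER is better
        search_alg=OptunaSearch(),
        num_samples=64,            # total number of trials
        max_concurrent_trials=2,
    ),
)
```
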
### Dataset
- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF

### Metrics
- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using `jiwer` library
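
For reference, both metrics are edit-distance ratios: CER counts character-level insertions, deletions and substitutions relative to the reference length, and WER does the same over words. The following is a minimal standard-library sketch of those definitions. It is not `jiwer`'s implementation, and values can differ from `jiwer`'s if any text normalization is applied before scoring.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# A single missing accent is one character error and one word error.
print(cer("el texto académico", "el texto academico"))
print(wer("el texto académico", "el texto academico"))
```

The example shows why CER is the more forgiving metric for OCR output: one wrong character affects a small fraction of the characters but invalidates the whole word.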