deliberable_16_12_2025

2025-12-16 00:53:27 +01:00
parent 6d6bebfed9
commit 57df34ac5a
88 changed files with 17836 additions and 1467 deletions
--- a/claude.md
+++ b/claude.md
@@ -0,0 +1,543 @@
+# Claude Code Context - Master's Thesis OCR Project
+
+## Project Overview
+
+This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.
+
+**Author:** Sergio Jiménez Jiménez
+**University:** UNIR (Universidad Internacional de La Rioja)
+**Year:** 2025
+
+## Key Context
+
+### Why Hyperparameter Optimization Instead of Fine-tuning
+
+Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
+- Fine-tuning deep learning models without GPU is prohibitively slow
+- Inference time is ~69 seconds/page on CPU
+- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction
+
+### Main Results
+
+| Model | CER | Character Accuracy |
+|-------|-----|-------------------|
+| PaddleOCR Baseline | 7.78% | 92.22% |
+| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |
+
+**Goal achieved:** the resulting CER of 1.49% is below the < 2% target.
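+
+As a quick sanity check, the relative reduction and the accuracy column can be recomputed from the rounded values in the table above. This is only a sketch: the helper name `relative_reduction` is hypothetical, and the inputs are the two-decimal table values, which give roughly 80.8%. The 80.9% quoted in this file presumably comes from the unrounded CERs in the notebook, which is an assumption here.
+
+```python
+def relative_reduction(baseline: float, optimized: float) -> float:
+    """Relative reduction from baseline to optimized, in percent."""
+    return (baseline - optimized) / baseline * 100
+
+
+baseline_cer = 7.78   # % CER, PaddleOCR Baseline (rounded table value)
+optimized_cer = 1.49  # % CER, PaddleOCR-HyperAdjust (rounded table value)
+
+reduction = relative_reduction(baseline_cer, optimized_cer)
+
+# Character accuracy in the table is the complement of CER.
+baseline_accuracy = 100 - baseline_cer
+optimized_accuracy = 100 - optimized_cer
+
+print(f"reduction: {reduction:.2f}%")
+print(f"accuracy: {baseline_accuracy:.2f}% -> {optimized_accuracy:.2f}%")
+```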
+
+### Optimal Configuration Found
+
+```python
+config_optimizada = {
+    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
+    "use_doc_orientation_classify": False,
+    "use_doc_unwarping": False,
+    "text_det_thresh": 0.4690,
+    "text_det_box_thresh": 0.5412,
+    "text_det_unclip_ratio": 0.0,
+    "text_rec_score_thresh": 0.6350,
+}
+```
+
+### Key Findings
+
+1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
+2. `text_det_thresh` has a correlation of -0.52 with CER; values below 0.1 cause catastrophic failures
+3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs
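+
+A correlation such as the one in finding 2 could be recomputed from a trials CSV along these lines. This is a minimal sketch, not the notebook's code: the column names `text_det_thresh` and `CER` are assumptions about the CSV header, and the rows below are synthetic placeholders, not experiment data.
+
+```python
+import csv
+import io
+from statistics import correlation  # Pearson's r, available in Python 3.10+
+
+
+def column_correlation(csv_text: str, x_col: str, y_col: str) -> float:
+    """Pearson correlation between two numeric columns of a CSV document."""
+    rows = list(csv.DictReader(io.StringIO(csv_text)))
+    xs = [float(row[x_col]) for row in rows]
+    ys = [float(row[y_col]) for row in rows]
+    return correlation(xs, ys)
+
+
+# Synthetic rows for illustration only; NOT values from the thesis experiment.
+sample = "text_det_thresh,CER\n0.1,0.30\n0.3,0.10\n0.5,0.05\n0.7,0.04\n"
+r = column_correlation(sample, "text_det_thresh", "CER")
+print(f"r = {r:.2f}")
+```
+
+To use it on real data, read the text of the trials CSV from disk and pass the actual column names found in its header.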
+
+## Repository Structure
+
+```
+MastersThesis/
+├── docs/                    # Thesis chapters in Markdown (UNIR template structure)
+│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
+│   ├── 01_introduccion.md                 # 1. Introducción (1.1, 1.2, 1.3)
+│   ├── 02_contexto_estado_arte.md         # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
+│   ├── 03_objetivos_metodologia.md        # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
+│   ├── 04_desarrollo_especifico.md        # 4. Desarrollo específico (4.1, 4.2, 4.3)
+│   ├── 05_conclusiones_trabajo_futuro.md  # 5. Conclusiones (5.1, 5.2)
+│   ├── 06_referencias_bibliograficas.md   # Referencias bibliográficas (APA format)
+│   └── 07_anexo_a.md                      # Anexo A: Código fuente y datos
+├── thesis_output/           # Generated thesis document
+│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
+│   └── figures/                           # PNG figures from Mermaid diagrams
+│       ├── figura_1.png ... figura_7.png
+│       └── figures_manifest.json
+├── src/
+│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
+│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
+│   ├── dataset_manager.py                        # ImageTextDataset class
+│   ├── prepare_dataset.ipynb                     # Dataset preparation
+│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
+├── results/                 # Benchmark results CSVs
+├── instructions/            # UNIR instructions and template
+│   ├── instrucciones.pdf         # TFE writing guidelines
+│   ├── plantilla_individual.pdf  # Word template (PDF version)
+│   └── plantilla_individual.htm  # Word template (HTML version, source)
+├── apply_content.py         # Generates TFM document from docs/ + template
+├── generate_mermaid_figures.py  # Converts Mermaid diagrams to PNG
+├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
+└── README.md
+```
+
+### docs/ to Template Mapping
+
+The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ files now match this structure exactly:
+
+| Template Section | docs/ File | Notes |
+|-----------------|------------|-------|
+| Resumen | `00_resumen.md` (Spanish part) | 150-300 words + Palabras clave |
+| Abstract | `00_resumen.md` (English part) | 150-300 words + Keywords |
+| 1. Introducción | `01_introduccion.md` | Subsections 1.1, 1.2, 1.3 |
+| 2. Contexto y estado del arte | `02_contexto_estado_arte.md` | Subsections 2.1, 2.2, 2.3 + Mermaid diagrams |
+| 3. Objetivos y metodología | `03_objetivos_metodologia.md` | Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams |
+| 4. Desarrollo específico | `04_desarrollo_especifico.md` | Subsections 4.1, 4.2, 4.3 + Mermaid charts |
+| 5. Conclusiones y trabajo futuro | `05_conclusiones_trabajo_futuro.md` | Subsections 5.1, 5.2 |
+| Referencias bibliográficas | `06_referencias_bibliograficas.md` | APA, alphabetical |
+| Anexo A | `07_anexo_a.md` | Repository URL + structure |
+
+## Important Data Files
+
+### Results CSV Files
+- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)
+
+### Key Notebooks
+- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
+- `src/prepare_dataset.ipynb` - PDF to image/text conversion
+- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison
+
+## Technical Stack
+
+| Component | Version |
+|-----------|---------|
+| Python | 3.11.9 |
+| PaddlePaddle | 3.2.2 |
+| PaddleOCR | 3.3.2 |
+| Ray | 2.52.1 |
+| Optuna | 4.6.0 |
+
+## Pending Work
+
+### Completed Tasks
+- [x] **Structure docs/ to match UNIR template** - All chapters now follow exact numbering (1.1, 1.2, etc.)
+- [x] **Add Mermaid diagrams** - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
+- [x] **Generate unified thesis document** - `apply_content.py` generates complete document from docs/
+- [x] **Convert Mermaid to PNG** - `generate_mermaid_figures.py` generates figures automatically
+- [x] **Proper template formatting** - Tables/figures use `Piedefoto-tabla` class, references use `MsoBibliography`
+
+### Priority Tasks
+1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
+2. **Expand dataset** - Current dataset has only 24 pages
+3. **Create presentation slides** - For thesis defense
+4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting
+
+### Optional Extensions
+- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
+- Compare with actual fine-tuning (if GPU access obtained)
+- Multi-objective optimization (CER + WER + inference time)
+
+## Thesis Document Generation
+
+To regenerate the thesis document:
+
+```bash
+# 1. Generate PNG figures from Mermaid diagrams
+python3 generate_mermaid_figures.py
+
+# 2. Apply docs/ content to UNIR template
+python3 apply_content.py
+
+# 3. Open in Word and finalize
+# - Open thesis_output/plantilla_individual.htm in Microsoft Word
+# - Press Ctrl+A then F9 to update all indices
+# - Save as .docx
+```
+
+**What `apply_content.py` does:**
+- Replaces Resumen and Abstract with actual content + keywords
+- Replaces all 5 chapters with content from docs/
+- Replaces Referencias with APA-formatted bibliography
+- Replaces Anexo with repository information
+- Converts Mermaid diagrams to embedded PNG images
+- Formats tables with `Piedefoto-tabla` captions and sources
+- Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)
+
+---
+
+## UNIR TFE Document Guidelines
+
+**CRITICAL:** The thesis MUST follow UNIR's official template (`instructions/plantilla_individual.pdf`) and guidelines (`instructions/instrucciones.pdf`).
+
+### Work Type Classification
+
+This thesis is a **hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones)**:
+- Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
+- Experimental pilot with Ray Tune hyperparameter optimization
+- 64 trials executed, results analyzed statistically
+
+### Document Structure (from plantilla_individual.pdf - MANDATORY)
+
+The TFE must follow this EXACT structure from the official template:
+
+| Section | Subsections | Notes |
+|---------|-------------|-------|
+| **Portada** | Title, Author, Type, Director, Date | Use template format exactly |
+| **Resumen** | 150-300 words + 3-5 Palabras clave | Spanish summary |
+| **Abstract** | 150-300 words + 3-5 Keywords | English summary |
+| **Índice de contenidos** | Auto-generated | New page |
+| **Índice de figuras** | Auto-generated | New page |
+| **Índice de tablas** | Auto-generated | New page |
+| **1. Introducción** | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
+| **2. Contexto y estado del arte** | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
+| **3. Objetivos concretos y metodología** | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
+| **4. Desarrollo específico** | Varies by work type (see below) | Main content |
+| **5. Conclusiones y trabajo futuro** | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
+| **Referencias bibliográficas** | APA format, alphabetical, hanging indent | Variable |
+| **Anexo A** | Código fuente y datos analizados | Repository URL |
+
+**Total length:** 50-90 pages (excluding cover, resumen, abstract, indices, annexes)
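+
+The Resumen and Abstract limits in the table above (150-300 words, 3-5 keywords) are easy to check mechanically. The sketch below is one possible check; `check_summary` is a hypothetical helper, and splitting on whitespace is only an approximation of how Word counts words.
+
+```python
+def check_summary(text: str, keywords: list[str]) -> list[str]:
+    """Return a list of problems; an empty list means both limits are met."""
+    problems = []
+    n_words = len(text.split())  # whitespace split, approximate word count
+    if not 150 <= n_words <= 300:
+        problems.append(f"word count {n_words} outside 150-300")
+    if not 3 <= len(keywords) <= 5:
+        problems.append(f"{len(keywords)} keywords, expected 3-5")
+    return problems
+
+
+# Placeholder text, only to show the call shape.
+print(check_summary("palabra " * 200, ["OCR", "CER", "Optuna"]))
+```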
+
+### Chapter-Specific Requirements (from plantilla_individual.pdf)
+
+#### 1. Introducción
+The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.
+
+**1.1 Motivación:**
+- Present the problem to solve
+- Justify importance to educational/scientific community
+- Answer: What problem? What are the causes? Why is it relevant?
+- Must include references to prior research
+
+**1.2 Planteamiento del trabajo:**
+- Briefly state the problem/need detected
+- Describe the proposal and purpose
+- Answer: How to solve? What is proposed?
+
+**1.3 Estructura del trabajo:**
+- Briefly describe what each subsequent chapter contains
+
+#### 2. Contexto y estado del arte
+Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).
+
+**2.1 Contexto del problema:**
+- Deep study of the application domain
+
+**2.2 Estado del arte:**
+- Antecedents, current studies, comparison of existing tools
+- Must reference key authors in the field (justify exclusions)
+
+**2.3 Conclusiones:**
+- Summary linking research to the work to be done
+- How findings affect the specific development
+
+#### 3. Objetivos concretos y metodología de trabajo
+Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.
+
+**3.1 Objetivo general:**
+- Must be SMART (Doran, 1981)
+- Focus on achieving an observable effect, not just "create a tool"
+- Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"
+
+**3.2 Objetivos específicos:**
+- Divide general objective into analyzable sub-objectives
+- Must be SMART
+- Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
+- Typically ~5 objectives: 1-2 about the state of the art, 2-3 about development
+
+**3.3 Metodología del trabajo:**
+- Describe the steps taken to achieve the objectives
+- Explain WHY each step is needed
+- State which instruments will be used
+- Explain how results will be analyzed
+
+#### 4. Desarrollo específico de la contribución
+Structure depends on work type. Organize by methodology phases/activities.
+
+**For Type 1 (Piloto experimental):**
+- 4.1 Descripción detallada del experimento
+  - Technologies used (with justification)
+  - How pilot was organized
+  - Participants (demographics)
+  - Automatic evaluation techniques
+  - How experiment proceeded
+  - Monitoring/evaluation instruments
+  - Statistical analysis types
+- 4.2 Descripción de los resultados (objective, no interpretation)
+  - Summary tables, result graphs, relevant data identification
+- 4.3 Discusión
+  - Relevance of results, explanations for anomalies, highlight key findings
+
+**For Type 3 (Comparativa de soluciones):**
+- 4.1 Planteamiento de la comparativa
+  - Problem identification, alternative solutions to evaluate
+  - Success criteria, measures to take
+- 4.2 Desarrollo de la comparativa
+  - All results and measurements obtained
+  - Graphs, tables, data visualization
+- 4.3 Discusión y análisis de resultados
+  - Discussion of meaning, advantages/disadvantages of solutions
+
+#### 5. Conclusiones y trabajo futuro
+
+**5.1 Conclusiones:**
+- Summary of the problem, the approach, and why the solution is valid
+- Summary of contributions
+- **Relate contributions and results to objectives** - discuss degree of achievement
+
+**5.2 Líneas de trabajo futuro:**
+- Future work that would add value
+- Justify how contribution can be used and in what fields
+
+### SMART Objectives Requirements
+
+ALL objectives (general and specific) MUST be SMART:
+
+| Criterion | Requirement | Example from this thesis |
+|-----------|-------------|-------------------------|
+| **S**pecific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
+| **M**easurable | Quantifiable success metric | "CER < 2%" |
+| **A**ttainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
+| **R**elevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
+| **T**ime-bound | Achievable in timeframe | "Un cuatrimestre" |
+
+### Citation and Reference Rules
+
+#### APA Format is MANDATORY
+
+Reference guide: https://bibliografiaycitas.unir.net/
+
+**In-text citations:**
+- Single author: (Du, 2020) or Du (2020)
+- Two authors: (Du & Li, 2020)
+- Three+ authors: (Du et al., 2020)
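+
+The three parenthetical cases above can be expressed as a small formatter. This is an illustrative sketch; `apa_in_text` is a hypothetical helper that covers only the parenthetical form, not the narrative form such as "Du (2020)".
+
+```python
+def apa_in_text(surnames: list[str], year: int) -> str:
+    """Format a parenthetical APA in-text citation from author surnames."""
+    if not surnames:
+        raise ValueError("at least one author surname is required")
+    if len(surnames) == 1:
+        names = surnames[0]
+    elif len(surnames) == 2:
+        names = " & ".join(surnames)
+    else:
+        names = f"{surnames[0]} et al."  # three or more authors
+    return f"({names}, {year})"
+
+
+print(apa_in_text(["Du"], 2020))
+print(apa_in_text(["Du", "Li"], 2020))
+print(apa_in_text(["Du", "Li", "Guo"], 2020))
+```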
+
+**Reference list examples:**
+```
+# Journal article with DOI
+Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
+  for image-based sequence recognition. IEEE Transactions on Pattern
+  Analysis and Machine Intelligence, 39(11), 2298-2304.
+  https://doi.org/10.1109/TPAMI.2016.2646371
+
+# Conference paper
+Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
+  A next-generation hyperparameter optimization framework. Proceedings
+  of the 25th ACM SIGKDD, 2623-2631.
+  https://doi.org/10.1145/3292500.3330701
+
+# arXiv preprint
+Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
+  lightweight OCR system. arXiv preprint arXiv:2009.09941.
+  https://arxiv.org/abs/2009.09941
+
+# Software/GitHub repository
+PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
+  on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR
+
+# Book
+Cohen, J. (1988). Statistical power analysis for the behavioral sciences
+  (2nd ed.). Lawrence Erlbaum Associates.
+```
+
+#### Reference Rules
+- **NO Wikipedia citations**
+- Include variety: books, conferences, journal articles (not just URLs)
+- All cited references must appear in reference list
+- All references in list must be cited in text
+- Order alphabetically by first author's surname
+- Include DOI or URL when available
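+
+The consistency rules above (every citation listed, every entry cited, alphabetical order) can be checked with set operations. The sketch below assumes each reference is identified by a key such as `Akiba2019`; those keys and the `check_references` helper are hypothetical. The comparison uses plain case-insensitive ordering, which does not handle accented surnames the way a Spanish collation would.
+
+```python
+def check_references(cited: set[str], listed: list[str]) -> dict[str, list[str]]:
+    """Compare keys cited in the text against the ordered reference list."""
+    listed_set = set(listed)
+    return {
+        "cited_but_not_listed": sorted(cited - listed_set),
+        "listed_but_not_cited": sorted(listed_set - cited),
+        # Entries that sort before the entry directly above them.
+        "out_of_order": [
+            b for a, b in zip(listed, listed[1:]) if b.casefold() < a.casefold()
+        ],
+    }
+
+
+report = check_references(
+    cited={"Akiba2019", "Du2020"},
+    listed=["Du2020", "Akiba2019", "Cohen1988"],
+)
+print(report)
+```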
+
+### Document Formatting Rules
+
+#### Page Setup
+| Element | Specification |
+|---------|--------------|
+| Page size | A4 |
+| Left margin | 3.0 cm |
+| Right margin | 2.0 cm |
+| Top/Bottom margins | 2.5 cm |
+| Header | Student name + TFE title |
+| Footer | Page number |
+
+#### Typography
+| Element | Format |
+|---------|--------|
+| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
+| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
+| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
+| Título 3 | Calibri Light 12, justified, 1.5 spacing |
+| Footnotes | Calibri 10, justified, single spacing |
+| Code | Can reduce to 9pt if needed |
+
+#### Tables and Figures (from plantilla_individual.pdf)
+
+**Table format example:**
+```
+Tabla 1. Ejemplo de tabla con sus principales elementos.
+[TABLE CONTENT]
+Fuente: American Psychological Association, 2020a.
+```
+
+**Figure format example:**
+```
+Figura 1. Ejemplo de figura realizada para nuestro trabajo.
+[FIGURE]
+Fuente: American Psychological Association, 2020b.
+```
+
+**Rules:**
+- **Title position**: Above the table/figure
+- **Numbering format**: "**Tabla 1.**" / "**Figura 1.**" (Calibri 12, bold)
+- **Title text**: Calibri 12, italic (after the number)
+- **Source**: Below, centered, format "Fuente: Author, Year."
+- Can reduce font to 9pt for dense tables
+- Can use landscape orientation for large tables
+- Tables should have horizontal lines only (no vertical lines) per APA style
+
+### Writing Style Rules
+
+#### MUST DO:
+- Each chapter starts with introductory paragraph explaining content
+- Each paragraph has at least 3 sentences
+- Verify originality (cite all sources)
+- Check spelling with Word corrector
+- Ensure logical flow between paragraphs
+- Define concepts and include pertinent citations
+
+#### MUST NOT DO:
+- Two consecutive headings without text between them
+- Superfluous phrases and repetition of ideas
+- Short paragraphs (fewer than 3 sentences)
+- Missing figure/table numbers or titles
+- Broken index generation
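+
+The "two consecutive headings" rule is straightforward to lint in the Markdown chapters. The following is a rough sketch with a hypothetical helper name; it treats any line starting with `#` as a heading, so `#` comments inside fenced code blocks would be misreported.
+
+```python
+def consecutive_headings(markdown: str) -> list[tuple[str, str]]:
+    """Return pairs of headings that have no text between them."""
+    pairs = []
+    previous_heading = None
+    for line in markdown.splitlines():
+        stripped = line.strip()
+        if not stripped:
+            continue  # blank lines do not count as text
+        if stripped.startswith("#"):
+            if previous_heading is not None:
+                pairs.append((previous_heading, stripped))
+            previous_heading = stripped
+        else:
+            previous_heading = None  # body text resets the check
+    return pairs
+
+
+doc = "# A\n\n## B\n\nText.\n\n## C\n"
+print(consecutive_headings(doc))
+```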
+
+### Annexes Requirements
+
+**Anexo A - Código fuente y datos:**
+- Include repository URL where code is hosted
+- Student must be sole author and owner of repository
+- No commits from other users
+- Data used should also be in repository
+- If confidential (company project), justify why not shared
+
+### Final Submission
+
+- **Drafts**: Submit in Word format
+- **Final deposit**: Submit in PDF format
+- Verify all indices generate correctly before final submission
+
+---
+
+## Guidelines for Claude
+
+### CRITICAL: Academic Rigor Requirements
+
+**This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.**
+
+#### DO NOT:
+- **NEVER fabricate data or statistics** - Every number must come from an actual file in this repository
+- **NEVER invent comparison results** - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
+- **NEVER assume or estimate values** - If a metric isn't in the CSV/notebook, don't include it
+- **NEVER extrapolate beyond what the data shows** - 24 pages is a limited dataset, acknowledge this
+- **NEVER claim results that weren't measured** - Only report what was actually computed
+
+#### ALWAYS:
+- **Read the source file first** before citing any result
+- **Quote exact values** from CSV files (e.g., CER 0.011535 not "approximately 1%")
+- **Reference the specific file and location** for every data point
+- **Acknowledge limitations** explicitly (dataset size, CPU-only, single document type)
+- **Distinguish between measured results and interpretations**
+
+#### Data Sources (ONLY use these):
+| Data Type | Source File |
+|-----------|-------------|
+| Ray Tune 64 trials | `src/raytune_paddle_subproc_results_20251207_192320.csv` |
+| Experiment code | `src/paddle_ocr_fine_tune_unir_raytune.ipynb` |
+| Final comparison | Output cells in the notebook (baseline vs optimized) |
+
+#### Example of WRONG vs RIGHT:
+
+**WRONG:** "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER"
+(We don't have this comparison data in our results files)
+
+**RIGHT:** "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in `paddle_ocr_fine_tune_unir_raytune.ipynb`)"
+
+**WRONG:** "The optimization improved results by approximately 80%"
+
+**RIGHT:** "From the 64 trials in `raytune_paddle_subproc_results_20251207_192320.csv`, minimum CER achieved was 1.15%"
+
+### When Working on Documentation
+
+1. **Read UNIR guidelines first**: Check `instructions/instrucciones.pdf` for structure requirements
+
+2. **Follow chapter structure**: Each chapter has specific content requirements per UNIR guidelines
+
+3. **References are UNIFIED**: All references go in `docs/06_referencias_bibliograficas.md`, NOT per-chapter
+
+4. **Use APA format**: All citations must follow APA style
+
+5. **Include "Fuentes de datos"**: Each chapter should list which repository files the data came from
+
+6. **Language**: Documentation is in Spanish (thesis requirement), code comments in English
+
+7. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation
+
+8. **When in doubt, ask**: If the user requests data that doesn't exist, ask rather than inventing numbers
+
+9. **DIAGRAMS MUST BE IN MERMAID FORMAT**: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:
+   - Version control friendly (text-based)
+   - Consistent styling across all chapters
+   - Easy to edit and maintain
+   - Renders properly in GitHub and most Markdown viewers
+
+   **Supported Mermaid diagram types:**
+   - `flowchart` / `graph` - For pipelines, workflows, architectures
+   - `xychart-beta` - For bar charts, comparisons
+   - `sequenceDiagram` - For process interactions
+   - `classDiagram` - For class structures
+   - `stateDiagram` - For state machines
+   - `pie` - For proportional data
+
+   **Example:**
+   ```mermaid
+   flowchart LR
+       A[Input] --> B[Process] --> C[Output]
+   ```
+
+### Common Tasks
+
+- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
+- **Updating documentation**: Edit files in `docs/`
+- **Adding references**: Add to `docs/06_referencias_bibliograficas.md` (unified list)
+- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
+- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI
+
+---
+
+## Experiment Details
+
+### Ray Tune Configuration
+```python
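+from ray import tune
+from ray.tune.search.optuna import OptunaSearch
+
+# trainable_paddle_ocr is the trial function defined in the experiment notebook
+tuner = tune.Tuner(
+    trainable_paddle_ocr,
+    tune_config=tune.TuneConfig(
+        metric="CER",                # minimize Character Error Rate
+        mode="min",
+        search_alg=OptunaSearch(),   # Optuna proposes each trial's configuration
+        num_samples=64,              # total number of trials
+        max_concurrent_trials=2
+    )
+)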
+
+### Dataset
+- Source: UNIR TFE instructions PDF
+- Pages: 24
+- Resolution: 300 DPI
+- Ground truth: Extracted via PyMuPDF
+
+### Metrics
+- CER (Character Error Rate) - Primary metric
+- WER (Word Error Rate) - Secondary metric
+- Calculated using the `jiwer` library
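+
+For reference, both metrics are edit-distance ratios: the Levenshtein distance between reference and hypothesis, divided by the reference length, counted in characters for CER and in words for WER. The sketch below is a plain-Python illustration of those definitions, not the `jiwer` implementation; results may differ from `jiwer` depending on the text normalization applied.
+
+```python
+def levenshtein(a, b) -> int:
+    """Edit distance between two sequences (strings or lists of words)."""
+    previous = list(range(len(b) + 1))
+    for i, x in enumerate(a, start=1):
+        current = [i]
+        for j, y in enumerate(b, start=1):
+            current.append(min(
+                previous[j] + 1,             # deletion
+                current[j - 1] + 1,          # insertion
+                previous[j - 1] + (x != y),  # substitution, or match at cost 0
+            ))
+        previous = current
+    return previous[-1]
+
+
+def cer(reference: str, hypothesis: str) -> float:
+    """Character Error Rate relative to the reference length."""
+    if not reference:
+        raise ValueError("reference must not be empty")
+    return levenshtein(reference, hypothesis) / len(reference)
+
+
+def wer(reference: str, hypothesis: str) -> float:
+    """Word Error Rate relative to the number of reference words."""
+    ref_words = reference.split()
+    if not ref_words:
+        raise ValueError("reference must contain at least one word")
+    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
+
+
+print(cer("documento", "dokumento"))
+print(wer("el texto final", "el texto"))
+```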