Review and validate the documentation for this Master's Thesis project.

## Instructions

1. **Read metrics source files first** to get the correct values:
   - `docs/metrics/metrics_paddle.md` - PaddleOCR results
   - `docs/metrics/metrics_doctr.md` - DocTR results
   - `docs/metrics/metrics_easyocr.md` - EasyOCR results
   - `docs/metrics/metrics.md` - Comparative summary
   - `src/results/*.csv` - Raw data from 64 trials per service (5-page tuning subset)
   - `src/*/requirements.txt` - Dependency versions used for the experiments

2. **Review UNIR guidelines** for formatting and structure rules:
   - **`instructions/plantilla_individual.htm`** - **PRIMARY REFERENCE** for all styling (CSS classes, Word styles)
   - **`instructions/plantilla_individual_files/`** - Support files with additional style definitions
   - `instructions/instrucciones.pdf` - TFE writing instructions
   - `instructions/plantilla_individual.pdf` - Official template preview

   **IMPORTANT:** When styling elements (tables, figures, notes, quotes), ALWAYS check `plantilla_individual.htm` for existing Word/CSS classes (e.g., `MsoQuote`, `MsoCaption`, `Piedefoto-tabla`). Use these classes instead of custom inline styles.

### UNIR Color Palette (from plantilla_individual.htm)

| Color | Hex | Usage |
|-------|-----|-------|
| Primary Blue | `#0098CD` | Headings, titles, diagram borders |
| Light Blue BG | `#E6F4F9` | Backgrounds, callout boxes, nodes |
| Dark Gray | `#404040` | Primary text |
| Accent Blue | `#5B9BD5` | Table headers, accent elements |
| Light Accent | `#9CC2E5` | Table borders |
| Very Light Blue | `#DEEAF6` | Secondary backgrounds, subgraphs |
| White | `#FFFFFF` | Header text, contrast |

### Table Styles (from template)

- `MsoTableGrid` - Basic grid table
- `MsoTable15Grid4Accent1` - Styled table with UNIR colors (header: `#5B9BD5`, borders: `#9CC2E5`)
- `Piedefoto-tabla` - Table caption/source style

3. 
**Validate each documentation file** checking:

### Data Accuracy

- All CER/WER values must match those in `docs/metrics/*.md`
- Verify: baseline, optimized, best trial, percentage improvement
- Verify: GPU vs CPU acceleration factor
- Verify: dataset size (pages)

### UNIR Formatting

- Tables: `**Tabla N.** *Descriptive title in italics.*` followed by table, then a line that starts with `Fuente:` immediately after the table (no blank lines), e.g., `Fuente: ...`
- Table titles must describe the content (e.g., "Comparación de modelos OCR")
- Figures: `**Figura N.** *Descriptive title in italics.*`
- Figure titles must describe the content (e.g., "Pipeline de un sistema OCR moderno")
- Sequential numbering (no duplicates, no gaps)
- APA citation format for references

### Word Generation Alignment

- Table sources are only captured when the line **immediately after** the table starts with `Fuente:` (per `apply_content.py`).
- Mermaid figures use the YAML `title:` for captions in Word output; `**Figura N.**` lines are ignored by the generator but should remain for UNIR compliance.
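The caption, source-line, and numbering rules above can be checked mechanically. The following is a minimal sketch, not the logic of `apply_content.py`: the helper names are hypothetical, and it assumes that any line starting with `|` is a table row and that captions follow the exact `**Tabla N.** *...*` / `**Figura N.** *...*` pattern.

```python
import re

# Hypothetical helpers for illustration; they are NOT taken from apply_content.py.
# Assumed caption shape: **Tabla N.** *Title in italics.*
CAPTION_RE = re.compile(r"^\*\*(Tabla|Figura) (\d+)\.\*\* \*.+\*\s*$")


def is_table_row(line: str) -> bool:
    """Assumption: any line that starts with a pipe belongs to a Markdown table."""
    return line.lstrip().startswith("|")


def find_tables_missing_source(lines: list[str]) -> list[int]:
    """Return the 1-based line number of the last row of every table that is
    not immediately followed by a line starting with 'Fuente:'."""
    missing = []
    for i, line in enumerate(lines):
        if not is_table_row(line):
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if is_table_row(next_line):
            continue  # still inside the same table
        if not next_line.startswith("Fuente:"):
            missing.append(i + 1)
    return missing


def find_numbering_problems(lines: list[str], kind: str) -> list[str]:
    """Report duplicates and gaps in the caption numbers for 'Tabla' or 'Figura'."""
    problems = []
    previous = 0
    for line in lines:
        match = CAPTION_RE.match(line.strip())
        if not match or match.group(1) != kind:
            continue
        number = int(match.group(2))
        if number != previous + 1:
            problems.append(f"{kind} {number} follows {kind} {previous}")
        previous = number
    return problems
```

To check numbering across the whole thesis rather than one file, concatenate the lines of the `docs/*.md` files in order before calling `find_numbering_problems`. A blank line between a table and its `Fuente:` line is reported as missing, which mirrors the "no blank lines" rule stated above.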
### Mermaid Diagrams

- **All diagrams must be in Mermaid format** (no external images for flowcharts/charts)
- All Mermaid diagrams must use the UNIR color theme
- Required YAML frontmatter config (Mermaid v11+):

  ```mermaid
  ---
  title: "Diagram Title"
  config:
    theme: base
    themeVariables:
      primaryColor: "#E6F4F9"
      primaryTextColor: "#404040"
      primaryBorderColor: "#0098CD"
      lineColor: "#0098CD"
  ---
  flowchart LR
      A[Node] --> B[Node]
  ```

- Colors: `#0098CD` (UNIR blue for borders/lines), `#E6F4F9` (light blue background)
- All diagrams must have a descriptive `title:` in YAML frontmatter
- Titles MUST be quoted: `title: "Descriptive Title"` (not `title: Descriptive Title`)
- Titles should describe the diagram content, not generic "Diagrama N"
- Verify theme is applied to all diagrams in `docs/*.md`

**Note on Bar Charts (`xychart-beta`):**

- Bar chart colors are **automatically converted to UNIR blue** (`#0098CD`) during figure generation
- The `xyChart.plotColorPalette` config in YAML frontmatter does NOT work reliably with mmdc
- Instead, `generate_mermaid_figures.py` post-processes SVG to replace default colors (`#ECECFF`, `#FFF4DD`)
- No manual color configuration needed in xychart-beta blocks - they will be styled automatically

### Files to Review

- `docs/00_resumen.md` - Resumen/Abstract
- `docs/01_introduccion.md` - Introducción
- `docs/02_contexto_estado_arte.md` - Contexto y estado del arte
- `docs/03_objetivos_metodologia.md` - Objetivos y metodología
- `docs/04_desarrollo_especifico.md` - Desarrollo específico (resultados)
- `docs/05_conclusiones_trabajo_futuro.md` - Conclusiones y trabajo futuro
- `docs/06_referencias_bibliograficas.md` - Referencias
- `docs/07_anexo_a.md` - Anexo técnico
- `README.md` - Project overview

4. **Report findings** with:
   - List of incorrect values found (with file:line references)
   - Formatting issues detected
   - Specific corrections needed
   - Overall documentation health assessment

5. **Language**: All docs/* files must be in Spanish.
README.md and CLAUDE.md can be in English.

6. **Audit Run (repeatable process)**:
   - Validate each Mermaid diagram that contains numbers against its stated source (CSV or metrics file).
   - Confirm every figure/table that includes metrics has a valid `*Fuente:*` line pointing to:
     - `src/results/*.csv`, `src/results/correlations/*.csv`, or `docs/metrics/*.md`, or
     - External sources listed in `docs/07_anexo_a.md`.
   - Record any missing or mismatched sources before making edits.

## Writing Style (Required)

- Use fluent Spanish with standard punctuation; avoid long dashes.
- Prefer commas, semicolons, or short sentences over em dashes.
- Keep paragraphs concise and clear; avoid overly long sentences.

## Data Integrity (Required)

- Do not invent or estimate values. Every numeric claim must be sourced from `src/results/*.csv`, `docs/metrics/*.md`, or external documentation explicitly listed in `docs/07_anexo_a.md`.
- If a value is not present in those sources, remove it or mark it as unknown and request clarification.
- Source of truth for OCR metrics in `docs/00-07`: use `docs/metrics/*.md` for both "Resultados del Subconjunto de Ajuste" and "Evaluación del Dataset Completo", and `src/results/*.csv` for tuning subset values referenced by those sections.

## CSV Verification (Required)

Use the CSVs to validate best-trial values and to confirm that tuning-only figures are not confused with full-dataset results.

### Interpretation Rules

- The CSVs are from tuning on pages 5-10, not the full 45-page dataset.
- Values like "best trial CER" and "best trial WER" must match the CSVs.
- Full-dataset metrics must be sourced elsewhere and clearly labeled as full evaluation.
- `src/raytune_paddle_subproc_results_20251207_192320.csv` is a CPU-only timing reference; do not use it for accuracy claims.
- GPU results are the primary research driver. CPU results are only used to illustrate timing without GPU.
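
As an illustration of the CSV check described in this section, the sketch below picks the best trial from a tuning CSV and compares it with a documented value. It rests on assumptions that must be confirmed against the real files: "best trial" is taken to mean the row with the lowest value of the chosen metric, and the metric column name (for example `CER`) is hypothetical, since the actual CSV headers are not listed here.

```python
import csv
import math
from typing import Iterable, Mapping


def best_trial(rows: Iterable[Mapping[str, str]], metric: str) -> dict:
    """Return the row with the lowest value of `metric`.

    Rows where the metric is missing, blank, non-numeric, or NaN are skipped.
    Assumption: lower is better, which holds for error rates such as CER and WER.
    """
    best = None
    best_value = math.inf
    for row in rows:
        try:
            value = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(value):
            continue
        if value < best_value:
            best, best_value = dict(row), value
    if best is None:
        raise ValueError(f"no numeric values found in column {metric!r}")
    return best


def best_trial_from_csv(path: str, metric: str) -> dict:
    """Load a CSV with a header row and return its best trial for `metric`."""
    with open(path, newline="", encoding="utf-8") as handle:
        return best_trial(csv.DictReader(handle), metric)


def matches_documented(csv_value: float, documented: float, decimals: int) -> bool:
    """True when the CSV value, rounded to the precision used in the docs,
    equals the documented value."""
    tolerance = 0.5 * 10 ** (-decimals)
    return abs(round(csv_value, decimals) - documented) < tolerance
```

A typical use is to call `best_trial_from_csv` on each file matched by `src/results/*.csv`, read the metric from the returned row, and pass it to `matches_documented` together with the value quoted in the documentation and the number of decimals shown there. Any mismatch is reported with its file:line reference rather than corrected by estimation.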