.claude/commands/documentation-review.md

Review and validate the documentation for this Master's Thesis project.

## Instructions

1. **Read metrics source files first** to get the correct values:
   - `docs/metrics/metrics_paddle.md` - PaddleOCR results
   - `docs/metrics/metrics_doctr.md` - DocTR results
   - `docs/metrics/metrics_easyocr.md` - EasyOCR results
   - `docs/metrics/metrics.md` - Comparative summary
   - `src/results/*.csv` - Raw data from 64 trials per service (5-page tuning subset)
   - `src/*/requirements.txt` - Dependency versions used for the experiments

2. **Review UNIR guidelines** for formatting and structure rules:
   - **`instructions/plantilla_individual.htm`** - **PRIMARY REFERENCE** for all styling (CSS classes, Word styles)
   - **`instructions/plantilla_individual_files/`** - Support files with additional style definitions
   - `instructions/instrucciones.pdf` - TFE writing instructions
   - `instructions/plantilla_individual.pdf` - Official template preview

   **IMPORTANT:** When styling elements (tables, figures, notes, quotes), ALWAYS check `plantilla_individual.htm` for existing Word/CSS classes (e.g., `MsoQuote`, `MsoCaption`, `Piedefoto-tabla`). Use these classes instead of custom inline styles.

### UNIR Color Palette (from plantilla_individual.htm)

| Color | Hex | Usage |
|-------|-----|-------|
| Primary Blue | `#0098CD` | Headings, titles, diagram borders |
| Light Blue BG | `#E6F4F9` | Backgrounds, callout boxes, nodes |
| Dark Gray | `#404040` | Primary text |
| Accent Blue | `#5B9BD5` | Table headers, accent elements |
| Light Accent | `#9CC2E5` | Table borders |
| Very Light Blue | `#DEEAF6` | Secondary backgrounds, subgraphs |
| White | `#FFFFFF` | Header text, contrast |

### Table Styles (from template)
- `MsoTableGrid` - Basic grid table
- `MsoTable15Grid4Accent1` - Styled table with UNIR colors (header: `#5B9BD5`, borders: `#9CC2E5`)
- `Piedefoto-tabla` - Table caption/source style

3. **Validate each documentation file**, checking:

### Data Accuracy
- All CER/WER values must match those in `docs/metrics/*.md`
- Verify: baseline, optimized, best trial, percentage improvement
- Verify: GPU vs CPU acceleration factor
- Verify: dataset size (pages)
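
When re-deriving the percentage improvement and the GPU vs CPU acceleration factor, a small helper keeps the arithmetic explicit. The sketch below assumes that "improvement" means the relative reduction of an error rate against its baseline and that the acceleration factor is CPU time divided by GPU time. Both definitions are assumptions; if the thesis reports absolute percentage points instead, adjust accordingly. The function names are hypothetical.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Percentage reduction of an error rate (CER or WER) relative to its baseline."""
    # Assumption: improvement is relative, not a difference in percentage points.
    return (baseline - optimized) / baseline * 100


def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """How many times faster the GPU run is than the CPU run."""
    # Assumption: both timings cover the same pages and the same configuration.
    return cpu_seconds / gpu_seconds
```

Compare the computed value with the documented one after rounding both to the precision used in `docs/metrics/*.md`.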

### UNIR Formatting
- Tables: `**Tabla N.** *Descriptive title in italics.*` followed by the table, then a line starting with `Fuente:` directly after the last table row (no blank line in between)
  - Table titles must describe the content (e.g., "Comparación de modelos OCR")
- Figures: `**Figura N.** *Descriptive title in italics.*`
  - Figure titles must describe the content (e.g., "Pipeline de un sistema OCR moderno")
- Sequential numbering (no duplicates, no gaps)
- APA citation format for references
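
The sequential-numbering rule can be checked mechanically. The following is a minimal sketch, assuming captions always start a line with `**Tabla N.**` or `**Figura N.**` and that numbering runs continuously across the concatenated `docs/*.md` files; the function name `numbering_issues` is hypothetical.

```python
import re

# Matches caption lines such as "**Tabla 3.** *Title.*" or "**Figura 12.** *Title.*".
CAPTION_RE = re.compile(r"^\*\*(Tabla|Figura) (\d+)\.\*\*", re.MULTILINE)


def numbering_issues(text: str) -> dict[str, dict[str, list[int]]]:
    """Return duplicated and missing caption numbers for tables and figures."""
    seen: dict[str, list[int]] = {"Tabla": [], "Figura": []}
    for kind, number in CAPTION_RE.findall(text):
        seen[kind].append(int(number))

    report = {}
    for kind, numbers in seen.items():
        # A number that appears more than once is a duplicate.
        duplicates = sorted({n for n in numbers if numbers.count(n) > 1})
        # Every number from 1 up to the highest one seen should be present.
        expected = set(range(1, max(numbers, default=0) + 1))
        gaps = sorted(expected - set(numbers))
        report[kind] = {"duplicates": duplicates, "gaps": gaps}
    return report
```

If numbering restarts per chapter in this project, run the check per file instead of on the concatenated text.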

### Word Generation Alignment
- Table sources are only captured when the line **immediately after** the table starts with `Fuente:` (per `apply_content.py`).
- Mermaid figures use the YAML `title:` for captions in Word output; `**Figura N.**` lines are ignored by the generator but should remain for UNIR compliance.
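
A quick way to find tables whose source would be dropped is to look at the line directly after each table's last row. This is only a sketch: it assumes Markdown pipe tables whose rows start with `|`, it does not skip tables inside fenced code blocks, and it does not reproduce the actual parsing in `apply_content.py`, which is not shown here. The function name is hypothetical.

```python
def tables_missing_source(text: str) -> list[int]:
    """Return 1-based line numbers of final table rows not followed by `Fuente:`."""
    lines = text.splitlines()
    missing = []
    for i, line in enumerate(lines):
        is_row = line.lstrip().startswith("|")
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        next_is_row = next_line.lstrip().startswith("|")
        # The last row of a table is a row whose next line is not a row.
        if is_row and not next_is_row and not next_line.startswith("Fuente:"):
            missing.append(i + 1)
    return missing
```

A blank line between the table and `Fuente:` is reported as missing, which mirrors the rule above.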

### Mermaid Diagrams
- **All diagrams must be in Mermaid format** (no external images for flowcharts/charts)
- All Mermaid diagrams must use the UNIR color theme
- Required YAML frontmatter config (Mermaid v11+):
  ```mermaid
  ---
  title: "Diagram Title"
  config:
    theme: base
    themeVariables:
      primaryColor: "#E6F4F9"
      primaryTextColor: "#404040"
      primaryBorderColor: "#0098CD"
      lineColor: "#0098CD"
  ---
  flowchart LR
      A[Node] --> B[Node]
  ```
- Colors: `#0098CD` (UNIR blue for borders/lines), `#E6F4F9` (light blue background)
- All diagrams must have a descriptive `title:` in YAML frontmatter
- Titles MUST be quoted: `title: "Descriptive Title"` (not `title: Descriptive Title`)
- Titles should describe the diagram content, not generic "Diagrama N"
- Verify theme is applied to all diagrams in `docs/*.md`
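
The theme and title rules above lend themselves to a simple text check. The sketch below uses plain string and regex matching rather than a YAML parser, so it assumes the frontmatter is written exactly as in the required config (same key names, double-quoted values). The names `mermaid_issues` and `REQUIRED_SNIPPETS` are hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
MERMAID_BLOCK_RE = re.compile(FENCE + r"mermaid\n(.*?)" + FENCE, re.DOTALL)
QUOTED_TITLE_RE = re.compile(r'^\s*title:\s*"[^"]+"\s*$', re.MULTILINE)
REQUIRED_SNIPPETS = (
    "theme: base",
    'primaryColor: "#E6F4F9"',
    'primaryTextColor: "#404040"',
    'primaryBorderColor: "#0098CD"',
    'lineColor: "#0098CD"',
)


def mermaid_issues(markdown: str) -> list[tuple[int, str]]:
    """Return (block_index, problem) pairs for Mermaid blocks that break the rules."""
    issues = []
    for index, match in enumerate(MERMAID_BLOCK_RE.finditer(markdown), start=1):
        block = match.group(1)
        # The title must exist and must be wrapped in double quotes.
        if not QUOTED_TITLE_RE.search(block):
            issues.append((index, "missing or unquoted title"))
        # Each theme setting must appear verbatim in the frontmatter.
        for snippet in REQUIRED_SNIPPETS:
            if snippet not in block:
                issues.append((index, f"missing {snippet}"))
    return issues
```

This does not judge whether a title is descriptive; generic titles such as "Diagrama N" still need a manual read.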

**Note on Bar Charts (`xychart-beta`):**
- Bar chart colors are **automatically converted to UNIR blue** (`#0098CD`) during figure generation
- The `xyChart.plotColorPalette` config in YAML frontmatter does NOT work reliably with mmdc
- Instead, `generate_mermaid_figures.py` post-processes SVG to replace default colors (`#ECECFF`, `#FFF4DD`)
- No manual color configuration needed in xychart-beta blocks - they will be styled automatically
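
To illustrate the kind of post-processing described above, here is a minimal sketch of a color replacement over SVG text. It is not the code in `generate_mermaid_figures.py`, which is not shown here; the function name `recolor_svg` and the case-insensitive matching are assumptions.

```python
import re

UNIR_BLUE = "#0098CD"
DEFAULT_COLORS = ("#ECECFF", "#FFF4DD")  # Mermaid defaults named in the note above


def recolor_svg(svg: str) -> str:
    """Replace the default colors with UNIR blue, ignoring letter case."""
    pattern = re.compile("|".join(re.escape(c) for c in DEFAULT_COLORS), re.IGNORECASE)
    return pattern.sub(UNIR_BLUE, svg)
```

When auditing a generated figure, searching the SVG for either default color is a quick way to confirm the replacement ran.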

### Files to Review
- `docs/00_resumen.md` - Resumen/Abstract
- `docs/01_introduccion.md` - Introducción
- `docs/02_contexto_estado_arte.md` - Contexto y estado del arte
- `docs/03_objetivos_metodologia.md` - Objetivos y metodología
- `docs/04_desarrollo_especifico.md` - Desarrollo específico (resultados)
- `docs/05_conclusiones_trabajo_futuro.md` - Conclusiones y trabajo futuro
- `docs/06_referencias_bibliograficas.md` - Referencias
- `docs/07_anexo_a.md` - Anexo técnico
- `README.md` - Project overview

4. **Report findings** with:
   - List of incorrect values found (with file:line references)
   - Formatting issues detected
   - Specific corrections needed
   - Overall documentation health assessment

5. **Language**: All docs/* files must be in Spanish. README.md and CLAUDE.md can be in English.

6. **Audit Run (repeatable process)**:
   - Validate each Mermaid diagram that contains numbers against its stated source (CSV or metrics file).
   - Confirm every figure/table that includes metrics has a valid `Fuente:` line pointing to:
     - `src/results/*.csv`, `src/results/correlations/*.csv`, or `docs/metrics/*.md`, or
     - External sources listed in `docs/07_anexo_a.md`.
   - Record any missing or mismatched sources before making edits.

## Writing Style (Required)

- Use fluent Spanish with standard punctuation.
- Avoid em dashes; prefer commas, semicolons, or short sentences.
- Keep paragraphs concise and clear; avoid overly long sentences.
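
The dash rule is easy to check automatically. A minimal sketch, assuming that "long dashes" means the em dash (U+2014) and the horizontal bar (U+2015); the function name is hypothetical.

```python
def long_dash_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for each line that contains a long dash."""
    long_dashes = ("\u2014", "\u2015")  # em dash, horizontal bar
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if any(dash in line for dash in long_dashes)
    ]
```

The returned line numbers can go straight into the findings report as file:line references.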

## Data Integrity (Required)

- Do not invent or estimate values. Every numeric claim must be sourced from `src/results/*.csv`, `docs/metrics/*.md`, or external documentation explicitly listed in `docs/07_anexo_a.md`.
- If a value is not present in those sources, remove it or mark it as unknown and request clarification.
- Source of truth for OCR metrics in `docs/00-07`: use `docs/metrics/*.md` for both "Resultados del Subconjunto de Ajuste" and "Evaluación del Dataset Completo", and `src/results/*.csv` for tuning subset values referenced by those sections.

## CSV Verification (Required)

Use the CSVs to validate best-trial values and to confirm that tuning-only figures are not confused with full-dataset results.


### Interpretation Rules

- The CSVs are from tuning on pages 5-10, not the full 45-page dataset.
- Values like “best trial CER” and “best trial WER” must match the CSVs.
- Full-dataset metrics must be sourced elsewhere and clearly labeled as full evaluation.
- `src/raytune_paddle_subproc_results_20251207_192320.csv` is a CPU-only timing reference; do not use it for accuracy claims.
- GPU results are the primary research driver. CPU results are only used to illustrate timing without GPU.
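
The best-trial check can be sketched with the standard `csv` module. The column names `cer` and `wer` are assumptions, since the CSV headers are not listed here; replace them with the real headers. The sketch also assumes that the best trial is the row with the lowest value of the chosen metric. If the documentation reports the WER of the best-CER trial instead, read `wer` from the row returned for `cer`.

```python
import csv
import io


def best_trial(csv_text: str, metric: str = "cer") -> dict[str, str]:
    """Return the row with the lowest value in `metric` (lower error is better)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return min(rows, key=lambda row: float(row[metric]))


def matches_documented(csv_text: str, documented: float, metric: str = "cer",
                       tolerance: float = 5e-5) -> bool:
    """Check a documented best-trial value against the CSV, allowing for rounding."""
    best = float(best_trial(csv_text, metric)[metric])
    return abs(best - documented) <= tolerance
```

Pass the file contents with `Path(...).read_text()` and choose `tolerance` to match the number of decimals used in the documentation.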

Documentation review and data consistency. 2026-01-24 15:53:34 +01:00