MastersThesis/.claude/commands/documentation-review.md
sergio 0089b34cb3
Documentation review and data consistency.
2026-01-24 15:53:34 +01:00

Review and validate the documentation for this Master's Thesis project.

Instructions

  1. Read the metrics source files first to get the correct values:

    • docs/metrics/metrics_paddle.md - PaddleOCR results
    • docs/metrics/metrics_doctr.md - DocTR results
    • docs/metrics/metrics_easyocr.md - EasyOCR results
    • docs/metrics/metrics.md - Comparative summary
    • src/results/*.csv - Raw data from 64 trials per service (5-page tuning subset)
    • src/*/requirements.txt - Dependency versions used for the experiments
  2. Review UNIR guidelines for formatting and structure rules:

    • instructions/plantilla_individual.htm - PRIMARY REFERENCE for all styling (CSS classes, Word styles)
    • instructions/plantilla_individual_files/ - Support files with additional style definitions
    • instructions/instrucciones.pdf - TFE writing instructions
    • instructions/plantilla_individual.pdf - Official template preview

    IMPORTANT: When styling elements (tables, figures, notes, quotes), ALWAYS check plantilla_individual.htm for existing Word/CSS classes (e.g., MsoQuote, MsoCaption, Piedefoto-tabla). Use these classes instead of custom inline styles.

UNIR Color Palette (from plantilla_individual.htm)

| Color | Hex | Usage |
|---|---|---|
| Primary Blue | #0098CD | Headings, titles, diagram borders |
| Light Blue BG | #E6F4F9 | Backgrounds, callout boxes, nodes |
| Dark Gray | #404040 | Primary text |
| Accent Blue | #5B9BD5 | Table headers, accent elements |
| Light Accent | #9CC2E5 | Table borders |
| Very Light Blue | #DEEAF6 | Secondary backgrounds, subgraphs |
| White | #FFFFFF | Header text, contrast |

Table Styles (from template)

  • MsoTableGrid - Basic grid table
  • MsoTable15Grid4Accent1 - Styled table with UNIR colors (header: #5B9BD5, borders: #9CC2E5)
  • Piedefoto-tabla - Table caption/source style
  3. Validate each documentation file, checking:

Data Accuracy

  • All CER/WER values must match those in docs/metrics/*.md
  • Verify: baseline, optimized, best trial, percentage improvement
  • Verify: GPU vs CPU acceleration factor
  • Verify: dataset size (pages)
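
When re-checking a reported percentage improvement, it can help to recompute it from the baseline and optimized values found in docs/metrics/*.md. The sketch below assumes the improvement is the relative reduction of the error rate; that definition, the function name, and the sample numbers are illustrative assumptions, not values or code from the project.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Percentage reduction of an error rate relative to its baseline.

    Assumption: "percentage improvement" means (baseline - optimized) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative numbers only; real values must come from docs/metrics/*.md.
    print(f"{relative_improvement(0.10, 0.04):.2f} %")
```

If the documentation defines the improvement differently (for example, in absolute percentage points), the formula should be adjusted to match before comparing.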

UNIR Formatting

  • Tables: **Tabla N.** *Descriptive title in italics.*, followed by the table, then a line that starts with Fuente: immediately after the table (no blank line in between).
    • Table titles must describe the content (e.g., "Comparación de modelos OCR")
  • Figures: **Figura N.** *Descriptive title in italics.*
    • Figure titles must describe the content (e.g., "Pipeline de un sistema OCR moderno")
  • Sequential numbering (no duplicates, no gaps)
  • APA citation format for references
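
The sequential-numbering rule can be checked mechanically. The following is a minimal sketch that assumes captions appear at the start of a line exactly as **Tabla N.** or **Figura N.**, and that the text passed in is the concatenation of the docs files in reading order; the function name is hypothetical.

```python
import re

# Captions are assumed to start a line, e.g. "**Tabla 3.** *Título.*"
CAPTION_RE = re.compile(r"^\*\*(Tabla|Figura) (\d+)\.\*\*", re.MULTILINE)


def numbering_issues(markdown: str) -> list[str]:
    """Report duplicated or missing caption numbers for tables and figures."""
    seen: dict[str, list[int]] = {"Tabla": [], "Figura": []}
    for kind, number in CAPTION_RE.findall(markdown):
        seen[kind].append(int(number))

    issues = []
    for kind, numbers in seen.items():
        if not numbers:
            continue
        duplicated = sorted({n for n in numbers if numbers.count(n) > 1})
        missing = sorted(set(range(1, max(numbers) + 1)) - set(numbers))
        if duplicated:
            issues.append(f"{kind}: duplicated {duplicated}")
        if missing:
            issues.append(f"{kind}: missing {missing}")
    return issues
```

An empty list means no duplicates or gaps were found for either caption kind.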

Word Generation Alignment

  • Table sources are only captured when the line immediately after the table starts with Fuente: (per apply_content.py).
  • Mermaid figures use the YAML title: for captions in Word output; **Figura N.** lines are ignored by the generator but should remain for UNIR compliance.
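
The Fuente: adjacency rule can be sketched as a small check. This is not the logic of apply_content.py, which is not shown here; it assumes Markdown pipe tables (rows starting with "|") and tolerates emphasis markers around Fuente:, both of which are assumptions.

```python
def tables_missing_source(markdown: str) -> list[int]:
    """Return 1-based line numbers of final table rows not followed by Fuente:."""
    lines = markdown.splitlines()
    missing = []
    for index, line in enumerate(lines):
        is_row = line.lstrip().startswith("|")
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        next_is_row = next_line.lstrip().startswith("|")
        if is_row and not next_is_row:
            # Assumption: "*Fuente: ...*" counts as starting with "Fuente:".
            if not next_line.lstrip("*_ ").startswith("Fuente:"):
                missing.append(index + 1)
    return missing
```

A blank line between the table and its Fuente: line is reported as missing, which mirrors the "no blank lines" requirement above.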

Mermaid Diagrams

  • All diagrams must be in Mermaid format (no external images for flowcharts/charts)
  • All Mermaid diagrams must use the UNIR color theme
  • Required YAML frontmatter config (Mermaid v11+):
    ---
    title: "Diagram Title"
    config:
      theme: base
      themeVariables:
        primaryColor: "#E6F4F9"
        primaryTextColor: "#404040"
        primaryBorderColor: "#0098CD"
        lineColor: "#0098CD"
    ---
    flowchart LR
        A[Node] --> B[Node]
    
  • Colors: #0098CD (UNIR blue for borders/lines), #E6F4F9 (light blue background)
  • All diagrams must have a descriptive title: in YAML frontmatter
  • Titles MUST be quoted: title: "Descriptive Title" (not title: Descriptive Title)
  • Titles should describe the diagram content, not generic "Diagrama N"
  • Verify theme is applied to all diagrams in docs/*.md
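
A rough way to verify these rules for one diagram is to inspect its frontmatter with regular expressions. The sketch below takes the diagram source as a string (extraction from the Markdown file is left out) and avoids a YAML parser on purpose; the function name and the messages are hypothetical.

```python
import re

# Theme variables required by the UNIR color theme described above.
REQUIRED_COLORS = {
    "primaryColor": "#E6F4F9",
    "primaryTextColor": "#404040",
    "primaryBorderColor": "#0098CD",
    "lineColor": "#0098CD",
}


def check_mermaid_source(source: str) -> list[str]:
    """Return a list of rule violations for a single Mermaid diagram source."""
    match = re.match(r"\s*---\n(.*?)\n\s*---\n", source, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    front = match.group(1)

    problems = []
    if not re.search(r'^\s*title:\s*"[^"]+"\s*$', front, re.MULTILINE):
        problems.append("title missing or not quoted")
    if not re.search(r"^\s*theme:\s*base\s*$", front, re.MULTILINE):
        problems.append("theme is not base")
    for key, value in REQUIRED_COLORS.items():
        pattern = rf'^\s*{key}:\s*"{value}"\s*$'
        if not re.search(pattern, front, re.MULTILINE | re.IGNORECASE):
            problems.append(f"{key} is not {value}")
    return problems
```

The check is deliberately strict about quoting, since unquoted titles are one of the failure cases named above. It does not judge whether a title is descriptive; that remains a manual review step.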

Note on Bar Charts (xychart-beta):

  • Bar chart colors are automatically converted to UNIR blue (#0098CD) during figure generation
  • The xyChart.plotColorPalette config in YAML frontmatter does NOT work reliably with mmdc
  • Instead, generate_mermaid_figures.py post-processes SVG to replace default colors (#ECECFF, #FFF4DD)
  • No manual color configuration needed in xychart-beta blocks - they will be styled automatically
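
The post-processing step amounts to a text substitution on the rendered SVG. The sketch below illustrates the idea only; it is not the code of generate_mermaid_figures.py, and the function name and the case-insensitive matching are assumptions.

```python
import re

# Default Mermaid colors named above, replaced with the UNIR blue.
DEFAULT_COLORS = ("#ECECFF", "#FFF4DD")
UNIR_BLUE = "#0098CD"


def recolor_svg(svg_text: str) -> str:
    """Replace the default Mermaid colors in an SVG string with UNIR blue."""
    for color in DEFAULT_COLORS:
        # Hex colors may be emitted in either case, so match case-insensitively.
        svg_text = re.sub(re.escape(color), UNIR_BLUE, svg_text, flags=re.IGNORECASE)
    return svg_text
```

When auditing a generated figure, searching the SVG for either default color is a quick way to confirm the replacement ran.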

Files to Review

  • docs/00_resumen.md - Resumen/Abstract
  • docs/01_introduccion.md - Introducción
  • docs/02_contexto_estado_arte.md - Contexto y estado del arte
  • docs/03_objetivos_metodologia.md - Objetivos y metodología
  • docs/04_desarrollo_especifico.md - Desarrollo específico (resultados)
  • docs/05_conclusiones_trabajo_futuro.md - Conclusiones y trabajo futuro
  • docs/06_referencias_bibliograficas.md - Referencias
  • docs/07_anexo_a.md - Anexo técnico
  • README.md - Project overview
  4. Report findings with:

    • List of incorrect values found (with file:line references)
    • Formatting issues detected
    • Specific corrections needed
    • Overall documentation health assessment
  5. Language: All docs/* files must be in Spanish. README.md and CLAUDE.md can be in English.

  6. Audit Run (repeatable process):

    • Validate each Mermaid diagram that contains numbers against its stated source (CSV or metrics file).
    • Confirm every figure/table that includes metrics has a valid *Fuente:* line pointing to:
      • src/results/*.csv, src/results/correlations/*.csv, or docs/metrics/*.md, or
      • External sources listed in docs/07_anexo_a.md.
    • Record any missing or mismatched sources before making edits.

Writing Style (Required)

  • Use fluent Spanish with standard punctuation; avoid long dashes.
  • Prefer commas, semicolons, or short sentences over em dashes.
  • Keep paragraphs concise and clear; avoid overly long sentences.
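
The dash rule is easy to check automatically. A minimal sketch, assuming only the em dash (U+2014) needs flagging; the function name is hypothetical.

```python
EM_DASH = "\u2014"


def lines_with_em_dash(text: str) -> list[int]:
    """Return 1-based line numbers that contain an em dash."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if EM_DASH in line
    ]
```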

Data Integrity (Required)

  • Do not invent or estimate values. Every numeric claim must be sourced from src/results/*.csv, docs/metrics/*.md, or external documentation explicitly listed in docs/07_anexo_a.md.
  • If a value is not present in those sources, remove it or mark it as unknown and request clarification.
  • Source of truth for OCR metrics in docs/00-07: use docs/metrics/*.md for both "Resultados del Subconjunto de Ajuste" and "Evaluación del Dataset Completo", and src/results/*.csv for tuning subset values referenced by those sections.

CSV Verification (Required)

Use the CSVs to validate best-trial values and to confirm that tuning-only figures are not confused with full-dataset results.

Interpretation Rules

  • The CSVs are from tuning on pages 5-10, not the full 45-page dataset.
  • Values like “best trial CER” and “best trial WER” must match the CSVs.
  • Full-dataset metrics must be sourced elsewhere and clearly labeled as full evaluation.
  • src/raytune_paddle_subproc_results_20251207_192320.csv is a CPU-only timing reference; do not use it for accuracy claims.
  • GPU results are the primary research driver. CPU results are only used to illustrate timing without GPU.
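
Best-trial values can be recomputed directly from a results CSV. The sketch below uses only the standard library and assumes hypothetical column names (trial_id, cer, wer); the actual headers in src/results/*.csv should be checked first and substituted.

```python
import csv
import io


def best_trial(csv_text: str, metric: str = "cer") -> dict[str, str]:
    """Return the row with the lowest value in the given metric column.

    Assumption: the column names used here are placeholders, not the
    real headers of the project's CSV files.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no trials")
    return min(rows, key=lambda row: float(row[metric]))


if __name__ == "__main__":
    # Illustrative data only; real values come from src/results/*.csv.
    sample = "trial_id,cer,wer\n1,0.12,0.30\n2,0.05,0.21\n3,0.08,0.19\n"
    print(best_trial(sample)["trial_id"])
```

To run it on a file, read the file contents with open(path, newline="", encoding="utf-8") and pass the text in. The result describes the tuning subset only and should not be reported as a full-dataset metric.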