MastersThesis/claude.md
2025-12-16 00:54:36 +01:00

Claude Code Context - Masters Thesis OCR Project

Project Overview

This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.

Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025

Key Context

Why Hyperparameter Optimization Instead of Fine-tuning

Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

  • Fine-tuning deep learning models without GPU is prohibitively slow
  • Inference time is ~69 seconds/page on CPU
  • Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction

Main Results

| Model | CER | Character Accuracy |
|---|---|---|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | 1.49% | 98.51% |

Goal achieved: the target was CER < 2%, and the optimized configuration reaches 1.49%.
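The columns in the results table are linked by simple arithmetic: character accuracy is 100 minus CER, and the improvement is the relative drop in CER against the baseline. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not taken from the repository.

```python
def character_accuracy(cer_percent: float) -> float:
    """Character accuracy as the complement of CER, both in percent."""
    return 100.0 - cer_percent


def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative CER reduction in percent, measured against the baseline."""
    if baseline <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline - optimized) / baseline * 100.0


# Rounded values copied from the results table.
baseline_cer, optimized_cer = 7.78, 1.49
print(f"{character_accuracy(baseline_cer):.2f}")                 # prints 92.22
print(f"{character_accuracy(optimized_cer):.2f}")                # prints 98.51
print(f"{relative_reduction(baseline_cer, optimized_cer):.2f}")  # prints 80.85
```

With the rounded table values the reduction evaluates to about 80.85%. The 80.9% quoted in this file presumably comes from the unrounded CER values, so take the exact figure from the notebook output before citing it.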

Optimal Configuration Found

config_optimizada = {
    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

Key Findings

  1. textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
  2. text_det_thresh has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
  3. Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
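The findings above can be turned into a quick sanity check to run before launching an evaluation. This is only a sketch: `check_config` is a hypothetical helper, and treating the three threshold parameters as values in [0, 1] is an assumption about how PaddleOCR interprets them.

```python
def check_config(config: dict) -> list[str]:
    """Return human-readable warnings for an OCR hyperparameter dict."""
    warnings = []
    # Assumption: threshold-style parameters are scores in [0, 1].
    for key in ("text_det_thresh", "text_det_box_thresh", "text_rec_score_thresh"):
        value = config.get(key)
        if value is None:
            warnings.append(f"{key} is missing")
        elif not 0.0 <= value <= 1.0:
            warnings.append(f"{key}={value} is outside [0, 1]")
    # Finding 2: very low detection thresholds caused catastrophic failures.
    det = config.get("text_det_thresh")
    if det is not None and det < 0.1:
        warnings.append("text_det_thresh < 0.1 caused catastrophic failures in the trials")
    # Finding 1: textline orientation was the most impactful parameter.
    if not config.get("textline_orientation", False):
        warnings.append("textline_orientation is disabled")
    return warnings


config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
print(check_config(config_optimizada))  # prints []
```

A configuration with `text_det_thresh` set to 0.05, for example, would come back with the catastrophic-failure warning.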

Repository Structure

MastersThesis/
├── docs/                    # Thesis chapters in Markdown (UNIR template structure)
│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
│   ├── 01_introduccion.md                 # 1. Introducción (1.1, 1.2, 1.3)
│   ├── 02_contexto_estado_arte.md         # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
│   ├── 03_objetivos_metodologia.md        # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
│   ├── 04_desarrollo_especifico.md        # 4. Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md  # 5. Conclusiones (5.1, 5.2)
│   ├── 06_referencias_bibliograficas.md   # Referencias bibliográficas (APA format)
│   └── 07_anexo_a.md                      # Anexo A: Código fuente y datos
├── thesis_output/           # Generated thesis document
│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
│   └── figures/                           # PNG figures from Mermaid diagrams
│       ├── figura_1.png ... figura_7.png
│       └── figures_manifest.json
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
│   ├── dataset_manager.py                        # ImageTextDataset class
│   ├── prepare_dataset.ipynb                     # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                 # Benchmark results CSVs
├── instructions/            # UNIR instructions and template
│   ├── instrucciones.pdf         # TFE writing guidelines
│   ├── plantilla_individual.pdf  # Word template (PDF version)
│   └── plantilla_individual.htm  # Word template (HTML version, source)
├── apply_content.py         # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py  # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md

docs/ to Template Mapping

The template (plantilla_individual.pdf) requires 5 chapters. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
|---|---|---|
| Resumen | 00_resumen.md (Spanish part) | 150-300 words + Palabras clave |
| Abstract | 00_resumen.md (English part) | 150-300 words + Keywords |
| 1. Introducción | 01_introduccion.md | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | 02_contexto_estado_arte.md | Subsections 2.1, 2.2, 2.3 + Mermaid diagrams |
| 3. Objetivos y metodología | 03_objetivos_metodologia.md | Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams |
| 4. Desarrollo específico | 04_desarrollo_especifico.md | Subsections 4.1, 4.2, 4.3 + Mermaid charts |
| 5. Conclusiones y trabajo futuro | 05_conclusiones_trabajo_futuro.md | Subsections 5.1, 5.2 |
| Referencias bibliográficas | 06_referencias_bibliograficas.md | APA, alphabetical |
| Anexo A | 07_anexo_a.md | Repository URL + structure |

Important Data Files

Results CSV Files

  • src/raytune_paddle_subproc_results_20251207_192320.csv - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)
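Since every reported number has to be read from this CSV, a small helper for locating the best trial may be useful. The sketch below uses only the standard library. The column name `CER` is an assumption based on the metric name in the Ray Tune configuration, and the sample rows are synthetic placeholders, not thesis results.

```python
import csv
import io
import math


def best_trial(rows, metric="CER"):
    """Return the row with the lowest `metric`, skipping unusable rows."""
    valid = []
    for row in rows:
        try:
            value = float(row[metric])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric metric, e.g. a failed trial
        if math.isnan(value):
            continue
        valid.append((value, row))
    if not valid:
        raise ValueError(f"no rows with a numeric {metric!r} column")
    return min(valid, key=lambda pair: pair[0])[1]


# Synthetic stand-in for the real file; these values are NOT thesis results.
sample = io.StringIO("trial_id,CER\na,0.25\nb,0.10\nc,\n")
print(best_trial(csv.DictReader(sample))["trial_id"])  # prints b
```

For the real data, open the CSV with `open(path, newline="")` and pass `csv.DictReader(f)` to `best_trial`, then quote the value exactly as it appears in the file.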

Key Notebooks

  • src/paddle_ocr_fine_tune_unir_raytune.ipynb - Main Ray Tune experiment
  • src/prepare_dataset.ipynb - PDF to image/text conversion
  • ocr_benchmark_notebook.ipynb - EasyOCR vs PaddleOCR vs DocTR comparison

Technical Stack

| Component | Version |
|---|---|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

Pending Work

Completed Tasks

  • Structure docs/ to match UNIR template - All chapters now follow exact numbering (1.1, 1.2, etc.)
  • Add Mermaid diagrams - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
  • Generate unified thesis document - apply_content.py generates complete document from docs/
  • Convert Mermaid to PNG - generate_mermaid_figures.py generates figures automatically
  • Proper template formatting - Tables/figures use Piedefoto-tabla class, references use MsoBibliography

Priority Tasks

  1. Validate on other document types - Test optimal config on invoices, forms, contracts
  2. Expand dataset - Current dataset has only 24 pages
  3. Create presentation slides - For thesis defense
  4. Final document review - Open in Word, update indices (Ctrl+A, F9), verify formatting

Optional Extensions

  • Explore text_det_unclip_ratio parameter (was fixed at 0.0)
  • Compare with actual fine-tuning (if GPU access obtained)
  • Multi-objective optimization (CER + WER + inference time)

Thesis Document Generation

To regenerate the thesis document:

# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py

# 2. Apply docs/ content to UNIR template
python3 apply_content.py

# 3. Open in Word and finalize
# - Open thesis_output/plantilla_individual.htm in Microsoft Word
# - Press Ctrl+A then F9 to update all indices
# - Save as .docx
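The two automated steps can also be chained from Python so that a failure in figure generation stops the build before the template is filled. This is a hypothetical wrapper, not a script that exists in the repository; it assumes it is run from the repository root.

```python
import subprocess
import sys

# Script names come from the repository layout described in this file.
STEPS = ["generate_mermaid_figures.py", "apply_content.py"]


def build_commands(python=sys.executable):
    """Return the commands for the two automated steps, in order."""
    return [[python, script] for script in STEPS]


def regenerate(run=subprocess.run):
    """Run each step in order; check=True raises on the first failure."""
    for command in build_commands():
        run(command, check=True)
```

Calling `regenerate()` runs both scripts. The manual Word step (open the .htm file, Ctrl+A, F9, save as .docx) still has to be done by hand afterwards.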

What apply_content.py does:

  • Replaces Resumen and Abstract with actual content + keywords
  • Replaces all 5 chapters with content from docs/
  • Replaces Referencias with APA-formatted bibliography
  • Replaces Anexo with repository information
  • Converts Mermaid diagrams to embedded PNG images
  • Formats tables with Piedefoto-tabla captions and sources
  • Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)

UNIR TFE Document Guidelines

CRITICAL: The thesis MUST follow UNIR's official template (instructions/plantilla_individual.pdf) and guidelines (instructions/instrucciones.pdf).

Work Type Classification

This thesis is a hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones):

  • Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
  • Experimental pilot with Ray Tune hyperparameter optimization
  • 64 trials executed, results analyzed statistically

Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
|---|---|---|
| Portada | Title, Author, Type, Director, Date | Use template format exactly |
| Resumen | 150-300 words + 3-5 Palabras clave | Spanish summary |
| Abstract | 150-300 words + 3-5 Keywords | English summary |
| Índice de contenidos | Auto-generated | New page |
| Índice de figuras | Auto-generated | New page |
| Índice de tablas | Auto-generated | New page |
| 1. Introducción | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| 2. Contexto y estado del arte | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| 3. Objetivos concretos y metodología | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| 4. Desarrollo específico | Varies by work type (see below) | Main content |
| 5. Conclusiones y trabajo futuro | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| Referencias bibliográficas | APA format, alphabetical, hanging indent | Variable |
| Anexo A | Código fuente y datos analizados | Repository URL |

Total length: 50-90 pages (excluding cover, resumen, abstract, indices, annexes)
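The limits for the Resumen and Abstract (150-300 words, 3-5 keywords) are easy to check mechanically before opening Word. The helper below is a rough sketch with a hypothetical name. It counts whitespace-separated tokens, which may differ slightly from Word's own word count.

```python
def check_summary(text: str, keywords: list[str]) -> list[str]:
    """Return a list of problems with a Resumen/Abstract; empty if it passes."""
    problems = []
    n_words = len(text.split())  # whitespace tokens, an approximation
    if not 150 <= n_words <= 300:
        problems.append(f"summary has {n_words} words, expected 150-300")
    if not 3 <= len(keywords) <= 5:
        problems.append(f"{len(keywords)} keywords given, expected 3-5")
    return problems


draft = " ".join(["palabra"] * 120)  # placeholder text, 120 words
print(check_summary(draft, ["OCR", "PaddleOCR"]))
# prints ['summary has 120 words, expected 150-300', '2 keywords given, expected 3-5']
```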

Chapter-Specific Requirements (from plantilla_individual.pdf)

1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

1.1 Motivación:

  • Present the problem to solve
  • Justify importance to educational/scientific community
  • Answer: What problem? What are the causes? Why is it relevant?
  • Must include references to prior research

1.2 Planteamiento del trabajo:

  • Briefly state the problem/need detected
  • Describe the proposal and purpose
  • Answer: How to solve? What is proposed?

1.3 Estructura del trabajo:

  • Briefly describe what each subsequent chapter contains

2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

2.1 Contexto del problema:

  • Deep study of the application domain

2.2 Estado del arte:

  • Antecedents, current studies, comparison of existing tools
  • Must reference key authors in the field (justify exclusions)

2.3 Conclusiones:

  • Summary linking research to the work to be done
  • How findings affect the specific development

3. Objetivos concretos y metodología de trabajo

Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

3.1 Objetivo general:

  • Must be SMART (Doran, 1981)
  • Focus on achieving an observable effect, not just "create a tool"
  • Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

3.2 Objetivos específicos:

  • Divide general objective into analyzable sub-objectives
  • Must be SMART
  • Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
  • Typically ~5 objectives: 1-2 about state of art, 2-3 about development

3.3 Metodología del trabajo:

  • Describe steps to achieve objectives
  • Explain WHY each step
  • What instruments will be used
  • How results will be analyzed

4. Desarrollo específico de la contribución

Structure depends on work type. Organize by methodology phases/activities.

For Type 1 (Piloto experimental):

  • 4.1 Descripción detallada del experimento
    • Technologies used (with justification)
    • How pilot was organized
    • Participants (demographics)
    • Automatic evaluation techniques
    • How experiment proceeded
    • Monitoring/evaluation instruments
    • Statistical analysis types
  • 4.2 Descripción de los resultados (objective, no interpretation)
    • Summary tables, result graphs, relevant data identification
  • 4.3 Discusión
    • Relevance of results, explanations for anomalies, highlight key findings

For Type 3 (Comparativa de soluciones):

  • 4.1 Planteamiento de la comparativa
    • Problem identification, alternative solutions to evaluate
    • Success criteria, measures to take
  • 4.2 Desarrollo de la comparativa
    • All results and measurements obtained
    • Graphs, tables, data visualization
  • 4.3 Discusión y análisis de resultados
    • Discussion of meaning, advantages/disadvantages of solutions

5. Conclusiones y trabajo futuro

5.1 Conclusiones:

  • Summary of problem, approach, and why solution is valid
  • Summary of contributions
  • Relate contributions and results to objectives - discuss degree of achievement

5.2 Líneas de trabajo futuro:

  • Future work that would add value
  • Justify how contribution can be used and in what fields

SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

| Criterion | Requirement | Example from this thesis |
|---|---|---|
| Specific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| Measurable | Quantifiable success metric | "CER < 2%" |
| Attainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| Relevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| Time-bound | Achievable in timeframe | "Un cuatrimestre" |

Citation and Reference Rules

APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

In-text citations:

  • Single author: (Du, 2020) or Du (2020)
  • Two authors: (Du & Li, 2020)
  • Three+ authors: (Du et al., 2020)
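The three parenthetical forms above follow a simple rule on the number of authors, sketched here. `in_text_citation` is an illustrative helper, not part of the repository, and it covers only the parenthetical form.

```python
def in_text_citation(surnames: list[str], year: int) -> str:
    """Build a parenthetical APA in-text citation from author surnames."""
    if not surnames:
        raise ValueError("at least one author surname is required")
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = f"{surnames[0]} & {surnames[1]}"
    else:
        authors = f"{surnames[0]} et al."  # three or more authors
    return f"({authors}, {year})"


print(in_text_citation(["Du"], 2020))                # prints (Du, 2020)
print(in_text_citation(["Du", "Li"], 2020))          # prints (Du & Li, 2020)
print(in_text_citation(["Du", "Li", "Guo"], 2020))   # prints (Du et al., 2020)
```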

Reference list examples:

# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
  for image-based sequence recognition. IEEE Transactions on Pattern
  Analysis and Machine Intelligence, 39(11), 2298-2304.
  https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
  A next-generation hyperparameter optimization framework. Proceedings
  of the 25th ACM SIGKDD International Conference on Knowledge Discovery
  & Data Mining, 2623-2631.
  https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
  lightweight OCR system. arXiv preprint arXiv:2009.09941.
  https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
  on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
  (2nd ed.). Lawrence Erlbaum Associates.

Reference Rules

  • NO Wikipedia citations
  • Include variety: books, conferences, journal articles (not just URLs)
  • All cited references must appear in reference list
  • All references in list must be cited in text
  • Order alphabetically by first author's surname
  • Include DOI or URL when available
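The consistency rules above (every citation listed, every entry cited, alphabetical order) reduce to set differences and a pairwise order check. The sketch below assumes each reference is identified by a short key such as "Akiba 2019"; that key format is an illustration, not a repository convention. The plain string comparison does not apply Spanish collation for accented surnames.

```python
def check_references(cited: set[str], listed: list[str]) -> dict[str, list[str]]:
    """Compare in-text citation keys with the ordered reference list."""
    listed_set = set(listed)
    return {
        "cited_but_not_listed": sorted(cited - listed_set),
        "listed_but_not_cited": sorted(listed_set - cited),
        # Entries that sort before their predecessor break alphabetical order.
        "out_of_order": [
            later
            for earlier, later in zip(listed, listed[1:])
            if earlier.casefold() > later.casefold()
        ],
    }


cited = {"Akiba 2019", "Du 2020", "Cohen 1988"}
listed = ["Akiba 2019", "Shi 2016", "Du 2020"]
print(check_references(cited, listed))
```

For this sample, "Cohen 1988" is reported as cited but not listed, "Shi 2016" as listed but not cited, and "Du 2020" as out of order.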

Document Formatting Rules

Page Setup

| Element | Specification |
|---|---|
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |

Typography

| Element | Format |
|---|---|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |

Tables and Figures (from plantilla_individual.pdf)

Table format example:

Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.

Figure format example:

Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.

Rules:

  • Title position: Above the table/figure
  • Numbering format: "Tabla 1." / "Figura 1." (Calibri 12, bold)
  • Title text: Calibri 12, italic (after the number)
  • Source: Below, centered, format "Fuente: Author, Year."
  • Can reduce font to 9pt for dense tables
  • Can use landscape orientation for large tables
  • Tables should have horizontal lines only (no vertical lines) per APA style

Writing Style Rules

MUST DO:

  • Each chapter starts with introductory paragraph explaining content
  • Each paragraph has at least 3 sentences
  • Verify originality (cite all sources)
  • Check spelling with Word corrector
  • Ensure logical flow between paragraphs
  • Define concepts and include pertinent citations

MUST NOT DO:

  • Two consecutive headings without text between them
  • Superfluous phrases and repetition of ideas
  • Short paragraphs (less than 3 sentences)
  • Missing figure/table numbers or titles
  • Broken index generation
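The rule against two consecutive headings without text between them can be checked on the Markdown chapters in docs/ before generating the document. This is a minimal sketch that assumes ATX-style headings (lines starting with one to six `#` characters). `consecutive_headings` is a hypothetical helper, not part of apply_content.py.

```python
import re

HEADING = re.compile(r"#{1,6}\s+\S")
FENCE = "`" * 3  # Markdown code fence marker


def consecutive_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, next_heading) pairs that have no text between them."""
    pairs = []
    previous = None   # last heading seen with no content after it yet
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            previous = None  # a code block counts as content
            continue
        if in_fence or not stripped:
            continue  # ignore code lines and blank lines
        if HEADING.match(stripped):
            if previous is not None:
                pairs.append((previous, stripped))
            previous = stripped
        else:
            previous = None  # ordinary text separates headings
    return pairs


sample = "# 4. Desarrollo\n\n## 4.1 Planteamiento\n\nTexto.\n\n## 4.2 Resultados\nTexto.\n"
print(consecutive_headings(sample))
# prints [('# 4. Desarrollo', '## 4.1 Planteamiento')]
```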

Annexes Requirements

Anexo A - Código fuente y datos:

  • Include repository URL where code is hosted
  • Student must be sole author and owner of repository
  • No commits from other users
  • Data used should also be in repository
  • If confidential (company project), justify why not shared

Final Submission

  • Drafts: Submit in Word format
  • Final deposit: Submit in PDF format
  • Verify all indices generate correctly before final submission

Guidelines for Claude

CRITICAL: Academic Rigor Requirements

This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.

DO NOT:

  • Fabricate data or statistics - every number must come from an actual file in this repository
  • Invent comparison results - if we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
  • Assume or estimate values - if a metric isn't in the CSV/notebook, don't include it
  • Extrapolate beyond what the data shows - 24 pages is a limited dataset; acknowledge this
  • Claim results that weren't measured - only report what was actually computed

ALWAYS:

  • Read the source file first before citing any result
  • Quote exact values from CSV files (e.g., CER 0.011535 not "approximately 1%")
  • Reference the specific file and location for every data point
  • Acknowledge limitations explicitly (dataset size, CPU-only, single document type)
  • Distinguish between measured results and interpretations

Data Sources (ONLY use these):

| Data Type | Source File |
|---|---|
| Ray Tune 64 trials | src/raytune_paddle_subproc_results_20251207_192320.csv |
| Experiment code | src/paddle_ocr_fine_tune_unir_raytune.ipynb |
| Final comparison | Output cells in the notebook (baseline vs optimized) |

Example of WRONG vs RIGHT:

WRONG: "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER" (We don't have this comparison data in our results files)

RIGHT: "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in paddle_ocr_fine_tune_unir_raytune.ipynb)"

WRONG: "The optimization improved results by approximately 80%"

RIGHT: "From the 64 trials in raytune_paddle_subproc_results_20251207_192320.csv, minimum CER achieved was 1.15%"

When Working on Documentation

  1. Read UNIR guidelines first: Check instructions/instrucciones.pdf for structure requirements

  2. Follow chapter structure: Each chapter has specific content requirements per UNIR guidelines

  3. References are UNIFIED: All references go in docs/06_referencias_bibliograficas.md, NOT per-chapter

  4. Use APA format: All citations must follow APA style

  5. Include "Fuentes de datos": Each chapter should list which repository files the data came from

  6. Language: Documentation is in Spanish (thesis requirement), code comments in English

  7. Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation

  8. When in doubt, ask: If the user requests data that doesn't exist, ask rather than inventing numbers

  9. DIAGRAMS MUST BE IN MERMAID FORMAT: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:

    • Version control friendly (text-based)
    • Consistent styling across all chapters
    • Easy to edit and maintain
    • Renders properly in GitHub and most Markdown viewers

    Supported Mermaid diagram types:

    • flowchart / graph - For pipelines, workflows, architectures
    • xychart-beta - For bar charts, comparisons
    • sequenceDiagram - For process interactions
    • classDiagram - For class structures
    • stateDiagram - For state machines
    • pie - For proportional data

    Example:

    flowchart LR
        A[Input] --> B[Process] --> C[Output]
    

Common Tasks

  • Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
  • Updating documentation: Edit files in docs/
  • Adding references: Add to docs/06_referencias_bibliograficas.md (unified list)
  • Dataset expansion: Use src/prepare_dataset.ipynb as template
  • Running evaluations: Use src/paddle_ocr_tuning.py CLI

Experiment Details

Ray Tune Configuration

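from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is assumed to be the trainable defined in the experiment notebook
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",               # metric reported by each trial
        mode="min",                 # lower CER is better
        search_alg=OptunaSearch(),  # Optuna proposes each new configuration
        num_samples=64,             # total number of trials
        max_concurrent_trials=2,    # at most two trials run at once
    ),
)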

Dataset

  • Source: UNIR TFE instructions PDF
  • Pages: 24
  • Resolution: 300 DPI
  • Ground truth: Extracted via PyMuPDF

Metrics

  • CER (Character Error Rate) - Primary metric
  • WER (Word Error Rate) - Secondary metric
  • Calculated using jiwer library
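For reference, both metrics are edit-distance ratios: CER divides the character-level Levenshtein distance by the reference length, and WER does the same over words. The sketch below is a plain-Python stand-in for illustration, not the jiwer implementation used in the experiments. jiwer may apply its own text normalization, so values can differ slightly.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("kitten", "sitting"))  # prints 0.5
print(round(wer("el gato negro", "el gato"), 4))  # prints 0.3333
```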