MastersThesis/claude.md
2025-12-10 22:34:33 +01:00

Claude Code Context - Masters Thesis OCR Project

Project Overview

This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.

Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025

Key Context

Why Hyperparameter Optimization Instead of Fine-tuning

Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

  • Fine-tuning deep learning models without GPU is prohibitively slow
  • Inference time is ~69 seconds/page on CPU
  • Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction

Main Results

| Model | CER | Character Accuracy |
|---|---|---|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | 1.49% | 98.51% |

Goal achieved: the target was CER < 2%, and the optimized configuration reached 1.49%.
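The headline figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded values from the results table; the function names are illustrative. Because the inputs are rounded to two decimals, the computed reduction can differ from the reported 80.9% in the last digit, so always quote the value from the notebook output rather than this recomputation.

```python
# Values copied from the results table (rounded to two decimals).
baseline_cer = 7.78   # PaddleOCR Baseline, CER in percent
optimized_cer = 1.49  # PaddleOCR-HyperAdjust, CER in percent


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def character_accuracy(cer_percent: float) -> float:
    """Character accuracy as the complement of CER, in percent."""
    return 100 - cer_percent


reduction = relative_reduction(baseline_cer, optimized_cer)
print(f"Relative CER reduction: {reduction:.2f}%")
print(f"Baseline accuracy:  {character_accuracy(baseline_cer):.2f}%")
print(f"Optimized accuracy: {character_accuracy(optimized_cer):.2f}%")
```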

Optimal Configuration Found

config_optimizada = {
    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

Key Findings

  1. textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
  2. text_det_thresh has a -0.52 correlation with CER; values below 0.1 cause catastrophic failures
  3. Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
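The findings above can be turned into a small guard that flags risky configurations before a long CPU run. This is only a sketch: `validate_config` is a hypothetical helper, not part of the repository, and the [0, 1] range check assumes the three thresholds are probability-like scores.

```python
# Optimal configuration as listed in this document.
config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means no known issue was found."""
    problems = []
    # Assumption: the score thresholds are probability-like and lie in [0, 1].
    for key in ("text_det_thresh", "text_det_box_thresh", "text_rec_score_thresh"):
        value = config.get(key)
        if value is None:
            problems.append(f"{key} is missing")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{key}={value} is outside [0, 1]")
    # Finding 2: text_det_thresh values below 0.1 caused catastrophic failures.
    det = config.get("text_det_thresh")
    if det is not None and det < 0.1:
        problems.append(f"text_det_thresh={det} is below 0.1")
    # Finding 1: textline_orientation=True was the most impactful setting.
    if not config.get("textline_orientation", False):
        problems.append("textline_orientation is not enabled")
    return problems


print(validate_config(config_optimizada))
risky = dict(config_optimizada, text_det_thresh=0.05)
print(validate_config(risky))
```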

Repository Structure

MastersThesis/
├── docs/                    # Thesis chapters in Markdown (matches template structure)
│   ├── 00_resumen.md        # Resumen + Abstract
│   ├── 01_introduccion.md   # Chapter 1: Introducción
│   ├── 02_contexto_estado_arte.md  # Chapter 2: Contexto y estado del arte
│   ├── 03_objetivos_metodologia.md # Chapter 3: Objetivos y metodología
│   ├── 04_desarrollo_especifico.md # Chapter 4: Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md # Chapter 5: Conclusiones
│   └── 06_referencias_bibliograficas.md  # Referencias bibliográficas
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
│   ├── dataset_manager.py                        # ImageTextDataset class
│   ├── prepare_dataset.ipynb                     # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                 # Benchmark results CSVs
├── instructions/            # UNIR instructions and template
│   ├── instrucciones.pdf    # TFE writing guidelines
│   └── plantilla_individual.pdf  # Word template (PDF version)
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md

docs/ to Template Mapping

The template (plantilla_individual.pdf) requires 5 chapters. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
|---|---|---|
| Resumen | 00_resumen.md (Spanish part) | 150-300 words |
| Abstract | 00_resumen.md (English part) | 150-300 words |
| 1. Introducción | 01_introduccion.md | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | 02_contexto_estado_arte.md | Subsections 2.1, 2.2, 2.3 |
| 3. Objetivos y metodología | 03_objetivos_metodologia.md | Subsections 3.1, 3.2, 3.3 |
| 4. Desarrollo específico | 04_desarrollo_especifico.md | Includes 4.1, 4.2, 4.3 |
| 5. Conclusiones y trabajo futuro | 05_conclusiones_trabajo_futuro.md | Subsections 5.1, 5.2 |
| Referencias bibliográficas | 06_referencias_bibliograficas.md | APA, alphabetical |
| Anexo A | (create from README) | Repository URL |

Important Data Files

Results CSV Files

  • src/raytune_paddle_subproc_results_20251207_192320.csv - 64 Ray Tune trials with configs and metrics
  • results/ai_ocr_benchmark_finetune_results_20251206_113206.csv - Per-page OCR benchmark results
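When quoting a value from the trials file, it helps to read it programmatically rather than copy it by eye. The sketch below selects the row with the lowest metric using only the standard library. The column name `CER` is an assumption based on the metric name in the Ray Tune configuration; check the real CSV header first, since Ray Tune exports often prefix hyperparameter columns with `config/`. The demo rows are invented placeholders, not project results.

```python
import csv
import io


def best_trial(csv_file, metric: str = "CER") -> dict:
    """Return the row with the lowest value of `metric` from an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("no trials found")
    return min(rows, key=lambda row: float(row[metric]))


# In-memory stand-in for the real file. These numbers are made up for the
# demo and are NOT results from this project.
demo = io.StringIO(
    "trial,CER,text_det_thresh\n"
    "0,0.30,0.05\n"
    "1,0.02,0.45\n"
    "2,0.10,0.20\n"
)
best = best_trial(demo)
print(best["trial"], best["CER"])
```

For the real data, open `src/raytune_paddle_subproc_results_20251207_192320.csv` with `open(path, newline="", encoding="utf-8")` and pass the file object to `best_trial`. Values come back as the exact strings stored in the file, which suits the rule of quoting exact values.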

Key Notebooks

  • src/paddle_ocr_fine_tune_unir_raytune.ipynb - Main Ray Tune experiment
  • src/prepare_dataset.ipynb - PDF to image/text conversion
  • ocr_benchmark_notebook.ipynb - EasyOCR vs PaddleOCR vs DocTR comparison

Technical Stack

| Component | Version |
|---|---|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

Pending Work

Priority Tasks

  1. Validate on other document types - Test optimal config on invoices, forms, contracts
  2. Expand dataset - Current dataset has only 24 pages
  3. Complete unified thesis document - Merge docs/ chapters into final UNIR format
  4. Create presentation slides - For thesis defense

Optional Extensions

  • Explore the text_det_unclip_ratio parameter (fixed at 0.0 in these experiments)
  • Compare with actual fine-tuning (if GPU access obtained)
  • Multi-objective optimization (CER + WER + inference time)

UNIR TFE Document Guidelines

CRITICAL: The thesis MUST follow UNIR's official template (instructions/plantilla_individual.pdf) and guidelines (instructions/instrucciones.pdf).

Work Type Classification

This thesis is a hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones):

  • Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
  • Experimental pilot with Ray Tune hyperparameter optimization
  • 64 trials executed, results analyzed statistically
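One form the statistical analysis can take is a correlation between a hyperparameter and CER across trials. The sketch below computes Pearson's r with the standard library. It assumes the -0.52 figure for text_det_thresh is a Pearson coefficient, which this document does not state, and the column names and demo values are placeholders rather than project data.

```python
import csv
import io
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def column_correlation(csv_file, x_col: str, y_col: str) -> float:
    """Read two numeric columns from an open CSV file and correlate them."""
    rows = list(csv.DictReader(csv_file))
    xs = [float(row[x_col]) for row in rows]
    ys = [float(row[y_col]) for row in rows]
    return pearson(xs, ys)


# Made-up demo rows (NOT project results): CER falls as the threshold rises.
demo = io.StringIO("text_det_thresh,CER\n0.1,0.30\n0.2,0.20\n0.3,0.10\n")
r = column_correlation(demo, "text_det_thresh", "CER")
print(f"r = {r:.2f}")
```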

Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
|---|---|---|
| Portada | Title, Author, Type, Director, Date | Use template format exactly |
| Resumen | 150-300 words + 3-5 Palabras clave | Spanish summary |
| Abstract | 150-300 words + 3-5 Keywords | English summary |
| Índice de contenidos | Auto-generated | New page |
| Índice de figuras | Auto-generated | New page |
| Índice de tablas | Auto-generated | New page |
| 1. Introducción | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| 2. Contexto y estado del arte | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| 3. Objetivos concretos y metodología | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| 4. Desarrollo específico | Varies by work type (see below) | Main content |
| 5. Conclusiones y trabajo futuro | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| Referencias bibliográficas | APA format, alphabetical, hanging indent | Variable |
| Anexo A | Código fuente y datos analizados | Repository URL |

Total length: 50-90 pages (excluding cover, resumen, abstract, indices, annexes)

Chapter-Specific Requirements (from plantilla_individual.pdf)

1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

1.1 Motivación:

  • Present the problem to solve
  • Justify importance to educational/scientific community
  • Answer: What problem? What are the causes? Why is it relevant?
  • Must include references to prior research

1.2 Planteamiento del trabajo:

  • Briefly state the problem/need detected
  • Describe the proposal and purpose
  • Answer: How to solve? What is proposed?

1.3 Estructura del trabajo:

  • Briefly describe what each subsequent chapter contains

2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

2.1 Contexto del problema:

  • Deep study of the application domain

2.2 Estado del arte:

  • Antecedents, current studies, comparison of existing tools
  • Must reference key authors in the field (justify exclusions)

2.3 Conclusiones:

  • Summary linking research to the work to be done
  • How findings affect the specific development

3. Objetivos concretos y metodología de trabajo

Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

3.1 Objetivo general:

  • Must be SMART (Doran, 1981)
  • Focus on achieving an observable effect, not just "create a tool"
  • Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

3.2 Objetivos específicos:

  • Divide general objective into analyzable sub-objectives
  • Must be SMART
  • Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
  • Typically ~5 objectives: 1-2 about the state of the art, 2-3 about development

3.3 Metodología del trabajo:

  • Describe steps to achieve objectives
  • Explain WHY each step
  • What instruments will be used
  • How results will be analyzed

4. Desarrollo específico de la contribución

Structure depends on work type. Organize by methodology phases/activities.

For Type 1 (Piloto experimental):

  • 4.1 Descripción detallada del experimento
    • Technologies used (with justification)
    • How pilot was organized
    • Participants (demographics)
    • Automatic evaluation techniques
    • How experiment proceeded
    • Monitoring/evaluation instruments
    • Statistical analysis types
  • 4.2 Descripción de los resultados (objective, no interpretation)
    • Summary tables, result graphs, relevant data identification
  • 4.3 Discusión
    • Relevance of results, explanations for anomalies, highlight key findings

For Type 3 (Comparativa de soluciones):

  • 4.1 Planteamiento de la comparativa
    • Problem identification, alternative solutions to evaluate
    • Success criteria, measures to take
  • 4.2 Desarrollo de la comparativa
    • All results and measurements obtained
    • Graphs, tables, data visualization
  • 4.3 Discusión y análisis de resultados
    • Discussion of meaning, advantages/disadvantages of solutions

5. Conclusiones y trabajo futuro

5.1 Conclusiones:

  • Summary of problem, approach, and why solution is valid
  • Summary of contributions
  • Relate contributions and results to objectives - discuss degree of achievement

5.2 Líneas de trabajo futuro:

  • Future work that would add value
  • Justify how contribution can be used and in what fields

SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

| Criterion | Requirement | Example from this thesis |
|---|---|---|
| Specific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| Measurable | Quantifiable success metric | "CER < 2%" |
| Attainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| Relevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| Time-bound | Achievable in timeframe | "Un cuatrimestre" |

Citation and Reference Rules

APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

In-text citations:

  • Single author: (Du, 2020) or Du (2020)
  • Two authors: (Du & Li, 2020)
  • Three+ authors: (Du et al., 2020)
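The three parenthetical patterns above reduce to a rule on the number of authors. A minimal sketch follows; `in_text_citation` is a hypothetical helper that covers only the parenthetical form shown in the list.

```python
def in_text_citation(surnames: list[str], year: int) -> str:
    """Build a parenthetical citation from author surnames and a year."""
    if not surnames:
        raise ValueError("at least one author is required")
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = f"{surnames[0]} & {surnames[1]}"
    else:
        # Three or more authors: first surname followed by "et al."
        authors = f"{surnames[0]} et al."
    return f"({authors}, {year})"


print(in_text_citation(["Du"], 2020))
print(in_text_citation(["Du", "Li"], 2020))
print(in_text_citation(["Du", "Li", "Guo"], 2020))
```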

Reference list examples:

# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
  for image-based sequence recognition. IEEE Transactions on Pattern
  Analysis and Machine Intelligence, 39(11), 2298-2304.
  https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
  A next-generation hyperparameter optimization framework. Proceedings
  of the 25th ACM SIGKDD International Conference on Knowledge Discovery
  & Data Mining, 2623-2631.
  https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
  lightweight OCR system. arXiv preprint arXiv:2009.09941.
  https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
  on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
  (2nd ed.). Lawrence Erlbaum Associates.

Reference Rules

  • NO Wikipedia citations
  • Include variety: books, conferences, journal articles (not just URLs)
  • All cited references must appear in reference list
  • All references in list must be cited in text
  • Order alphabetically by first author's surname
  • Include DOI or URL when available
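The cross-checking rules (cited implies listed, listed implies cited, alphabetical order) are easy to verify mechanically once citations and references are reduced to keys. The sketch below assumes keys of the form "Surname Year"; extracting them from the Markdown chapters is left out, and `check_references` is a hypothetical helper. The ordering check uses plain case-insensitive comparison, so accented surnames may need locale-aware sorting.

```python
def check_references(cited: set[str], listed: list[str]) -> dict[str, list[str]]:
    """Compare in-text citation keys against the ordered reference list."""
    listed_set = set(listed)
    return {
        "cited_but_not_listed": sorted(cited - listed_set),
        "listed_but_not_cited": sorted(listed_set - cited),
        # An entry is out of order if it sorts before the entry preceding it.
        "out_of_order": [
            current
            for previous, current in zip(listed, listed[1:])
            if current.casefold() < previous.casefold()
        ],
    }


cited = {"Akiba 2019", "Du 2020", "Shi 2016"}
listed = ["Akiba 2019", "Shi 2016", "Cohen 1988"]
report = check_references(cited, listed)
for issue, keys in report.items():
    print(issue, keys)
```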

Document Formatting Rules

Page Setup

| Element | Specification |
|---|---|
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |

Typography

| Element | Format |
|---|---|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |

Tables and Figures (from plantilla_individual.pdf)

Table format example:

Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.

Figure format example:

Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.

Rules:

  • Title position: Above the table/figure
  • Numbering format: "Tabla 1." / "Figura 1." (Calibri 12, bold)
  • Title text: Calibri 12, italic (after the number)
  • Source: Below, centered, format "Fuente: Author, Year."
  • Can reduce font to 9pt for dense tables
  • Can use landscape orientation for large tables
  • Tables should have horizontal lines only (no vertical lines) per APA style

Writing Style Rules

MUST DO:

  • Each chapter starts with introductory paragraph explaining content
  • Each paragraph has at least 3 sentences
  • Verify originality (cite all sources)
  • Check spelling with Word corrector
  • Ensure logical flow between paragraphs
  • Define concepts and include pertinent citations

MUST NOT DO:

  • Two consecutive headings without text between them
  • Superfluous phrases and repetition of ideas
  • Short paragraphs (fewer than 3 sentences)
  • Missing figure/table numbers or titles
  • Broken index generation
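Two of these rules (no consecutive headings, at least 3 sentences per paragraph) can be checked roughly on the Markdown chapters in docs/. The sketch below is a deliberately naive linter. `lint_markdown` is a hypothetical helper, sentence counting relies on punctuation followed by whitespace (so abbreviations such as "et al." are over-counted), and lines starting with list, table, or quote markers are skipped rather than analyzed. Fenced code blocks are not handled.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")
NON_PROSE_PREFIXES = ("-", "*", "|", ">")


def lint_markdown(text: str, min_sentences: int = 3) -> list[str]:
    """Return warnings for consecutive headings and short paragraphs."""
    warnings: list[str] = []
    paragraph: list[str] = []
    last_heading = None  # heading that has had no text after it yet

    def flush() -> None:
        # Close the current paragraph and check its sentence count.
        if paragraph:
            body = " ".join(paragraph)
            count = len(SENTENCE_END.findall(body))
            if count < min_sentences:
                warnings.append(f"short paragraph ({count} sentences): {body[:40]}")
            paragraph.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            flush()
            if last_heading is not None:
                warnings.append(f"no text between headings: {last_heading} / {stripped}")
            last_heading = stripped
        elif not stripped:
            flush()
        elif stripped.startswith(NON_PROSE_PREFIXES):
            flush()
            last_heading = None
        else:
            paragraph.append(stripped)
            last_heading = None
    flush()
    return warnings


sample = (
    "# Capítulo\n## Sección\n\n"
    "Una frase. Otra frase.\n\n"
    "Primera. Segunda. Tercera.\n"
)
lint_warnings = lint_markdown(sample)
for warning in lint_warnings:
    print(warning)
```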

Annexes Requirements

Anexo A - Código fuente y datos:

  • Include repository URL where code is hosted
  • Student must be sole author and owner of repository
  • No commits from other users
  • Data used should also be in repository
  • If confidential (company project), justify why not shared

Final Submission

  • Drafts: Submit in Word format
  • Final deposit: Submit in PDF format
  • Verify all indices generate correctly before final submission

Guidelines for Claude

CRITICAL: Academic Rigor Requirements

This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.

NEVER:

  • Fabricate data or statistics - Every number must come from an actual file in this repository
  • Invent comparison results - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
  • Assume or estimate values - If a metric isn't in the CSV/notebook, don't include it
  • Extrapolate beyond what the data shows - 24 pages is a limited dataset; acknowledge this
  • Claim results that weren't measured - Only report what was actually computed

ALWAYS:

  • Read the source file first before citing any result
  • Quote exact values from CSV files (e.g., CER 0.011535 not "approximately 1%")
  • Reference the specific file and location for every data point
  • Acknowledge limitations explicitly (dataset size, CPU-only, single document type)
  • Distinguish between measured results and interpretations

Data Sources (ONLY use these):

| Data Type | Source File |
|---|---|
| Ray Tune 64 trials | src/raytune_paddle_subproc_results_20251207_192320.csv |
| Per-page benchmark | results/ai_ocr_benchmark_finetune_results_20251206_113206.csv |
| Experiment code | src/paddle_ocr_fine_tune_unir_raytune.ipynb |
| Final comparison | Output cells in the notebook (baseline vs optimized) |

Example of WRONG vs RIGHT:

WRONG: "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER" (We don't have this comparison data in our results files)

RIGHT: "PaddleOCR with baseline configuration achieved CER between 1.54% and 6.40% across pages 5-9 (source: results/ai_ocr_benchmark_finetune_results_20251206_113206.csv)"

WRONG: "The optimization improved results by approximately 80%"

RIGHT: "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in paddle_ocr_fine_tune_unir_raytune.ipynb)"

When Working on Documentation

  1. Read UNIR guidelines first: Check instructions/instrucciones.pdf for structure requirements

  2. Follow chapter structure: Each chapter has specific content requirements per UNIR guidelines

  3. References are UNIFIED: All references go in docs/06_referencias_bibliograficas.md, NOT per-chapter

  4. Use APA format: All citations must follow APA style

  5. Include "Fuentes de datos": Each chapter should list which repository files the data came from

  6. Language: Documentation is in Spanish (thesis requirement), code comments in English

  7. Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation

  8. When in doubt, ask: If the user requests data that doesn't exist, ask rather than inventing numbers

Common Tasks

  • Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
  • Updating documentation: Edit files in docs/
  • Adding references: Add to docs/06_referencias_bibliograficas.md (unified list)
  • Dataset expansion: Use src/prepare_dataset.ipynb as template
  • Running evaluations: Use src/paddle_ocr_tuning.py CLI

Experiment Details

Ray Tune Configuration

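from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is expected to come from the experiment notebook;
# the search space (param_space) is not shown in this excerpt.
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",               # objective reported by each trial
        mode="min",                 # lower CER is better
        search_alg=OptunaSearch(),  # Optuna proposes each new configuration
        num_samples=64,             # total number of trials
        max_concurrent_trials=2,    # at most two trials run at the same time
    ),
)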

Dataset

  • Source: UNIR TFE instructions PDF
  • Pages: 24
  • Resolution: 300 DPI
  • Ground truth: Extracted via PyMuPDF
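The dataset bullets imply a render-and-extract step per page. The sketch below shows one way this could look with PyMuPDF, plus the arithmetic for the pixel size of a page at 300 DPI. Both function names are hypothetical, the actual logic lives in src/prepare_dataset.ipynb and may differ, and the A4 figures assume the source PDF uses A4 pages, which this document does not state.

```python
def pixels_at_dpi(length_mm: float, dpi: int = 300) -> int:
    """Convert a physical length in millimetres to pixels at the given DPI."""
    return round(length_mm / 25.4 * dpi)


def render_page(pdf_path: str, page_index: int, dpi: int = 300) -> tuple[bytes, str]:
    """Return (PNG bytes, embedded text) for one page; requires PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so pixels_at_dpi has no dependencies

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        # Rasterize the page at the target resolution and encode it as PNG.
        png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
        # The embedded text layer serves as ground truth for CER/WER.
        text = page.get_text("text")
    finally:
        doc.close()
    return png_bytes, text


# Assumption: A4 pages (210 mm x 297 mm).
a4_width_px = pixels_at_dpi(210)
a4_height_px = pixels_at_dpi(297)
print(a4_width_px, a4_height_px)
```

At this resolution an A4 page is roughly 2480 x 3508 pixels, which is consistent with the long CPU inference time per page noted earlier.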

Metrics

  • CER (Character Error Rate) - Primary metric
  • WER (Word Error Rate) - Secondary metric
  • Calculated using jiwer library
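For reference, both metrics are edit distances normalized by the length of the reference: CER over characters and WER over words. The project computes them with jiwer; the sketch below is a dependency-free illustration of the standard definitions, not the project's code, and jiwer's default text normalization may give slightly different values on real pages.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("optimización", "optimizacion"))
print(wer("el texto reconocido", "el testo reconocido"))
```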