sergio 57df34ac5a deliberable_16_12_2025

2025-12-16 00:54:36 +01:00

Claude Code Context - Master's Thesis OCR Project

Project Overview

This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.

Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025

Key Context

Why Hyperparameter Optimization Instead of Fine-tuning

Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

Fine-tuning deep learning models without GPU is prohibitively slow
Inference time is ~69 seconds/page on CPU
Hyperparameter optimization proved to be an effective alternative, achieving an 80.9% CER reduction

Main Results

Model	CER	Character Accuracy
PaddleOCR Baseline	7.78%	92.22%
PaddleOCR-HyperAdjust	1.49%	98.51%

Goal achieved: the optimized CER is 1.49%, below the target of CER < 2%.
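
The derived figures in this section follow from simple arithmetic, sketched below. The function names are illustrative and not taken from the repository. Note that the rounded table values give roughly 80.85%; the gap to the reported 80.9% is presumably because the notebook computes the reduction from unrounded CER values, which is an assumption to verify against the notebook output.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of a metric, expressed in percent of the baseline."""
    return (baseline - optimized) / baseline * 100.0


def character_accuracy(cer_percent: float) -> float:
    """Character accuracy as the complement of CER (both in percent)."""
    return 100.0 - cer_percent


# Rounded values copied from the Main Results table above.
baseline_cer = 7.78
optimized_cer = 1.49

print(f"{relative_reduction(baseline_cer, optimized_cer):.2f}")  # 80.85
print(f"{character_accuracy(baseline_cer):.2f}")                 # 92.22
print(f"{character_accuracy(optimized_cer):.2f}")                # 98.51
```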

Optimal Configuration Found

config_optimizada = {
    "textline_orientation": True,           # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

Key Findings

textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
text_det_thresh has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
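
The findings above can be turned into a small sanity check for candidate configurations. This is a minimal sketch, not code from the repository: the function name `check_config` is hypothetical, and the assumption that the three score thresholds must lie in [0, 1] is ours. The 0.1 floor for text_det_thresh comes directly from the findings listed above.

```python
def check_config(config: dict) -> list[str]:
    """Return human-readable warnings for a PaddleOCR-style config dict."""
    warnings = []

    # Assumption: these thresholds are scores and should lie in [0, 1].
    for key in ("text_det_thresh", "text_det_box_thresh", "text_rec_score_thresh"):
        value = config.get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            warnings.append(f"{key}={value} is outside [0, 1]")

    # Finding: text_det_thresh values below 0.1 cause catastrophic failures.
    det_thresh = config.get("text_det_thresh")
    if det_thresh is not None and det_thresh < 0.1:
        warnings.append("text_det_thresh < 0.1 is associated with catastrophic failures")

    # Finding: textline_orientation=True is the most impactful parameter.
    if not config.get("textline_orientation", False):
        warnings.append("textline_orientation is disabled")

    # Finding: document correction modules are unnecessary for digital PDFs.
    for key in ("use_doc_orientation_classify", "use_doc_unwarping"):
        if config.get(key):
            warnings.append(f"{key} is enabled but unnecessary for digital PDFs")

    return warnings


config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
print(check_config(config_optimizada))  # []
```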

Repository Structure

MastersThesis/
├── docs/                    # Thesis chapters in Markdown (UNIR template structure)
│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
│   ├── 01_introduccion.md                 # 1. Introducción (1.1, 1.2, 1.3)
│   ├── 02_contexto_estado_arte.md         # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
│   ├── 03_objetivos_metodologia.md        # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
│   ├── 04_desarrollo_especifico.md        # 4. Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md  # 5. Conclusiones (5.1, 5.2)
│   ├── 06_referencias_bibliograficas.md   # Referencias bibliográficas (APA format)
│   └── 07_anexo_a.md                      # Anexo A: Código fuente y datos
├── thesis_output/           # Generated thesis document
│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
│   └── figures/                           # PNG figures from Mermaid diagrams
│       ├── figura_1.png ... figura_7.png
│       └── figures_manifest.json
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
│   ├── dataset_manager.py                        # ImageTextDataset class
│   ├── prepare_dataset.ipynb                     # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                 # Benchmark results CSVs
├── instructions/            # UNIR instructions and template
│   ├── instrucciones.pdf         # TFE writing guidelines
│   ├── plantilla_individual.pdf  # Word template (PDF version)
│   └── plantilla_individual.htm  # Word template (HTML version, source)
├── apply_content.py         # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py  # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md

docs/ to Template Mapping

The template (plantilla_individual.pdf) requires 5 chapters. The docs/ files now match this structure exactly:

Template Section	docs/ File	Notes
Resumen	`00_resumen.md` (Spanish part)	150-300 words + Palabras clave
Abstract	`00_resumen.md` (English part)	150-300 words + Keywords
1. Introducción	`01_introduccion.md`	Subsections 1.1, 1.2, 1.3
2. Contexto y estado del arte	`02_contexto_estado_arte.md`	Subsections 2.1, 2.2, 2.3 + Mermaid diagrams
3. Objetivos y metodología	`03_objetivos_metodologia.md`	Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams
4. Desarrollo específico	`04_desarrollo_especifico.md`	Subsections 4.1, 4.2, 4.3 + Mermaid charts
5. Conclusiones y trabajo futuro	`05_conclusiones_trabajo_futuro.md`	Subsections 5.1, 5.2
Referencias bibliográficas	`06_referencias_bibliograficas.md`	APA, alphabetical
Anexo A	`07_anexo_a.md`	Repository URL + structure

Important Data Files

Results CSV Files

src/raytune_paddle_subproc_results_20251207_192320.csv - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)

Key Notebooks

src/paddle_ocr_fine_tune_unir_raytune.ipynb - Main Ray Tune experiment
src/prepare_dataset.ipynb - PDF to image/text conversion
ocr_benchmark_notebook.ipynb - EasyOCR vs PaddleOCR vs DocTR comparison

Technical Stack

Component	Version
Python	3.11.9
PaddlePaddle	3.2.2
PaddleOCR	3.3.2
Ray	2.52.1
Optuna	4.6.0

Pending Work

Completed Tasks

Structure docs/ to match UNIR template - All chapters now follow exact numbering (1.1, 1.2, etc.)
Add Mermaid diagrams - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
Generate unified thesis document - apply_content.py generates complete document from docs/
Convert Mermaid to PNG - generate_mermaid_figures.py generates figures automatically
Proper template formatting - Tables/figures use Piedefoto-tabla class, references use MsoBibliography

Priority Tasks

Validate on other document types - Test optimal config on invoices, forms, contracts
Expand dataset - Current dataset has only 24 pages
Create presentation slides - For thesis defense
Final document review - Open in Word, update indices (Ctrl+A, F9), verify formatting

Optional Extensions

Explore text_det_unclip_ratio parameter (was fixed at 0.0)
Compare with actual fine-tuning (if GPU access obtained)
Multi-objective optimization (CER + WER + inference time)

Thesis Document Generation

To regenerate the thesis document:

# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py

# 2. Apply docs/ content to UNIR template
python3 apply_content.py

# 3. Open in Word and finalize
# - Open thesis_output/plantilla_individual.htm in Microsoft Word
# - Press Ctrl+A then F9 to update all indices
# - Save as .docx

What apply_content.py does:

Replaces Resumen and Abstract with actual content + keywords
Replaces all 5 chapters with content from docs/
Replaces Referencias with APA-formatted bibliography
Replaces Anexo with repository information
Converts Mermaid diagrams to embedded PNG images
Formats tables with Piedefoto-tabla captions and sources
Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)

UNIR TFE Document Guidelines

CRITICAL: The thesis MUST follow UNIR's official template (instructions/plantilla_individual.pdf) and guidelines (instructions/instrucciones.pdf).

Work Type Classification

This thesis is a hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones):

Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
Experimental pilot with Ray Tune hyperparameter optimization
64 trials executed, results analyzed statistically

Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

Section	Subsections	Notes
Portada	Title, Author, Type, Director, Date	Use template format exactly
Resumen	150-300 words + 3-5 Palabras clave	Spanish summary
Abstract	150-300 words + 3-5 Keywords	English summary
Índice de contenidos	Auto-generated	New page
Índice de figuras	Auto-generated	New page
Índice de tablas	Auto-generated	New page
1. Introducción	1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo	3-5 pages
2. Contexto y estado del arte	2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones	10-15 pages
3. Objetivos concretos y metodología	3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo	Variable
4. Desarrollo específico	Varies by work type (see below)	Main content
5. Conclusiones y trabajo futuro	5.1 Conclusiones, 5.2 Líneas de trabajo futuro	Variable
Referencias bibliográficas	APA format, alphabetical, hanging indent	Variable
Anexo A	Código fuente y datos analizados	Repository URL

Total length: 50-90 pages (excluding cover, resumen, abstract, indices, annexes)

Chapter-Specific Requirements (from plantilla_individual.pdf)

1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

1.1 Motivación:

Present the problem to solve
Justify importance to educational/scientific community
Answer: What problem? What are the causes? Why is it relevant?
Must include references to prior research

1.2 Planteamiento del trabajo:

Briefly state the problem/need detected
Describe the proposal and purpose
Answer: How to solve? What is proposed?

1.3 Estructura del trabajo:

Briefly describe what each subsequent chapter contains

2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

2.1 Contexto del problema:

Deep study of the application domain

2.2 Estado del arte:

Antecedents, current studies, comparison of existing tools
Must reference key authors in the field (justify exclusions)

2.3 Conclusiones:

Summary linking research to the work to be done
How findings affect the specific development

3. Objetivos concretos y metodología de trabajo

Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

3.1 Objetivo general:

Must be SMART (Doran, 1981)
Focus on achieving an observable effect, not just "create a tool"
Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

3.2 Objetivos específicos:

Divide general objective into analyzable sub-objectives
Must be SMART
Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
Typically ~5 objectives: 1-2 about state of art, 2-3 about development

3.3 Metodología del trabajo:

Describe steps to achieve objectives
Explain WHY each step
What instruments will be used
How results will be analyzed

4. Desarrollo específico de la contribución

Structure depends on work type. Organize by methodology phases/activities.

For Type 1 (Piloto experimental):

4.1 Descripción detallada del experimento
- Technologies used (with justification)
- How pilot was organized
- Participants (demographics)
- Automatic evaluation techniques
- How experiment proceeded
- Monitoring/evaluation instruments
- Statistical analysis types
4.2 Descripción de los resultados (objective, no interpretation)
- Summary tables, result graphs, relevant data identification
4.3 Discusión
- Relevance of results, explanations for anomalies, highlight key findings

For Type 3 (Comparativa de soluciones):

4.1 Planteamiento de la comparativa
- Problem identification, alternative solutions to evaluate
- Success criteria, measures to take
4.2 Desarrollo de la comparativa
- All results and measurements obtained
- Graphs, tables, data visualization
4.3 Discusión y análisis de resultados
- Discussion of meaning, advantages/disadvantages of solutions

5. Conclusiones y trabajo futuro

5.1 Conclusiones:

Summary of problem, approach, and why solution is valid
Summary of contributions
Relate contributions and results to objectives - discuss degree of achievement

5.2 Líneas de trabajo futuro:

Future work that would add value
Justify how contribution can be used and in what fields

SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

Criterion	Requirement	Example from this thesis
Specific	Clearly define what to achieve	"Optimizar PaddleOCR para documentos en español"
Measurable	Quantifiable success metric	"CER < 2%"
Attainable	Feasible with available resources	"Sin GPU, usando optimización de hiperparámetros"
Relevant	Demonstrable impact	"Mejora extracción de texto en documentos académicos"
Time-bound	Achievable in timeframe	"Un cuatrimestre"

Citation and Reference Rules

APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

In-text citations:

Single author: (Du, 2020) or Du (2020)
Two authors: (Du & Li, 2020)
Three+ authors: (Du et al., 2020)
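
The three in-text rules above can be expressed as a short helper. This is an illustrative sketch with a hypothetical function name. The narrative form for two authors uses a conjunction word instead of "&"; defaulting it to the Spanish "y" is an assumption based on the thesis language.

```python
def apa_in_text(surnames: list[str], year: int, narrative: bool = False,
                conjunction: str = "y") -> str:
    """Format an APA in-text citation from first-author-ordered surnames."""
    if not surnames:
        raise ValueError("at least one author surname is required")

    if len(surnames) == 1:
        who = surnames[0]
    elif len(surnames) == 2:
        # "&" in parenthetical citations, a conjunction word in narrative ones.
        joiner = conjunction if narrative else "&"
        who = f"{surnames[0]} {joiner} {surnames[1]}"
    else:
        who = f"{surnames[0]} et al."

    return f"{who} ({year})" if narrative else f"({who}, {year})"


print(apa_in_text(["Du"], 2020))                  # (Du, 2020)
print(apa_in_text(["Du"], 2020, narrative=True))  # Du (2020)
print(apa_in_text(["Du", "Li"], 2020))            # (Du & Li, 2020)
print(apa_in_text(["Du", "Li", "Guo"], 2020))     # (Du et al., 2020)
```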

Reference list examples:

# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
  for image-based sequence recognition. IEEE Transactions on Pattern
  Analysis and Machine Intelligence, 39(11), 2298-2304.
  https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
  A next-generation hyperparameter optimization framework. Proceedings
  of the 25th ACM SIGKDD, 2623-2631.
  https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
  lightweight OCR system. arXiv preprint arXiv:2009.09941.
  https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
  on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
  (2nd ed.). Lawrence Erlbaum Associates.

Reference Rules

NO Wikipedia citations
Include variety: books, conferences, journal articles (not just URLs)
All cited references must appear in reference list
All references in list must be cited in text
Order alphabetically by first author's surname
Include DOI or URL when available
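
Three of these rules (cited implies listed, listed implies cited, alphabetical order) lend themselves to a mechanical check. The sketch below assumes each reference has been reduced to a key such as "Akiba 2019"; the key format and the function name are hypothetical, and how the keys are extracted from docs/ is left open. The ordering test uses plain case-insensitive comparison, which is not locale-aware for accented surnames.

```python
def cross_check(cited: set[str], listed: list[str]) -> dict[str, list[str]]:
    """Compare in-text citation keys against the ordered reference list."""
    listed_set = set(listed)
    return {
        # Cited in the text but missing from the reference list.
        "cited_but_not_listed": sorted(cited - listed_set),
        # Present in the reference list but never cited.
        "listed_but_not_cited": sorted(listed_set - cited),
        # Entries that sort before their predecessor in the list.
        "out_of_order": [
            current
            for previous, current in zip(listed, listed[1:])
            if current.casefold() < previous.casefold()
        ],
    }


report = cross_check(
    cited={"Du 2020", "Shi 2016"},
    listed=["Akiba 2019", "Shi 2016", "Du 2020"],
)
print(report["listed_but_not_cited"])  # ['Akiba 2019']
print(report["out_of_order"])          # ['Du 2020']
```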

Document Formatting Rules

Page Setup

Element	Specification
Page size	A4
Left margin	3.0 cm
Right margin	2.0 cm
Top/Bottom margins	2.5 cm
Header	Student name + TFE title
Footer	Page number

Typography

Element	Format
Body text	Calibri 12, justified, 1.5 line spacing, 6pt before/after
Título 1	Calibri Light 18, blue, justified, 1.5 spacing
Título 2	Calibri Light 14, blue, justified, 1.5 spacing
Título 3	Calibri Light 12, justified, 1.5 spacing
Footnotes	Calibri 10, justified, single spacing
Code	Can reduce to 9pt if needed

Tables and Figures (from plantilla_individual.pdf)

Table format example:

Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.

Figure format example:

Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.

Rules:

Title position: Above the table/figure
Numbering format: "Tabla 1." / "Figura 1." (Calibri 12, bold)
Title text: Calibri 12, italic (after the number)
Source: Below, centered, format "Fuente: Author, Year."
Can reduce font to 9pt for dense tables
Can use landscape orientation for large tables
Tables should have horizontal lines only (no vertical lines) per APA style

Writing Style Rules

MUST DO:

Each chapter starts with introductory paragraph explaining content
Each paragraph has at least 3 sentences
Verify originality (cite all sources)
Check spelling with Word corrector
Ensure logical flow between paragraphs
Define concepts and include pertinent citations

MUST NOT DO:

Two consecutive headings without text between them
Superfluous phrases and repetition of ideas
Short paragraphs (less than 3 sentences)
Missing figure/table numbers or titles
Broken index generation
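
The "two consecutive headings" rule can be checked automatically on the Markdown chapters. The following is a rough sketch under two assumptions: the files in docs/ use ATX headings (lines starting with "#"), and any non-blank line or code fence between two headings counts as text. The function name is hypothetical.

```python
def consecutive_headings(markdown_text: str) -> list[tuple[str, str]]:
    """Return pairs of headings that have no content between them."""
    problems = []
    previous_heading = None
    in_fence = False

    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            # A code fence counts as content after a heading.
            in_fence = not in_fence
            previous_heading = None
            continue
        if in_fence or not stripped:
            continue  # ignore fenced lines and blank lines
        if stripped.startswith("#"):
            if previous_heading is not None:
                problems.append((previous_heading, stripped))
            previous_heading = stripped
        else:
            previous_heading = None

    return problems


sample = "# 4. Desarrollo\n\n## 4.1 Experimento\n\nTexto.\n\n## 4.2 Resultados\n"
print(consecutive_headings(sample))  # [('# 4. Desarrollo', '## 4.1 Experimento')]
```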

Annexes Requirements

Anexo A - Código fuente y datos:

Include repository URL where code is hosted
Student must be sole author and owner of repository
No commits from other users
Data used should also be in repository
If confidential (company project), justify why not shared

Final Submission

Drafts: Submit in Word format
Final deposit: Submit in PDF format
Verify all indices generate correctly before final submission

Guidelines for Claude

CRITICAL: Academic Rigor Requirements

This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.

DO NOT:

NEVER fabricate data or statistics - Every number must come from an actual file in this repository
NEVER invent comparison results - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
NEVER assume or estimate values - If a metric isn't in the CSV/notebook, don't include it
NEVER extrapolate beyond what the data shows - 24 pages is a limited dataset; acknowledge this
NEVER claim results that weren't measured - Only report what was actually computed

ALWAYS:

Read the source file first before citing any result
Quote exact values from CSV files (e.g., CER 0.011535 not "approximately 1%")
Reference the specific file and location for every data point
Acknowledge limitations explicitly (dataset size, CPU-only, single document type)
Distinguish between measured results and interpretations

Data Sources (ONLY use these):

Data Type	Source File
Ray Tune 64 trials	`src/raytune_paddle_subproc_results_20251207_192320.csv`
Experiment code	`src/paddle_ocr_fine_tune_unir_raytune.ipynb`
Final comparison	Output cells in the notebook (baseline vs optimized)

Example of WRONG vs RIGHT:

WRONG: "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER" (We don't have this comparison data in our results files)

RIGHT: "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in paddle_ocr_fine_tune_unir_raytune.ipynb)"

WRONG: "The optimization improved results by approximately 80%"

RIGHT: "From the 64 trials in raytune_paddle_subproc_results_20251207_192320.csv, minimum CER achieved was 1.15%"

When Working on Documentation

Read UNIR guidelines first: Check instructions/instrucciones.pdf for structure requirements
Follow chapter structure: Each chapter has specific content requirements per UNIR guidelines
References are UNIFIED: All references go in docs/06_referencias_bibliograficas.md, NOT per-chapter
Use APA format: All citations must follow APA style
Include "Fuentes de datos": Each chapter should list which repository files the data came from
Language: Documentation is in Spanish (thesis requirement), code comments in English
Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation
When in doubt, ask: If the user requests data that doesn't exist, ask rather than inventing numbers
DIAGRAMS MUST BE IN MERMAID FORMAT: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:
- Version control friendly (text-based)
- Consistent styling across all chapters
- Easy to edit and maintain
- Renders properly in GitHub and most Markdown viewers
Supported Mermaid diagram types:
- flowchart / graph - For pipelines, workflows, architectures
- xychart-beta - For bar charts, comparisons
- sequenceDiagram - For process interactions
- classDiagram - For class structures
- stateDiagram - For state machines
- pie - For proportional data
Example:
```mermaid
flowchart LR
    A[Input] --> B[Process] --> C[Output]
```

Common Tasks

Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
Updating documentation: Edit files in docs/
Adding references: Add to docs/06_referencias_bibliograficas.md (unified list)
Dataset expansion: Use src/prepare_dataset.ipynb as template
Running evaluations: Use src/paddle_ocr_tuning.py CLI

Experiment Details

Ray Tune Configuration

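from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is defined elsewhere
# (see src/paddle_ocr_fine_tune_unir_raytune.ipynb).
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",               # optimize the character error rate
        mode="min",                 # lower CER is better
        search_alg=OptunaSearch(),  # Optuna proposes each trial's configuration
        num_samples=64,             # total number of trials
        max_concurrent_trials=2,    # at most two trials run in parallel
    ),
)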

Dataset

Source: UNIR TFE instructions PDF
Pages: 24
Resolution: 300 DPI
Ground truth: Extracted via PyMuPDF
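
For reference, the 300 DPI setting determines both the scaling factor applied when rasterizing a PDF page and the resulting image size. PDF page coordinates are expressed in points (1/72 inch), so the scale is dpi / 72. The sketch below assumes A4 pages (210 x 297 mm), which is not stated for the dataset itself; the function names are illustrative.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch
MM_PER_INCH = 25.4


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor from PDF points to pixels at the requested DPI."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_mm: float, height_mm: float, dpi: float) -> tuple[int, int]:
    """Rendered image size in pixels for a page of the given physical size."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


print(round(zoom_for_dpi(300), 4))  # 4.1667
print(pixel_size(210, 297, 300))    # (2480, 3508), assuming A4 pages
```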

Metrics

CER (Character Error Rate) - Primary metric
WER (Word Error Rate) - Secondary metric
Calculated using the jiwer library
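
To make the two metrics concrete: CER is the character-level edit distance (substitutions + deletions + insertions) divided by the number of reference characters, and WER is the same ratio computed over words. The pure-Python sketch below illustrates the definitions only. It is not the project's implementation, which relies on jiwer, and jiwer's results may differ depending on the text normalization applied.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abcd", "abxd"))  # 0.25
```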
