
Claude Code Context - Masters Thesis OCR Project

Project Overview

This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.

Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025

Key Context

Why Hyperparameter Optimization Instead of Fine-tuning

The project chose hyperparameter optimization over fine-tuning because:

Fine-tuning requires extensive labeled datasets specific to the domain
Hyperparameter tuning can improve pretrained models without retraining
GPU acceleration (RTX 3060) enables efficient exploration of hyperparameter space

Main Results (GPU - Jan 2026)

| Model | CER | Character Accuracy |
| --- | --- | --- |
| PaddleOCR Baseline | 8.85% | 91.15% |
| PaddleOCR-HyperAdjust (full dataset) | 7.72% | 92.28% |
| PaddleOCR-HyperAdjust (best trial) | 0.79% | 99.21% |

Goal status: the CER < 2% target was met only in the best trial (0.79%). On the full dataset, CER fell from 8.85% to 7.72%, a 12.8% relative improvement.
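The 12.8% figure is a relative reduction in CER, not a difference in percentage points. A minimal sketch of the arithmetic, using only the values from the results table (the function names are illustrative and not part of the repository):

```python
def relative_cer_reduction(baseline_cer: float, tuned_cer: float) -> float:
    """Relative CER reduction in percent: (baseline - tuned) / baseline * 100."""
    return (baseline_cer - tuned_cer) / baseline_cer * 100.0


def character_accuracy(cer: float) -> float:
    """Character accuracy as used in the table: 100 - CER, both in percent."""
    return 100.0 - cer


# Values copied from the results table (percent).
baseline, tuned_full = 8.85, 7.72
print(f"relative reduction: {relative_cer_reduction(baseline, tuned_full):.1f}%")
print(f"absolute reduction: {baseline - tuned_full:.2f} percentage points")
print(f"tuned accuracy:     {character_accuracy(tuned_full):.2f}%")
```

Reporting both the relative and the absolute reduction avoids the common misreading of "12.8% improvement" as 12.8 percentage points.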

Optimal Configuration Found (GPU)

config_optimizada = {
    "textline_orientation": True,           # CRITICAL for complex layouts
    "use_doc_orientation_classify": True,   # Improves document orientation
    "use_doc_unwarping": False,
    "text_det_thresh": 0.0462,              # -0.52 correlation with CER
    "text_det_box_thresh": 0.4862,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.5658,
}

Key Findings

textline_orientation=True is critical for documents with mixed layouts
use_doc_orientation_classify=True improves document orientation detection in GPU config
text_det_thresh has -0.52 correlation with CER; values < 0.01 cause catastrophic failures
use_doc_unwarping=False is optimal for digital PDFs (unnecessary processing)
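A correlation such as the -0.52 reported for text_det_thresh can be recomputed from a trial results CSV. The sketch below is only an illustration under stated assumptions: it assumes a Pearson coefficient (the text does not say which coefficient was used) and assumes column names `config/text_det_thresh` and `CER`, which should be checked against the actual CSV header. The rows in the example are made up and are not thesis results.

```python
import csv
import io
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def param_cer_correlation(csv_text: str, param_col: str, cer_col: str = "CER") -> float:
    """Correlation between one hyperparameter column and the CER column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    xs = [float(row[param_col]) for row in rows]
    ys = [float(row[cer_col]) for row in rows]
    return pearson(xs, ys)


# Made-up rows, only to show the call shape. They are NOT thesis results.
toy_csv = """config/text_det_thresh,CER
0.10,0.40
0.20,0.30
0.30,0.20
0.40,0.10
"""
print(param_cer_correlation(toy_csv, "config/text_det_thresh"))
```

A negative coefficient means that higher thresholds tended to go with lower CER across the trials; it says nothing about causation and is sensitive to outliers such as failed trials.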

Repository Structure

MastersThesis/
├── docs/                    # Thesis chapters in Markdown (UNIR template structure)
│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
│   ├── 01_introduccion.md                 # 1. Introducción (1.1, 1.2, 1.3)
│   ├── 02_contexto_estado_arte.md         # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
│   ├── 03_objetivos_metodologia.md        # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
│   ├── 04_desarrollo_especifico.md        # 4. Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md  # 5. Conclusiones (5.1, 5.2)
│   ├── 06_referencias_bibliograficas.md   # Referencias bibliográficas (APA format)
│   └── 07_anexo_a.md                      # Anexo A: Código fuente y datos
├── thesis_output/           # Generated thesis document
│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
│   └── figures/                           # PNG figures from Mermaid diagrams
│       ├── figura_1.png ... figura_7.png
│       └── figures_manifest.json
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                      # CLI evaluation script
│   ├── dataset_manager.py                        # ImageTextDataset class
│   ├── prepare_dataset.ipynb                     # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                 # Benchmark results CSVs
├── instructions/            # UNIR instructions and template
│   ├── instrucciones.pdf         # TFE writing guidelines
│   ├── plantilla_individual.pdf  # Word template (PDF version)
│   └── plantilla_individual.htm  # Word template (HTML version, source)
├── apply_content.py         # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py  # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb  # Initial OCR benchmark
└── README.md

docs/ to Template Mapping

The template (plantilla_individual.pdf) requires 5 chapters. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
| --- | --- | --- |
| Resumen | `00_resumen.md` (Spanish part) | 150-300 words + Palabras clave |
| Abstract | `00_resumen.md` (English part) | 150-300 words + Keywords |
| 1. Introducción | `01_introduccion.md` | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | `02_contexto_estado_arte.md` | Subsections 2.1, 2.2, 2.3 + Mermaid diagrams |
| 3. Objetivos y metodología | `03_objetivos_metodologia.md` | Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams |
| 4. Desarrollo específico | `04_desarrollo_especifico.md` | Subsections 4.1, 4.2, 4.3 + Mermaid charts |
| 5. Conclusiones y trabajo futuro | `05_conclusiones_trabajo_futuro.md` | Subsections 5.1, 5.2 |
| Referencias bibliográficas | `06_referencias_bibliograficas.md` | APA, alphabetical |
| Anexo A | `07_anexo_a.md` | Repository URL + structure |

Important Data Files

Results CSV Files (GPU - PRIMARY)

src/results/raytune_paddle_results_20260119_122609.csv - 64 Ray Tune trials PaddleOCR GPU (PRIMARY)
src/results/raytune_easyocr_results_20260119_120204.csv - 64 Ray Tune trials EasyOCR GPU
src/results/raytune_doctr_results_20260119_121445.csv - 64 Ray Tune trials DocTR GPU

Results CSV Files (CPU - time reference only)

src/raytune_paddle_subproc_results_20251207_192320.csv - CPU execution for time comparison (69.4s/page vs 0.84s/page GPU)
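The two per-page timings quoted for the CPU and GPU runs imply a speed-up factor, which is a simple ratio. A small sketch of that arithmetic (the function name is illustrative):

```python
def speedup(cpu_seconds_per_page: float, gpu_seconds_per_page: float) -> float:
    """How many times faster the GPU run is per page."""
    return cpu_seconds_per_page / gpu_seconds_per_page


# Seconds per page, copied from the CPU reference entry.
cpu_s, gpu_s = 69.4, 0.84
print(f"GPU speed-up: {speedup(cpu_s, gpu_s):.1f}x")
```

The ratio only compares the two measured per-page averages; it should not be extended to other hardware or document types.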

Key Scripts

src/run_tuning.py - Main Ray Tune optimization script
src/raytune/raytune_ocr.py - Ray Tune utilities and search spaces
src/paddle_ocr/paddle_ocr_tuning_rest.py - PaddleOCR REST API

Technical Stack

| Component | Version |
| --- | --- |
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

Pending Work

Completed Tasks

Structure docs/ to match UNIR template - All chapters now follow exact numbering (1.1, 1.2, etc.)
Add Mermaid diagrams - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
Generate unified thesis document - apply_content.py generates complete document from docs/
Convert Mermaid to PNG - generate_mermaid_figures.py generates figures automatically
Proper template formatting - Tables/figures use Piedefoto-tabla class, references use MsoBibliography

Priority Tasks

Validate on other document types - Test optimal config on invoices, forms, contracts
Use larger tuning subset - Current 5 pages caused overfitting; recommend 15-20 pages
Create presentation slides - For thesis defense
Final document review - Open in Word, update indices (Ctrl+A, F9), verify formatting

Optional Extensions

Explore text_det_unclip_ratio parameter (was fixed at 0.0)
Compare with actual fine-tuning
Multi-objective optimization (CER + WER + inference time)

Thesis Document Generation

To regenerate the thesis document:

# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py

# 2. Apply docs/ content to UNIR template
python3 apply_content.py

# 3. Open in Word and finalize
# - Open thesis_output/plantilla_individual.htm in Microsoft Word
# - Press Ctrl+A then F9 to update all indices
# - Save as .docx

What apply_content.py does:

Replaces Resumen and Abstract with actual content + keywords
Replaces all 5 chapters with content from docs/
Replaces Referencias with APA-formatted bibliography
Replaces Anexo with repository information
Converts Mermaid diagrams to embedded PNG images
Formats tables with Piedefoto-tabla captions and sources
Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)
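The Mermaid-to-PNG step needs to locate the diagram sources inside the Markdown chapters first. The following is a hypothetical sketch of that extraction step only, not the code in generate_mermaid_figures.py or apply_content.py. It assumes diagrams are written as fenced blocks whose opening fence is tagged `mermaid`; the names `extract_mermaid_blocks` and `MERMAID_BLOCK` are invented for this example.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# A fence tagged "mermaid", then the lazily matched body, then a closing fence
# that starts its own line.
MERMAID_BLOCK = re.compile(
    rf"^{FENCE}mermaid[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_mermaid_blocks(markdown: str) -> list[str]:
    """Return the source of every mermaid-tagged fenced block, in document order."""
    return [match.group(1).strip() for match in MERMAID_BLOCK.finditer(markdown)]


sample = (
    "Intro text\n\n"
    f"{FENCE}mermaid\n"
    "flowchart LR\n"
    "    A[Input] --> B[Process] --> C[Output]\n"
    f"{FENCE}\n\n"
    "Closing text\n"
)
for number, source in enumerate(extract_mermaid_blocks(sample), start=1):
    print(f"figura_{number}:")
    print(source)
```

Each extracted source string could then be handed to whatever renderer produces the PNG; that rendering step is deliberately left out because the tool used is not stated here.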

UNIR TFE Document Guidelines

CRITICAL: The thesis MUST follow UNIR's official template (instructions/plantilla_individual.pdf) and guidelines (instructions/instrucciones.pdf).

Work Type Classification

This thesis is a hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones):

Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
Experimental pilot with Ray Tune hyperparameter optimization
64 trials executed, results analyzed statistically

Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
| --- | --- | --- |
| Portada | Title, Author, Type, Director, Date | Use template format exactly |
| Resumen | 150-300 words + 3-5 Palabras clave | Spanish summary |
| Abstract | 150-300 words + 3-5 Keywords | English summary |
| Índice de contenidos | Auto-generated | New page |
| Índice de figuras | Auto-generated | New page |
| Índice de tablas | Auto-generated | New page |
| 1. Introducción | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| 2. Contexto y estado del arte | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| 3. Objetivos concretos y metodología | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| 4. Desarrollo específico | Varies by work type (see below) | Main content |
| 5. Conclusiones y trabajo futuro | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| Referencias bibliográficas | APA format, alphabetical, hanging indent | Variable |
| Anexo A | Código fuente y datos analizados | Repository URL |

Total length: 50-90 pages (excluding cover, resumen, abstract, indices, annexes)
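The Resumen and Abstract limits (150-300 words, 3-5 keywords) are easy to check mechanically before generating the document. A minimal sketch, assuming words are counted by splitting on whitespace, which may differ slightly from the count Word reports; `check_summary` is a hypothetical helper, not an existing script in the repository.

```python
def check_summary(text: str, keywords: list[str]) -> list[str]:
    """Return a list of problems; an empty list means both limits are met."""
    problems = []
    n_words = len(text.split())
    if not 150 <= n_words <= 300:
        problems.append(f"summary has {n_words} words, expected 150-300")
    if not 3 <= len(keywords) <= 5:
        problems.append(f"{len(keywords)} keywords given, expected 3-5")
    return problems


# Placeholder text only, to show the call shape.
draft = " ".join(["palabra"] * 120)
for problem in check_summary(draft, ["OCR", "Ray Tune"]):
    print(problem)
```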

Chapter-Specific Requirements (from plantilla_individual.pdf)

1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

1.1 Motivación:

Present the problem to solve
Justify importance to educational/scientific community
Answer: What problem? What are the causes? Why is it relevant?
Must include references to prior research

1.2 Planteamiento del trabajo:

Briefly state the problem/need detected
Describe the proposal and purpose
Answer: How to solve? What is proposed?

1.3 Estructura del trabajo:

Briefly describe what each subsequent chapter contains

2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

2.1 Contexto del problema:

Deep study of the application domain

2.2 Estado del arte:

Antecedents, current studies, comparison of existing tools
Must reference key authors in the field (justify exclusions)

2.3 Conclusiones:

Summary linking research to the work to be done
How findings affect the specific development

3. Objetivos concretos y metodología de trabajo

Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

3.1 Objetivo general:

Must be SMART (Doran, 1981)
Focus on achieving an observable effect, not just "create a tool"
Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

3.2 Objetivos específicos:

Divide general objective into analyzable sub-objectives
Must be SMART
Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
Typically ~5 objectives: 1-2 about state of art, 2-3 about development

3.3 Metodología del trabajo:

Describe steps to achieve objectives
Explain WHY each step
What instruments will be used
How results will be analyzed

4. Desarrollo específico de la contribución

Structure depends on work type. Organize by methodology phases/activities.

For Type 1 (Piloto experimental):

4.1 Descripción detallada del experimento
- Technologies used (with justification)
- How pilot was organized
- Participants (demographics)
- Automatic evaluation techniques
- How experiment proceeded
- Monitoring/evaluation instruments
- Statistical analysis types
4.2 Descripción de los resultados (objective, no interpretation)
- Summary tables, result graphs, relevant data identification
4.3 Discusión
- Relevance of results, explanations for anomalies, highlight key findings

For Type 3 (Comparativa de soluciones):

4.1 Planteamiento de la comparativa
- Problem identification, alternative solutions to evaluate
- Success criteria, measures to take
4.2 Desarrollo de la comparativa
- All results and measurements obtained
- Graphs, tables, data visualization
4.3 Discusión y análisis de resultados
- Discussion of meaning, advantages/disadvantages of solutions

5. Conclusiones y trabajo futuro

5.1 Conclusiones:

Summary of problem, approach, and why solution is valid
Summary of contributions
Relate contributions and results to objectives - discuss degree of achievement

5.2 Líneas de trabajo futuro:

Future work that would add value
Justify how contribution can be used and in what fields

SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

| Criterion | Requirement | Example from this thesis |
| --- | --- | --- |
| Specific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| Measurable | Quantifiable success metric | "CER < 2%" |
| Attainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| Relevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| Time-bound | Achievable in timeframe | "Un cuatrimestre" |

Citation and Reference Rules

APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

In-text citations:

Single author: (Du, 2020) or Du (2020)
Two authors: (Du & Li, 2020)
Three+ authors: (Du et al., 2020)

Reference list examples:

# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
  for image-based sequence recognition. IEEE Transactions on Pattern
  Analysis and Machine Intelligence, 39(11), 2298-2304.
  https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
  A next-generation hyperparameter optimization framework. Proceedings
  of the 25th ACM SIGKDD, 2623-2631.
  https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
  lightweight OCR system. arXiv preprint arXiv:2009.09941.
  https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
  on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
  (2nd ed.). Lawrence Erlbaum Associates.

Reference Rules

NO Wikipedia citations
Include variety: books, conferences, journal articles (not just URLs)
All cited references must appear in reference list
All references in list must be cited in text
Order alphabetically by first author's surname
Include DOI or URL when available
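Two of these rules, alphabetical order and the one-to-one match between cited and listed works, can be checked with a short script. The sketch below is an assumption-laden illustration: it treats everything before the first comma, period, or parenthesis of an entry as the first author's surname, and it expects the caller to supply the cited and listed keys (for example "Du 2020") already extracted. The helper names are invented for this example.

```python
import re
import unicodedata


def sort_key(entry: str) -> str:
    """First author's surname, lower-cased and accent-stripped, for ordering."""
    surname = re.split(r"[,.(]", entry, maxsplit=1)[0].strip().lower()
    decomposed = unicodedata.normalize("NFKD", surname)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_alphabetical(entries: list[str]) -> bool:
    """True when the entries are ordered by first author's surname."""
    keys = [sort_key(entry) for entry in entries]
    return keys == sorted(keys)


def citation_mismatches(cited: set[str], listed: set[str]) -> tuple[set[str], set[str]]:
    """Return (cited but not listed, listed but never cited)."""
    return cited - listed, listed - cited


entries = [
    "Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: ...",
    "Cohen, J. (1988). Statistical power analysis for the behavioral sciences ...",
    "Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: ...",
    "PaddlePaddle. (2024). PaddleOCR: ...",
    "Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network ...",
]
print(is_alphabetical(entries))
print(citation_mismatches({"Du 2020", "Smith 2021"}, {"Du 2020", "Cohen 1988"}))
```

Stripping accents before comparing keeps surnames such as "Álvarez" next to "Alvarez" instead of after "Z", which is how a plain code-point sort would place them.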

Document Formatting Rules

Page Setup

| Element | Specification |
| --- | --- |
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |

Typography

| Element | Format |
| --- | --- |
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |
Code	Can reduce to 9pt if needed

Tables and Figures (from plantilla_individual.pdf)

Table format example:

Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.

Figure format example:

Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.

Rules:

Title position: Above the table/figure
Numbering format: "Tabla 1." / "Figura 1." (Calibri 12, bold)
Title text: Calibri 12, italic (after the number)
Source: Below, centered, format "Fuente: Author, Year."
Can reduce font to 9pt for dense tables
Can use landscape orientation for large tables
Tables should have horizontal lines only (no vertical lines) per APA style

Writing Style Rules

MUST DO:

Each chapter starts with introductory paragraph explaining content
Each paragraph has at least 3 sentences
Verify originality (cite all sources)
Check spelling with Word corrector
Ensure logical flow between paragraphs
Define concepts and include pertinent citations

MUST NOT DO:

Two consecutive headings without text between them
Superfluous phrases and repetition of ideas
Short paragraphs (less than 3 sentences)
Missing figure/table numbers or titles
Broken index generation
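The "two consecutive headings" rule can be linted in the Markdown sources before the document is generated. A minimal sketch, assuming the chapters in docs/ use ATX headings (lines starting with one to six `#` characters); `consecutive_headings` is a hypothetical helper. It does not skip fenced code blocks, so a `# comment` line inside a fence would be misread as a heading.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def consecutive_headings(markdown: str) -> list[tuple[str, str]]:
    """Return pairs of headings that have no body text between them."""
    pairs = []
    previous_heading = None
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not count as text
        if HEADING.match(stripped):
            if previous_heading is not None:
                pairs.append((previous_heading, stripped))
            previous_heading = stripped
        else:
            previous_heading = None  # body text resets the check
    return pairs


chapter = "# 1. Introducción\n\n## 1.1 Motivación\n\nTexto.\n\n## 1.2 Planteamiento\n"
for upper, lower in consecutive_headings(chapter):
    print(f"no text between {upper!r} and {lower!r}")
```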

Annexes Requirements

Anexo A - Código fuente y datos:

Include repository URL where code is hosted
Student must be sole author and owner of repository
No commits from other users
Data used should also be in repository
If confidential (company project), justify why not shared

Final Submission

Drafts: Submit in Word format
Final deposit: Submit in PDF format
Verify all indices generate correctly before final submission

Guidelines for Claude

CRITICAL: Academic Rigor Requirements

This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.

DO NOT:

NEVER fabricate data or statistics - Every number must come from an actual file in this repository
NEVER invent comparison results - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
NEVER assume or estimate values - If a metric isn't in the CSV/notebook, don't include it
NEVER extrapolate beyond what the data shows - 24 pages is a limited dataset, acknowledge this
NEVER claim results that weren't measured - Only report what was actually computed

ALWAYS:

Read the source file first before citing any result
Quote exact values from CSV files (e.g., CER 0.011535 not "approximately 1%")
Reference the specific file and location for every data point
Acknowledge limitations explicitly (dataset size, CPU-only, single document type)
Distinguish between measured results and interpretations
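Quoting exact values is easiest when the number is read from the CSV and never retyped. The sketch below assumes the metric column is named `CER`, matching the metric name in the Ray Tune configuration; the real header should be checked first. `best_trial` is a hypothetical helper, not an existing function in the repository.

```python
import csv
from pathlib import Path


def best_trial(csv_path: Path, metric_col: str = "CER") -> dict[str, str]:
    """Return the row with the lowest metric value, with all values left as strings."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"{csv_path} contains no trials")
    # Compare numerically, but return the original strings so that the quoted
    # value matches the file character for character.
    return min(rows, key=lambda row: float(row[metric_col]))
```

For example, calling `best_trial(Path("src/results/raytune_paddle_results_20260119_122609.csv"))` and citing `row["CER"]` together with the file name gives a value that can be traced back to its source without rounding.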

Data Sources (ONLY use these):

| Data Type | Source File |
| --- | --- |
| Ray Tune 64 trials | `src/raytune_paddle_subproc_results_20251207_192320.csv` |
| Experiment code | `src/paddle_ocr_fine_tune_unir_raytune.ipynb` |
| Final comparison | Output cells in the notebook (baseline vs optimized) |

Example of WRONG vs RIGHT:

WRONG: "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER" (We don't have this comparison data in our results files)

RIGHT: "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in paddle_ocr_fine_tune_unir_raytune.ipynb)"

WRONG: "The optimization improved results by approximately 80%"

RIGHT: "From the 64 trials in raytune_paddle_subproc_results_20251207_192320.csv, minimum CER achieved was 1.15%"

When Working on Documentation

Read UNIR guidelines first: Check instructions/instrucciones.pdf for structure requirements
Follow chapter structure: Each chapter has specific content requirements per UNIR guidelines
References are UNIFIED: All references go in docs/06_referencias_bibliograficas.md, NOT per-chapter
Use APA format: All citations must follow APA style
Include "Fuentes de datos": Each chapter should list which repository files the data came from
Language: Documentation is in Spanish (thesis requirement), code comments in English
Hardware context: The primary results come from GPU runs (RTX 3060); the CPU run is kept only as a time reference. State which hardware produced any number you cite, and do not mix CPU and GPU figures in the same comparison
When in doubt, ask: If the user requests data that doesn't exist, ask rather than inventing numbers
DIAGRAMS MUST BE IN MERMAID FORMAT: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:
- Version control friendly (text-based)
- Consistent styling across all chapters
- Easy to edit and maintain
- Renders properly in GitHub and most Markdown viewers
Supported Mermaid diagram types:
- flowchart / graph - For pipelines, workflows, architectures
- xychart-beta - For bar charts, comparisons
- sequenceDiagram - For process interactions
- classDiagram - For class structures
- stateDiagram - For state machines
- pie - For proportional data
Example:
```
flowchart LR
    A[Input] --> B[Process] --> C[Output]
```

Common Tasks

Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
Updating documentation: Edit files in docs/
Adding references: Add to docs/06_referencias_bibliograficas.md (unified list)
Dataset expansion: Use src/prepare_dataset.ipynb as template
Running evaluations: Use src/paddle_ocr_tuning.py CLI

Experiment Details

Ray Tune Configuration

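from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is the trial function, defined elsewhere in the project.
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",               # objective reported by each trial
        mode="min",                 # lower CER is better
        search_alg=OptunaSearch(),  # takes metric and mode from TuneConfig
        num_samples=64,             # 64 trials
        max_concurrent_trials=2,
    ),
)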

Dataset

Source: UNIR TFE instructions PDF
Pages: 24
Resolution: 300 DPI
Ground truth: Extracted via PyMuPDF

Metrics

CER (Character Error Rate) - Primary metric
WER (Word Error Rate) - Secondary metric
Calculated using jiwer library
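Both metrics are edit distances normalised by the length of the reference. The project computes them with jiwer; the dependency-free sketch below only illustrates the definition and is not the project's code. Results may differ from jiwer's where it applies its own text normalisation.

```python
from collections.abc import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions and insertions."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("universidad", "univcrsidad"))                 # one substitution in 11 characters
print(wer("trabajo fin de máster", "trabajo fin máster"))  # one deletion in 4 words
```

Because the denominator is the reference length, both rates can exceed 1.0 when the OCR output contains many insertions, and both are undefined for an empty reference.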
