Claude Code Context - Masters Thesis OCR Project
Project Overview
This is a Master's Thesis (TFM) for UNIR's Master in Artificial Intelligence. The project focuses on OCR hyperparameter optimization using Ray Tune with Optuna for Spanish academic documents.
Author: Sergio Jiménez Jiménez
University: UNIR (Universidad Internacional de La Rioja)
Year: 2025
Key Context
Why Hyperparameter Optimization Instead of Fine-tuning
Due to hardware limitations (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:
- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction
Main Results
| Model | CER | Character Accuracy |
|---|---|---|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | 1.49% | 98.51% |
Goal achieved: the target was CER < 2% and the result is 1.49%.
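The reduction figure follows from the two CER values in the table. The arithmetic can be sketched as below, taking the rounded table values as inputs; the function names are illustrative and not taken from the repository. With rounded inputs the result is about 80.85%, so the reported 80.9% presumably comes from unrounded CER values in the notebook (an assumption to verify against the source before quoting).

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of `optimized` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


def character_accuracy(cer_percent: float) -> float:
    """Character accuracy as the complement of CER, both in percent."""
    return 100.0 - cer_percent


# Rounded values from the Main Results table (percent).
baseline_cer = 7.78
optimized_cer = 1.49

reduction = relative_reduction(baseline_cer, optimized_cer)
print(f"CER reduction: {reduction:.2f}%")
print(f"Baseline accuracy: {character_accuracy(baseline_cer):.2f}%")
print(f"Optimized accuracy: {character_accuracy(optimized_cer):.2f}%")
```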
Optimal Configuration Found
config_optimizada = {
    "textline_orientation": True,  # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
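One way to apply this configuration is sketched below. It assumes the PaddleOCR 3.x constructor accepts these names as keyword arguments and that the line-orientation switch is spelled `use_textline_orientation` there; the helpers `to_paddleocr_kwargs` and `build_ocr`, and the `lang="es"` default, are hypothetical and not taken from the repository. Check the keyword names against the installed version before relying on them.

```python
from typing import Any

config_optimizada = {
    "textline_orientation": True,
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}

# Assumption: tuning-config key -> keyword expected by the PaddleOCR 3.x constructor.
KEY_RENAMES = {"textline_orientation": "use_textline_orientation"}


def to_paddleocr_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Rename config keys so they can be passed as constructor keywords."""
    return {KEY_RENAMES.get(key, key): value for key, value in config.items()}


def build_ocr(config: dict[str, Any], lang: str = "es"):
    """Create a PaddleOCR pipeline from a tuning config (hypothetical helper)."""
    from paddleocr import PaddleOCR  # imported lazily; needs paddleocr installed

    return PaddleOCR(lang=lang, **to_paddleocr_kwargs(config))


kwargs = to_paddleocr_kwargs(config_optimizada)
print(sorted(kwargs))
```

Keeping the renaming in one small function means the optimizer can keep its own key names while the constructor call stays a plain `**kwargs` expansion.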
Key Findings
- textline_orientation=True is the most impactful parameter (reduces CER by 69.7%)
- text_det_thresh has a -0.52 correlation with CER; values < 0.1 cause catastrophic failures
- Document correction modules (use_doc_orientation_classify, use_doc_unwarping) are unnecessary for digital PDFs
Repository Structure
MastersThesis/
├── docs/ # Thesis chapters in Markdown (matches template structure)
│ ├── 00_resumen.md # Resumen + Abstract
│ ├── 01_introduccion.md # Chapter 1: Introducción
│ ├── 02_contexto_estado_arte.md # Chapter 2: Contexto y estado del arte
│ ├── 03_objetivos_metodologia.md # Chapter 3: Objetivos y metodología
│ ├── 04_desarrollo_especifico.md # Chapter 4: Desarrollo específico (4.1, 4.2, 4.3)
│ ├── 05_conclusiones_trabajo_futuro.md # Chapter 5: Conclusiones
│ └── 06_referencias_bibliograficas.md # Referencias bibliográficas
├── src/
│ ├── paddle_ocr_fine_tune_unir_raytune.ipynb # Main experiment (64 trials)
│ ├── paddle_ocr_tuning.py # CLI evaluation script
│ ├── dataset_manager.py # ImageTextDataset class
│ ├── prepare_dataset.ipynb # Dataset preparation
│ └── raytune_paddle_subproc_results_20251207_192320.csv # 64 trial results
├── results/ # Benchmark results CSVs
├── instructions/ # UNIR instructions and template
│ ├── instrucciones.pdf # TFE writing guidelines
│ └── plantilla_individual.pdf # Word template (PDF version)
├── ocr_benchmark_notebook.ipynb # Initial OCR benchmark
└── README.md
docs/ to Template Mapping
The template (plantilla_individual.pdf) requires 5 chapters. The docs/ files now match this structure exactly:
| Template Section | docs/ File | Notes |
|---|---|---|
| Resumen | 00_resumen.md (Spanish part) | 150-300 words |
| Abstract | 00_resumen.md (English part) | 150-300 words |
| 1. Introducción | 01_introduccion.md | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | 02_contexto_estado_arte.md | Subsections 2.1, 2.2, 2.3 |
| 3. Objetivos y metodología | 03_objetivos_metodologia.md | Subsections 3.1, 3.2, 3.3 |
| 4. Desarrollo específico | 04_desarrollo_especifico.md | Includes 4.1, 4.2, 4.3 |
| 5. Conclusiones y trabajo futuro | 05_conclusiones_trabajo_futuro.md | Subsections 5.1, 5.2 |
| Referencias bibliográficas | 06_referencias_bibliograficas.md | APA, alphabetical |
| Anexo A | (create from README) | Repository URL |
Important Data Files
Results CSV Files
- src/raytune_paddle_subproc_results_20251207_192320.csv - 64 Ray Tune trials with configs and metrics
- results/ai_ocr_benchmark_finetune_results_20251206_113206.csv - Per-page OCR benchmark results
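Reading a value straight from the trials CSV, rather than quoting it from memory, could look like the sketch below. The column names `CER` and `config/text_det_thresh` are assumptions (the metric name used in the Ray Tune configuration plus Ray's usual `config/` prefix), and the in-memory sample is made up for the demonstration; inspect the real file's header first.

```python
import csv
import io
from typing import TextIO


def best_trial(csv_file: TextIO, metric: str = "CER") -> dict[str, str]:
    """Return the row with the lowest value in the `metric` column."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("no trials found")
    return min(rows, key=lambda row: float(row[metric]))


# Made-up sample with ASSUMED column names; not the real results file.
sample = io.StringIO(
    "CER,config/text_det_thresh\n"
    "0.05,0.30\n"
    "0.02,0.45\n"
    "0.09,0.10\n"
)
best = best_trial(sample)
print(best["CER"], best["config/text_det_thresh"])
```

For the real file, the same function can be called on `open(path, newline="", encoding="utf-8")`.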
Key Notebooks
- src/paddle_ocr_fine_tune_unir_raytune.ipynb - Main Ray Tune experiment
- src/prepare_dataset.ipynb - PDF to image/text conversion
- ocr_benchmark_notebook.ipynb - EasyOCR vs PaddleOCR vs DocTR comparison
Technical Stack
| Component | Version |
|---|---|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |
Pending Work
Priority Tasks
- Validate on other document types - Test optimal config on invoices, forms, contracts
- Expand dataset - Current dataset has only 24 pages
- Complete unified thesis document - Merge docs/ chapters into final UNIR format
- Create presentation slides - For thesis defense
Optional Extensions
- Explore text_det_unclip_ratio parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
- Multi-objective optimization (CER + WER + inference time)
UNIR TFE Document Guidelines
CRITICAL: The thesis MUST follow UNIR's official template (instructions/plantilla_individual.pdf) and guidelines (instructions/instrucciones.pdf).
Work Type Classification
This thesis is a hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones):
- Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
- Experimental pilot with Ray Tune hyperparameter optimization
- 64 trials executed, results analyzed statistically
Document Structure (from plantilla_individual.pdf - MANDATORY)
The TFE must follow this EXACT structure from the official template:
| Section | Subsections | Notes |
|---|---|---|
| Portada | Title, Author, Type, Director, Date | Use template format exactly |
| Resumen | 150-300 words + 3-5 Palabras clave | Spanish summary |
| Abstract | 150-300 words + 3-5 Keywords | English summary |
| Índice de contenidos | Auto-generated | New page |
| Índice de figuras | Auto-generated | New page |
| Índice de tablas | Auto-generated | New page |
| 1. Introducción | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| 2. Contexto y estado del arte | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| 3. Objetivos concretos y metodología | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| 4. Desarrollo específico | Varies by work type (see below) | Main content |
| 5. Conclusiones y trabajo futuro | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| Referencias bibliográficas | APA format, alphabetical, hanging indent | Variable |
| Anexo A | Código fuente y datos analizados | Repository URL |
Total length: 50-90 pages (excluding cover, resumen, abstract, indices, annexes)
Chapter-Specific Requirements (from plantilla_individual.pdf)
1. Introducción
The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.
1.1 Motivación:
- Present the problem to solve
- Justify importance to educational/scientific community
- Answer: What problem? What are the causes? Why is it relevant?
- Must include references to prior research
1.2 Planteamiento del trabajo:
- Briefly state the problem/need detected
- Describe the proposal and purpose
- Answer: How to solve? What is proposed?
1.3 Estructura del trabajo:
- Briefly describe what each subsequent chapter contains
2. Contexto y estado del arte
Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).
2.1 Contexto del problema:
- Deep study of the application domain
2.2 Estado del arte:
- Antecedents, current studies, comparison of existing tools
- Must reference key authors in the field (justify exclusions)
2.3 Conclusiones:
- Summary linking research to the work to be done
- How findings affect the specific development
3. Objetivos concretos y metodología de trabajo
Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.
3.1 Objetivo general:
- Must be SMART (Doran, 1981)
- Focus on achieving an observable effect, not just "create a tool"
- Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"
3.2 Objetivos específicos:
- Divide general objective into analyzable sub-objectives
- Must be SMART
- Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
- Typically ~5 objectives: 1-2 about state of art, 2-3 about development
3.3 Metodología del trabajo:
- Describe steps to achieve objectives
- Explain WHY each step
- What instruments will be used
- How results will be analyzed
4. Desarrollo específico de la contribución
Structure depends on work type. Organize by methodology phases/activities.
For Type 1 (Piloto experimental):
- 4.1 Descripción detallada del experimento
  - Technologies used (with justification)
  - How pilot was organized
  - Participants (demographics)
  - Automatic evaluation techniques
  - How experiment proceeded
  - Monitoring/evaluation instruments
  - Statistical analysis types
- 4.2 Descripción de los resultados (objective, no interpretation)
  - Summary tables, result graphs, relevant data identification
- 4.3 Discusión
  - Relevance of results, explanations for anomalies, highlight key findings
For Type 3 (Comparativa de soluciones):
- 4.1 Planteamiento de la comparativa
  - Problem identification, alternative solutions to evaluate
  - Success criteria, measures to take
- 4.2 Desarrollo de la comparativa
  - All results and measurements obtained
  - Graphs, tables, data visualization
- 4.3 Discusión y análisis de resultados
  - Discussion of meaning, advantages/disadvantages of solutions
5. Conclusiones y trabajo futuro
5.1 Conclusiones:
- Summary of problem, approach, and why solution is valid
- Summary of contributions
- Relate contributions and results to objectives - discuss degree of achievement
5.2 Líneas de trabajo futuro:
- Future work that would add value
- Justify how contribution can be used and in what fields
SMART Objectives Requirements
ALL objectives (general and specific) MUST be SMART:
| Criterion | Requirement | Example from this thesis |
|---|---|---|
| Specific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| Measurable | Quantifiable success metric | "CER < 2%" |
| Attainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| Relevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| Time-bound | Achievable in timeframe | "Un cuatrimestre" |
Citation and Reference Rules
APA Format is MANDATORY
Reference guide: https://bibliografiaycitas.unir.net/
In-text citations:
- Single author: (Du, 2020) or Du (2020)
- Two authors: (Du & Li, 2020)
- Three+ authors: (Du et al., 2020)
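The three in-text cases above reduce to a rule on the number of authors. A minimal sketch, with a hypothetical helper name, that covers only the parenthetical form:

```python
def in_text_citation(surnames: list[str], year: int) -> str:
    """Format a parenthetical in-text citation by author count."""
    if not surnames:
        raise ValueError("at least one author is required")
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = f"{surnames[0]} & {surnames[1]}"
    else:
        # Three or more authors: first surname followed by "et al."
        authors = f"{surnames[0]} et al."
    return f"({authors}, {year})"


print(in_text_citation(["Du"], 2020))
print(in_text_citation(["Du", "Li"], 2020))
print(in_text_citation(["Du", "Li", "Guo"], 2020))
```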
Reference list examples:
# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
for image-based sequence recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39(11), 2298-2304.
https://doi.org/10.1109/TPAMI.2016.2646371
# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
A next-generation hyperparameter optimization framework. Proceedings
of the 25th ACM SIGKDD, 2623-2631.
https://doi.org/10.1145/3292500.3330701
# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
lightweight OCR system. arXiv preprint arXiv:2009.09941.
https://arxiv.org/abs/2009.09941
# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR
# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
(2nd ed.). Lawrence Erlbaum Associates.
Reference Rules
- NO Wikipedia citations
- Include variety: books, conferences, journal articles (not just URLs)
- All cited references must appear in reference list
- All references in list must be cited in text
- Order alphabetically by first author's surname
- Include DOI or URL when available
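The two-way consistency rule (every citation listed, every listed entry cited) and the alphabetical ordering can be checked mechanically. The sketch below assumes citations and entries have already been reduced to "Surname, Year" keys; extracting those keys from the Markdown chapters is left out, and the function names are hypothetical.

```python
def check_references(cited: set[str], listed: list[str]) -> dict[str, list[str]]:
    """Report keys that appear on only one side (text vs. reference list)."""
    listed_set = set(listed)
    return {
        "cited_but_not_listed": sorted(cited - listed_set),
        "listed_but_not_cited": sorted(listed_set - cited),
    }


def is_alphabetical(listed: list[str]) -> bool:
    """True if the entries are already in case-insensitive alphabetical order."""
    keys = [entry.casefold() for entry in listed]
    return keys == sorted(keys)


# Illustrative keys only; not the actual state of the thesis.
cited = {"Akiba, 2019", "Du, 2020", "Shi, 2016"}
listed = ["Akiba, 2019", "Cohen, 1988", "Du, 2020"]

report = check_references(cited, listed)
print(report)
print(is_alphabetical(listed))
```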
Document Formatting Rules
Page Setup
| Element | Specification |
|---|---|
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |
Typography
| Element | Format |
|---|---|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |
Tables and Figures (from plantilla_individual.pdf)
Table format example:
Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.
Figure format example:
Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.
Rules:
- Title position: Above the table/figure
- Numbering format: "Tabla 1." / "Figura 1." (Calibri 12, bold)
- Title text: Calibri 12, italic (after the number)
- Source: Below, centered, format "Fuente: Author, Year."
- Can reduce font to 9pt for dense tables
- Can use landscape orientation for large tables
- Tables should have horizontal lines only (no vertical lines) per APA style
Writing Style Rules
MUST DO:
- Each chapter starts with introductory paragraph explaining content
- Each paragraph has at least 3 sentences
- Verify originality (cite all sources)
- Check spelling with Word corrector
- Ensure logical flow between paragraphs
- Define concepts and include pertinent citations
MUST NOT DO:
- Two consecutive headings without text between them
- Superfluous phrases and repetition of ideas
- Short paragraphs (less than 3 sentences)
- Missing figure/table numbers or titles
- Broken index generation
Annexes Requirements
Anexo A - Código fuente y datos:
- Include repository URL where code is hosted
- Student must be sole author and owner of repository
- No commits from other users
- Data used should also be in repository
- If confidential (company project), justify why not shared
Final Submission
- Drafts: Submit in Word format
- Final deposit: Submit in PDF format
- Verify all indices generate correctly before final submission
Guidelines for Claude
CRITICAL: Academic Rigor Requirements
This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.
DO NOT:
- NEVER fabricate data or statistics - Every number must come from an actual file in this repository
- NEVER invent comparison results - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
- NEVER assume or estimate values - If a metric isn't in the CSV/notebook, don't include it
- NEVER extrapolate beyond what the data shows - 24 pages is a limited dataset, acknowledge this
- NEVER claim results that weren't measured - Only report what was actually computed
ALWAYS:
- Read the source file first before citing any result
- Quote exact values from CSV files (e.g., CER 0.011535 not "approximately 1%")
- Reference the specific file and location for every data point
- Acknowledge limitations explicitly (dataset size, CPU-only, single document type)
- Distinguish between measured results and interpretations
Data Sources (ONLY use these):
| Data Type | Source File |
|---|---|
| Ray Tune 64 trials | src/raytune_paddle_subproc_results_20251207_192320.csv |
| Per-page benchmark | results/ai_ocr_benchmark_finetune_results_20251206_113206.csv |
| Experiment code | src/paddle_ocr_fine_tune_unir_raytune.ipynb |
| Final comparison | Output cells in the notebook (baseline vs optimized) |
Example of WRONG vs RIGHT:
WRONG: "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER" (We don't have this comparison data in our results files)
RIGHT: "PaddleOCR with baseline configuration achieved CER between 1.54% and 6.40% across pages 5-9 (source: results/ai_ocr_benchmark_finetune_results_20251206_113206.csv)"
WRONG: "The optimization improved results by approximately 80%"
RIGHT: "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in paddle_ocr_fine_tune_unir_raytune.ipynb)"
When Working on Documentation
- Read UNIR guidelines first: Check instructions/instrucciones.pdf for structure requirements
- Follow chapter structure: Each chapter has specific content requirements per UNIR guidelines
- References are UNIFIED: All references go in docs/06_referencias_bibliograficas.md, NOT per-chapter
- Use APA format: All citations must follow APA style
- Include "Fuentes de datos": Each chapter should list which repository files the data came from
- Language: Documentation is in Spanish (thesis requirement), code comments in English
- Hardware context: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation
- When in doubt, ask: If the user requests data that doesn't exist, ask rather than inventing numbers
Common Tasks
- Adding new experiments: Update src/paddle_ocr_fine_tune_unir_raytune.ipynb
- Updating documentation: Edit files in docs/
- Adding references: Add to docs/06_referencias_bibliograficas.md (unified list)
- Dataset expansion: Use src/prepare_dataset.ipynb as template
- Running evaluations: Use src/paddle_ocr_tuning.py CLI
Experiment Details
Ray Tune Configuration
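from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is the trial function defined in the notebook;
# the search space (param_space) is not shown in this excerpt.
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",             # minimize Character Error Rate
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,           # 64 trials in total
        max_concurrent_trials=2,  # at most two trials run at once
    ),
)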
Dataset
- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF
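The dataset steps above (render each page at 300 DPI, extract its text as ground truth) can be sketched with PyMuPDF as follows. The function names and the `page_0001` file naming scheme are hypothetical, and the actual logic in src/prepare_dataset.ipynb may differ; PyMuPDF must be installed for `export_pages` to run.

```python
from pathlib import Path


def page_stem(page_index: int) -> str:
    """Zero-padded, 1-based file stem for a page (hypothetical naming scheme)."""
    return f"page_{page_index + 1:04d}"


def export_pages(pdf_path: str, out_dir: str, dpi: int = 300) -> int:
    """Write one PNG and one ground-truth TXT per page; return the page count."""
    import fitz  # PyMuPDF; imported lazily so page_stem has no dependency

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            stem = page_stem(index)
            # Render the page to an image at the requested resolution.
            page.get_pixmap(dpi=dpi).save(str(out / f"{stem}.png"))
            # The PDF's embedded text layer serves as ground truth.
            (out / f"{stem}.txt").write_text(page.get_text(), encoding="utf-8")
        return len(doc)


print(page_stem(0), page_stem(23))
```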
Metrics
- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using the jiwer library
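Both metrics are edit distances normalized by the reference length: characters for CER, words for WER. The project computes them with jiwer; the sketch below is a dependency-free Levenshtein implementation written only to show the definition, and it applies no text normalization, so its values may differ from jiwer's on real data.

```python
from collections.abc import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings, not taken from the dataset.
reference = "texto de prueba"
hypothesis = "texto de prueva"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")
```

One wrong character in a 15-character reference gives a CER of 1/15, while the same error makes one of three words wrong, which is why WER is usually the higher of the two.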