# Claude Code Context - Masters Thesis OCR Project

## Project Overview

This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.

**Author:** Sergio Jiménez Jiménez
**University:** UNIR (Universidad Internacional de La Rioja)
**Year:** 2025

## Key Context

### Why Hyperparameter Optimization Instead of Fine-tuning

Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction

### Main Results

| Model | CER | Character Accuracy |
|-------|-----|-------------------|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |

**Goal achieved:** CER < 2% (target was < 2%, result is 1.49%)

### Optimal Configuration Found

```python
config_optimizada = {
    "textline_orientation": True,  # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
```

### Key Findings

1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
2. `text_det_thresh` has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs

## Repository Structure

```
MastersThesis/
├── docs/                                   # Thesis chapters in Markdown (matches template structure)
│   ├── 00_resumen.md                       # Resumen + Abstract
│   ├── 01_introduccion.md                  # Chapter 1: Introducción
│   ├── 02_contexto_estado_arte.md          # Chapter 2: Contexto y estado del arte
│   ├── 03_objetivos_metodologia.md         # Chapter 3: Objetivos y metodología
│   ├── 04_desarrollo_especifico.md         # Chapter 4: Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md   # Chapter 5: Conclusiones
│   └── 06_referencias_bibliograficas.md    # Referencias bibliográficas
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb              # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                                 # CLI evaluation script
│   ├── dataset_manager.py                                   # ImageTextDataset class
│   ├── prepare_dataset.ipynb                                # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv   # 64 trial results
├── results/                                # Benchmark results CSVs
├── instructions/                           # UNIR instructions and template
│   ├── instrucciones.pdf                   # TFE writing guidelines
│   └── plantilla_individual.pdf            # Word template (PDF version)
├── ocr_benchmark_notebook.ipynb            # Initial OCR benchmark
└── README.md
```

### docs/ to Template Mapping

The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
|-----------------|------------|-------|
| Resumen | `00_resumen.md` (Spanish part) | 150-300 words |
| Abstract | `00_resumen.md` (English part) | 150-300 words |
| 1. Introducción | `01_introduccion.md` | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | `02_contexto_estado_arte.md` | Subsections 2.1, 2.2, 2.3 |
| 3. Objetivos y metodología | `03_objetivos_metodologia.md` | Subsections 3.1, 3.2, 3.3 |
| 4. Desarrollo específico | `04_desarrollo_especifico.md` | Includes 4.1, 4.2, 4.3 |
| 5. Conclusiones y trabajo futuro | `05_conclusiones_trabajo_futuro.md` | Subsections 5.1, 5.2 |
| Referencias bibliográficas | `06_referencias_bibliograficas.md` | APA, alphabetical |
| Anexo A | (create from README) | Repository URL |

## Important Data Files

### Results CSV Files

- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics
- `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv` - Per-page OCR benchmark results

### Key Notebooks

- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison

## Technical Stack

| Component | Version |
|-----------|---------|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

## Pending Work

### Priority Tasks

1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Expand dataset** - Current dataset has only 24 pages
3. **Complete unified thesis document** - Merge docs/ chapters into final UNIR format
4. **Create presentation slides** - For thesis defense

### Optional Extensions

- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
- Multi-objective optimization (CER + WER + inference time)

---

## UNIR TFE Document Guidelines

**CRITICAL:** The thesis MUST follow UNIR's official template (`instructions/plantilla_individual.pdf`) and guidelines (`instructions/instrucciones.pdf`).

### Work Type Classification

This thesis is a **hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones)**:

- Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
- Experimental pilot with Ray Tune hyperparameter optimization
- 64 trials executed, results analyzed statistically

### Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
|---------|-------------|-------|
| **Portada** | Title, Author, Type, Director, Date | Use template format exactly |
| **Resumen** | 150-300 words + 3-5 Palabras clave | Spanish summary |
| **Abstract** | 150-300 words + 3-5 Keywords | English summary |
| **Índice de contenidos** | Auto-generated | New page |
| **Índice de figuras** | Auto-generated | New page |
| **Índice de tablas** | Auto-generated | New page |
| **1. Introducción** | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| **2. Contexto y estado del arte** | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| **3. Objetivos concretos y metodología** | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| **4. Desarrollo específico** | Varies by work type (see below) | Main content |
| **5. Conclusiones y trabajo futuro** | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| **Referencias bibliográficas** | APA format, alphabetical, hanging indent | Variable |
| **Anexo A** | Código fuente y datos analizados | Repository URL |

**Total length:** 50-90 pages (excluding cover, resumen, abstract, indices, annexes)

### Chapter-Specific Requirements (from plantilla_individual.pdf)

#### 1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

**1.1 Motivación:**

- Present the problem to solve
- Justify importance to educational/scientific community
- Answer: What problem? What are the causes? Why is it relevant?
- Must include references to prior research

**1.2 Planteamiento del trabajo:**

- Briefly state the problem/need detected
- Describe the proposal and purpose
- Answer: How to solve? What is proposed?

**1.3 Estructura del trabajo:**

- Briefly describe what each subsequent chapter contains

#### 2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

**2.1 Contexto del problema:**

- Deep study of the application domain

**2.2 Estado del arte:**

- Antecedents, current studies, comparison of existing tools
- Must reference key authors in the field (justify exclusions)

**2.3 Conclusiones:**

- Summary linking research to the work to be done
- How findings affect the specific development

#### 3. Objetivos concretos y metodología de trabajo

Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.

**3.1 Objetivo general:**

- Must be SMART (Doran, 1981)
- Focus on achieving an observable effect, not just "create a tool"
- Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"

**3.2 Objetivos específicos:**

- Divide general objective into analyzable sub-objectives
- Must be SMART
- Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
- Typically ~5 objectives: 1-2 about state of art, 2-3 about development

**3.3 Metodología del trabajo:**

- Describe steps to achieve objectives
- Explain WHY each step
- What instruments will be used
- How results will be analyzed

#### 4. Desarrollo específico de la contribución

Structure depends on work type. Organize by methodology phases/activities.

**For Type 1 (Piloto experimental):**

- 4.1 Descripción detallada del experimento
  - Technologies used (with justification)
  - How pilot was organized
  - Participants (demographics)
  - Automatic evaluation techniques
  - How experiment proceeded
  - Monitoring/evaluation instruments
  - Statistical analysis types
- 4.2 Descripción de los resultados (objective, no interpretation)
  - Summary tables, result graphs, relevant data identification
- 4.3 Discusión
  - Relevance of results, explanations for anomalies, highlight key findings

**For Type 3 (Comparativa de soluciones):**

- 4.1 Planteamiento de la comparativa
  - Problem identification, alternative solutions to evaluate
  - Success criteria, measures to take
- 4.2 Desarrollo de la comparativa
  - All results and measurements obtained
  - Graphs, tables, data visualization
- 4.3 Discusión y análisis de resultados
  - Discussion of meaning, advantages/disadvantages of solutions

#### 5. Conclusiones y trabajo futuro

**5.1 Conclusiones:**

- Summary of problem, approach, and why solution is valid
- Summary of contributions
- **Relate contributions and results to objectives** - discuss degree of achievement

**5.2 Líneas de trabajo futuro:**

- Future work that would add value
- Justify how contribution can be used and in what fields

### SMART Objectives Requirements

ALL objectives (general and specific) MUST be SMART:

| Criterion | Requirement | Example from this thesis |
|-----------|-------------|-------------------------|
| **S**pecific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
| **M**easurable | Quantifiable success metric | "CER < 2%" |
| **A**ttainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
| **R**elevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
| **T**ime-bound | Achievable in timeframe | "Un cuatrimestre" |

### Citation and Reference Rules

#### APA Format is MANDATORY

Reference guide: https://bibliografiaycitas.unir.net/

**In-text citations:**

- Single author: (Du, 2020) or Du (2020)
- Two authors: (Du & Li, 2020)
- Three+ authors: (Du et al., 2020)

**Reference list examples:**

```
# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298-2304. https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD, 2623-2631. https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra lightweight OCR system. arXiv preprint arXiv:2009.09941. https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
```

#### Reference Rules

- **NO Wikipedia citations**
- Include variety: books, conferences, journal articles (not just URLs)
- All cited references must appear in reference list
- All references in list must be cited in text
- Order alphabetically by first author's surname
- Include DOI or URL when available

### Document Formatting Rules

#### Page Setup

| Element | Specification |
|---------|--------------|
| Page size | A4 |
| Left margin | 3.0 cm |
| Right margin | 2.0 cm |
| Top/Bottom margins | 2.5 cm |
| Header | Student name + TFE title |
| Footer | Page number |

#### Typography

| Element | Format |
|---------|--------|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
| Footnotes | Calibri 10, justified, single spacing |
| Code | Can reduce to 9pt if needed |

#### Tables and Figures (from plantilla_individual.pdf)

**Table format example:**

```
Tabla 1. Ejemplo de tabla con sus principales elementos.
[TABLE CONTENT]
Fuente: American Psychological Association, 2020a.
```

**Figure format example:**

```
Figura 1. Ejemplo de figura realizada para nuestro trabajo.
[FIGURE]
Fuente: American Psychological Association, 2020b.
```

**Rules:**

- **Title position**: Above the table/figure
- **Numbering format**: "**Tabla 1.**" / "**Figura 1.**" (Calibri 12, bold)
- **Title text**: Calibri 12, italic (after the number)
- **Source**: Below, centered, format "Fuente: Author, Year."
- Can reduce font to 9pt for dense tables
- Can use landscape orientation for large tables
- Tables should have horizontal lines only (no vertical lines) per APA style

### Writing Style Rules

#### MUST DO:

- Each chapter starts with introductory paragraph explaining content
- Each paragraph has at least 3 sentences
- Verify originality (cite all sources)
- Check spelling with Word corrector
- Ensure logical flow between paragraphs
- Define concepts and include pertinent citations

#### MUST NOT DO:

- Two consecutive headings without text between them
- Superfluous phrases and repetition of ideas
- Short paragraphs (less than 3 sentences)
- Missing figure/table numbers or titles
- Broken index generation

### Annexes Requirements

**Anexo A - Código fuente y datos:**

- Include repository URL where code is hosted
- Student must be sole author and owner of repository
- No commits from other users
- Data used should also be in repository
- If confidential (company project), justify why not shared

### Final Submission

- **Drafts**: Submit in Word format
- **Final deposit**: Submit in PDF format
- Verify all indices generate correctly before final submission

---

## Guidelines for Claude

### CRITICAL: Academic Rigor Requirements

**This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.**

#### DO NOT:

- **NEVER fabricate data or statistics** - Every number must come from an actual file in this repository
- **NEVER invent comparison results** - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
- **NEVER assume or estimate values** - If a metric isn't in the CSV/notebook, don't include it
- **NEVER extrapolate beyond what the data shows** - 24 pages is a limited dataset, acknowledge this
- **NEVER claim results that weren't measured** - Only report what was actually computed

#### ALWAYS:

- **Read the source file first** before citing any result
- **Quote exact values** from CSV files (e.g., CER 0.011535 not "approximately 1%")
- **Reference the specific file and location** for every data point
- **Acknowledge limitations** explicitly (dataset size, CPU-only, single document type)
- **Distinguish between measured results and interpretations**

#### Data Sources (ONLY use these):

| Data Type | Source File |
|-----------|-------------|
| Ray Tune 64 trials | `src/raytune_paddle_subproc_results_20251207_192320.csv` |
| Per-page benchmark | `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv` |
| Experiment code | `src/paddle_ocr_fine_tune_unir_raytune.ipynb` |
| Final comparison | Output cells in the notebook (baseline vs optimized) |

#### Example of WRONG vs RIGHT:

**WRONG:** "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER"
(We don't have this comparison data in our results files)

**RIGHT:** "PaddleOCR with baseline configuration achieved CER between 1.54% and 6.40% across pages 5-9 (source: `results/ai_ocr_benchmark_finetune_results_20251206_113206.csv`)"

**WRONG:** "The optimization improved results by approximately 80%"

**RIGHT:** "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in `paddle_ocr_fine_tune_unir_raytune.ipynb`)"

### When Working on Documentation

1. **Read UNIR guidelines first**: Check `instructions/instrucciones.pdf` for structure requirements
2. **Follow chapter structure**: Each chapter has specific content requirements per UNIR guidelines
3. **References are UNIFIED**: All references go in `docs/06_referencias_bibliograficas.md`, NOT per-chapter
4. **Use APA format**: All citations must follow APA style
5. **Include "Fuentes de datos"**: Each chapter should list which repository files the data came from
6. **Language**: Documentation is in Spanish (thesis requirement), code comments in English
7. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation
8. **When in doubt, ask**: If the user requests data that doesn't exist, ask rather than inventing numbers

### Common Tasks

- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
- **Updating documentation**: Edit files in `docs/`
- **Adding references**: Add to `docs/06_referencias_bibliograficas.md` (unified list)
- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI

---

## Experiment Details

### Ray Tune Configuration

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

# trainable_paddle_ocr is defined in the notebook; the search space
# (param_space) is not shown in this excerpt.
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,
        max_concurrent_trials=2,
    ),
)
```

### Dataset

- Source: UNIR TFE instructions PDF
- Pages: 24
- Resolution: 300 DPI
- Ground truth: Extracted via PyMuPDF

### Metrics

- CER (Character Error Rate) - Primary metric
- WER (Word Error Rate) - Secondary metric
- Calculated using `jiwer` library
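
The metrics above can be illustrated with a minimal, dependency-free sketch. It is not the project's code and not `jiwer`'s implementation: it only applies the standard definitions (edit distance divided by reference length), and the helper names `edit_distance`, `cer`, `wer`, and `relative_reduction` are hypothetical. The example strings are invented for illustration and are not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline, optimized):
    """Relative reduction in percent, e.g. between two CER values."""
    return (baseline - optimized) / baseline * 100


# Illustrative strings only (one missing accent), not dataset content.
example_cer = cer("metodología", "metodologia")
example_wer = wer("objetivos y metodología", "objetivos y metodologia")
```

Applied to the rounded values in the Main Results table, `relative_reduction(7.78, 1.49)` evaluates to roughly 80.85%. The reported 80.9% presumably comes from the unrounded CERs in the notebook output (an assumption), so quote the notebook value rather than recomputing it from the table.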