2025-12-16 00:00:58 +01:00
parent 647a8d1c7b
commit 5862a69ac2
4 changed files with 112 additions and 35 deletions

@@ -59,6 +59,11 @@ MastersThesis/
│ ├── 05_conclusiones_trabajo_futuro.md # 5. Conclusiones (5.1, 5.2)
│ ├── 06_referencias_bibliograficas.md # Referencias bibliográficas (APA format)
│ └── 07_anexo_a.md # Anexo A: Código fuente y datos
├── thesis_output/ # Generated thesis document
│ ├── plantilla_individual.htm # Complete TFM (open in Word)
│ └── figures/ # PNG figures from Mermaid diagrams
│ ├── figura_1.png ... figura_7.png
│ └── figures_manifest.json
├── src/
│ ├── paddle_ocr_fine_tune_unir_raytune.ipynb # Main experiment (64 trials)
│ ├── paddle_ocr_tuning.py # CLI evaluation script
@@ -69,7 +74,9 @@ MastersThesis/
├── instructions/ # UNIR instructions and template
│ ├── instrucciones.pdf # TFE writing guidelines
│ ├── plantilla_individual.pdf # Word template (PDF version)
-│ └── plantilla_individual.htm # Word template (HTML version, readable)
+│ └── plantilla_individual.htm # Word template (HTML version, source)
├── apply_content.py # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb # Initial OCR benchmark
└── README.md
```
@@ -115,19 +122,48 @@ The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ fil
### Completed Tasks
- [x] **Structure docs/ to match UNIR template** - All chapters now follow exact numbering (1.1, 1.2, etc.)
-- [x] **Add Mermaid diagrams** - 4 diagrams added (OCR pipeline, Ray Tune architecture, CER comparison charts)
+- [x] **Add Mermaid diagrams** - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
- [x] **Generate unified thesis document** - `apply_content.py` generates complete document from docs/
- [x] **Convert Mermaid to PNG** - `generate_mermaid_figures.py` generates figures automatically
- [x] **Proper template formatting** - Tables/figures use `Piedefoto-tabla` class, references use `MsoBibliography`
### Priority Tasks
1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Expand dataset** - Current dataset has only 24 pages
-3. **Complete unified thesis document** - Merge docs/ chapters into final UNIR Word format
-4. **Create presentation slides** - For thesis defense
+3. **Create presentation slides** - For thesis defense
+4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting
### Optional Extensions
- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
- Multi-objective optimization (CER + WER + inference time)
## Thesis Document Generation
To regenerate the thesis document:
```bash
# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py
# 2. Apply docs/ content to UNIR template
python3 apply_content.py
# 3. Open in Word and finalize
# - Open thesis_output/plantilla_individual.htm in Microsoft Word
# - Press Ctrl+A then F9 to update all indices
# - Save as .docx
```
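The two scripted steps above can also be chained from Python so that a failure in figure generation stops the run before the template is touched. This is a minimal sketch, not part of the repository: the script names come from the steps above, while the `regenerate` function, its `repo_root` and `runner` parameters, and the `STEPS` list are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the steps above; the order matters because
# apply_content.py is expected to embed the PNG figures generated first.
STEPS = ["generate_mermaid_figures.py", "apply_content.py"]


def regenerate(repo_root=".", runner=subprocess.run):
    """Run each generation script in order, stopping at the first failure.

    `runner` is injectable (assumption for this sketch) so the sequence can
    be checked without executing the real scripts.
    """
    root = Path(repo_root)
    completed = []
    for script in STEPS:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner([sys.executable, script], cwd=root, check=True)
        completed.append(script)
    return completed
```

Calling `regenerate()` from the repository root runs both scripts with the current interpreter. The Word step (open `thesis_output/plantilla_individual.htm`, Ctrl+A, F9, save as .docx) remains manual.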
**What `apply_content.py` does:**
- Replaces Resumen and Abstract with actual content + keywords
- Replaces all 5 chapters with content from docs/
- Replaces Referencias with APA-formatted bibliography
- Replaces Anexo with repository information
- Converts Mermaid diagrams to embedded PNG images
- Formats tables with `Piedefoto-tabla` captions and sources
- Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)
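To make two of the bullets above concrete, here is a hedged sketch of how instruction paragraphs could be removed and how a caption could be tagged with the template class. It is not the code in `apply_content.py`: the function names, the regular expression, and the assumption that each instruction sits in its own `<p>` element are all illustrative. Only the marker strings and the `Piedefoto-tabla` class name come from the text above.

```python
import html
import re

# Marker strings quoted in the list above; the real script may match more.
INSTRUCTION_MARKERS = ("Importante:", "Ejemplo de nota al pie")

# Matches one <p ...>...</p> element (assumes paragraphs are not nested).
PARAGRAPH_RE = re.compile(r"<p\b[^>]*>.*?</p>", flags=re.DOTALL | re.IGNORECASE)


def strip_instruction_paragraphs(document: str) -> str:
    """Drop every paragraph whose text contains a template instruction marker."""
    def keep_or_drop(match):
        paragraph = match.group(0)
        if any(marker in paragraph for marker in INSTRUCTION_MARKERS):
            return ""
        return paragraph

    return PARAGRAPH_RE.sub(keep_or_drop, document)


def caption_paragraph(text: str) -> str:
    """Wrap a table or figure caption in the template's caption class."""
    return f'<p class="Piedefoto-tabla">{html.escape(text)}</p>'
```

A regex is enough for a sketch like this; for the full Word-exported HTML, a real HTML parser would be the more robust choice.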
---
## UNIR TFE Document Guidelines