deliberable_16_12_2025

.claudeignore (new file, 6 lines)
@@ -0,0 +1,6 @@
+~$*.docx
+results/
+__pycache__/
+dataset
+results
+.DS_Store
.gitignore (vendored, 3 → 8 lines)
@@ -1,3 +1,8 @@
 ~$*.docx
 results/
-__pycache__/*
+__pycache__/
+dataset
+results
+.DS_Store
+.claude
+node_modules
README.md (modified, 53 → 311 lines)
@@ -1,53 +1,311 @@
-# Multi-engine AI OCR system for scanned PDFs in Spanish
+# Optimización de Hiperparámetros OCR con Ray Tune para Documentos Académicos en Español
+
-**Trabajo Fin de Máster (TFM) – Type 2: Software Development**
+**Trabajo Fin de Máster (TFM) – Máster Universitario en Inteligencia Artificial**
 **Lines:** Computational perception · Machine learning
 **Author:** Sergio Jiménez Jiménez · **UNIR** · **Year:** 2025

-> Text extraction from **scanned PDFs** in **Spanish** using **AI-based OCR engines** (EasyOCR · PaddleOCR · DocTR).
-> Classic solutions such as **Tesseract** and proprietary ones such as **ABBYY** are excluded; the project focuses on modern neural models.
+> Systematic hyperparameter optimization of **PaddleOCR (PP-OCRv5)** with **Ray Tune** and **Optuna** to improve optical character recognition of Spanish academic documents.

 ---

-## 🧭 Objective
+## Objective

-Develop and evaluate a **multi-engine OCR system** able to:
-
-- Process scanned PDFs end to end (**PDF → Image → Preprocessing → OCR → Evaluation**).
-- **Reduce CER by at least 15%** relative to a neural baseline (EasyOCR).
-- Keep **per-page times** reasonable and the pipeline **modular and reproducible**.
-
-**Main metrics:**
-
-- **CER** (*Character Error Rate*)
-- **WER** (*Word Error Rate*)
-- **Per-page latency**
+Optimize PaddleOCR performance on Spanish academic documents through hyperparameter tuning, reaching a **CER below 2%** without fine-tuning the model and without dedicated GPU resources.
+
+**Result achieved:** CER = **1.49%** (target met)

 ---

-## 🧩 Scope and design
-
-- **Language:** Spanish (printed text, not handwritten).
-- **Input:** scanned PDFs of variable quality, with noise or skew.
-- **Engines evaluated:**
-  - **EasyOCR** – lightweight neural baseline.
-  - **PaddleOCR (PP-OCR)** – multilingual industrial reference.
-  - **DocTR (Mindee)** – modular PyTorch architecture with structured output.
-- **Evaluation:** CER, WER and average latency per page.
+## Main Results
+
+| Model | CER | Character Accuracy | WER | Word Accuracy |
+|-------|-----|--------------------|-----|---------------|
+| PaddleOCR (Baseline) | 7.78% | 92.22% | 14.94% | 85.06% |
+| **PaddleOCR-HyperAdjust** | **1.49%** | **98.51%** | **7.62%** | **92.38%** |
+
+**Improvement obtained:** CER reduced by **80.9%**
-## 🏗️ System architecture
-
-```text
-PDF (scanned)
- └─► Convert to image (PyMuPDF / pdf2image)
- └─► Preprocessing (OpenCV)
- └─► OCR (EasyOCR | PaddleOCR | DocTR)
- └─► Evaluation (CER · WER · latency)
-```
+### Optimal Configuration Found
+
+```python
+config_optimizada = {
+    "textline_orientation": True,          # CRITICAL – reduces CER by ~70%
+    "use_doc_orientation_classify": False,
+    "use_doc_unwarping": False,
+    "text_det_thresh": 0.4690,             # Correlation of -0.52 with CER
+    "text_det_box_thresh": 0.5412,
+    "text_det_unclip_ratio": 0.0,
+    "text_rec_score_thresh": 0.6350,
+}
+```
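As an illustration of how this configuration is consumed, the sketch below instantiates the PP-OCRv5 pipeline with the tuned values. Keyword names follow the PaddleOCR 3.x API; the mapping of `textline_orientation` onto `use_textline_orientation`, the `lang` choice, and the image path are assumptions of the example, not verbatim project code.

```python
# Sketch (not the exact experiment code): run PP-OCRv5 with the tuned values.
# Parameter names follow the PaddleOCR 3.x pipeline API; verify them against
# the installed version before relying on this.
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    lang="es",                              # assumption: Spanish recognition model
    use_textline_orientation=True,          # config_optimizada["textline_orientation"]
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    text_det_thresh=0.4690,
    text_det_box_thresh=0.5412,
    text_det_unclip_ratio=0.0,
    text_rec_score_thresh=0.6350,
)
result = ocr.predict("dataset/page_01.png")  # illustrative page image
```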
-## 🔜 Next steps
-
-1. Tune parameters and architectures in DocTR (detector and recognizer).
-2. Add latency metrics.
-3. Add linguistic post-processing (spell correction).
-4. Explore TrOCR or MMOCR as an advanced comparison in a second phase.
+---
+
+## Methodology
+
+### Workflow Pipeline
+
+```
+PDF (UNIR academic document)
+ └─► Conversion to images (PyMuPDF, 300 DPI)
+ └─► Ground-truth extraction
+ └─► OCR with PaddleOCR (PP-OCRv5)
+ └─► Evaluation (CER, WER with jiwer)
+ └─► Optimization (Ray Tune + Optuna)
+```
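The PDF-to-image step can be sketched with PyMuPDF as follows; the actual conversion lives in `src/prepare_dataset.ipynb`, and paths and file naming here are illustrative.

```python
# Sketch of the PDF -> 300 DPI PNG conversion (see src/prepare_dataset.ipynb).
import pathlib
import fitz  # PyMuPDF

pdf_path = "instructions/instrucciones.pdf"   # illustrative input
out_dir = pathlib.Path("dataset")
out_dir.mkdir(exist_ok=True)

with fitz.open(pdf_path) as doc:
    for page_number, page in enumerate(doc, start=1):
        pix = page.get_pixmap(dpi=300)        # render the page at 300 DPI
        pix.save(str(out_dir / f"page_{page_number:02d}.png"))
```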
+### Optimization Experiment
+
+| Parameter | Value |
+|-----------|-------|
+| Number of trials | 64 |
+| Search algorithm | OptunaSearch (TPE) |
+| Objective metric | CER (minimize) |
+| Concurrent trials | 2 |
+| Total time | ~6 hours (CPU) |
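A condensed sketch of this search setup follows; the real trainable, search-space bounds, and dataset handling live in `src/paddle_ocr_fine_tune_unir_raytune.ipynb`, the objective below delegates to a hypothetical helper, and the bounds shown are illustrative.

```python
# Sketch of the Ray Tune + Optuna search (64 trials, 2 concurrent, CER minimized).
# run_paddleocr_and_score is a hypothetical helper standing in for the real
# evaluation over the 24-page dataset; search-space bounds are illustrative.
from ray import tune
from ray.tune.search.optuna import OptunaSearch

def evaluate_config(config):
    cer = run_paddleocr_and_score(config)  # hypothetical: OCR the dataset, return CER
    return {"cer": cer}

search_space = {
    "textline_orientation": tune.choice([True, False]),
    "text_det_thresh": tune.uniform(0.0, 0.9),
    "text_det_box_thresh": tune.uniform(0.0, 0.9),
    "text_rec_score_thresh": tune.uniform(0.0, 0.9),
}

tuner = tune.Tuner(
    evaluate_config,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(metric="cer", mode="min"),
        num_samples=64,
        max_concurrent_trials=2,
    ),
)
results = tuner.fit()
print(results.get_best_result(metric="cer", mode="min").config)
```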
+---
+
+## Repository Structure
+
+```
+MastersThesis/
+├── docs/                                  # Thesis chapters in Markdown (UNIR structure)
+│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
+│   ├── 01_introduccion.md                 # Ch. 1: Introducción (1.1-1.3)
+│   ├── 02_contexto_estado_arte.md         # Ch. 2: Context and state of the art (2.1-2.3)
+│   ├── 03_objetivos_metodologia.md        # Ch. 3: Objectives and methodology (3.1-3.4)
+│   ├── 04_desarrollo_especifico.md        # Ch. 4: Specific development (4.1-4.3)
+│   ├── 05_conclusiones_trabajo_futuro.md  # Ch. 5: Conclusions (5.1-5.2)
+│   ├── 06_referencias_bibliograficas.md   # Bibliographic references (APA)
+│   └── 07_anexo_a.md                      # Annex A: Source code and data
+├── thesis_output/                         # Generated final document
+│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
+│   └── figures/                           # Figures generated from Mermaid
+│       ├── figura_1.png ... figura_7.png
+│       └── figures_manifest.json
+├── src/
+│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb  # Main experiment
+│   ├── paddle_ocr_tuning.py                     # CLI evaluation script
+│   ├── dataset_manager.py                       # ImageTextDataset class
+│   ├── prepare_dataset.ipynb                    # Dataset preparation
+│   └── raytune_paddle_subproc_results_*.csv     # Results of the 64 trials
+├── results/                               # Benchmark results
+├── instructions/                          # UNIR template and instructions
+│   ├── instrucciones.pdf
+│   ├── plantilla_individual.pdf
+│   └── plantilla_individual.htm
+├── apply_content.py                       # Builds the TFM document from docs/ + template
+├── generate_mermaid_figures.py            # Converts Mermaid diagrams to PNG
+├── ocr_benchmark_notebook.ipynb           # Initial comparative benchmark
+└── README.md
+```
+---
+
+## Key Findings
+
+1. **`textline_orientation=True` is critical**: it reduces CER by 69.7%. For documents with mixed layouts (tables, headings), text-line orientation classification is essential.
+
+2. **The `text_det_thresh` threshold matters**: correlation of -0.52 with CER. Optimal values lie between 0.4 and 0.5; values < 0.1 cause catastrophic failures (CER > 40%).
+
+3. **Unnecessary components for digital PDFs**: `use_doc_orientation_classify` and `use_doc_unwarping` do not improve performance on digital academic documents.
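These correlations can be recomputed from the trial log with pandas, as sketched below; the `cer` and `config/text_det_thresh` column names are assumptions about the CSV layout, not verified here.

```python
# Recompute the text_det_thresh vs. CER correlation from the Ray Tune log.
# Column names are assumed; adjust them to the actual CSV header.
import pandas as pd

trials = pd.read_csv("src/raytune_paddle_subproc_results_20251207_192320.csv")
corr = trials["config/text_det_thresh"].corr(trials["cer"])  # Pearson correlation
print(f"corr(text_det_thresh, CER) = {corr:.2f}")
```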
+---
+
+## Requirements
+
+| Component | Version |
+|-----------|---------|
+| Python | 3.11.9 |
+| PaddlePaddle | 3.2.2 |
+| PaddleOCR | 3.3.2 |
+| Ray | 2.52.1 |
+| Optuna | 4.6.0 |
+| jiwer | (for CER/WER metrics) |
+| PyMuPDF | (for PDF conversion) |
+
+---
+
+## Usage
+
+### Prepare the dataset
+
+```bash
+# Run prepare_dataset.ipynb to convert the PDF into images and extract the ground truth
+jupyter notebook src/prepare_dataset.ipynb
+```
+
+### Run the optimization
+
+```bash
+# Run the main Ray Tune notebook
+jupyter notebook src/paddle_ocr_fine_tune_unir_raytune.ipynb
+```
+
+### Single evaluation
+
+```bash
+python src/paddle_ocr_tuning.py \
+    --pdf-folder ./dataset \
+    --textline-orientation True \
+    --text-det-thresh 0.469 \
+    --text-det-box-thresh 0.541 \
+    --text-rec-score-thresh 0.635
+```
+
+---
+
+## Data Sources
+
+- **Dataset**: "Instrucciones para la elaboración del TFE" (UNIR), 24 pages
+- **Ray Tune results (PRIMARY)**: `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 optimization trials with all metrics and configurations
+
+---
+
+## Generating the TFM Document
+
+### Prerequisites
+
+```bash
+# Install Python dependencies
+pip install beautifulsoup4
+
+# Install mermaid-cli for figure generation
+npm install @mermaid-js/mermaid-cli
+```
+
+### Document Generation Flow
+
+The TFM document is generated in **3 steps** that must be executed in order:
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│ STEP 1: generate_mermaid_figures.py                               │
+│ ────────────────────────────────────────────────────────────────  │
+│ • Reads Mermaid diagrams from docs/*.md                           │
+│ • Generates thesis_output/figures/figura_*.png                    │
+│ • Creates figures_manifest.json with titles                       │
+└──────────────────────────────────────────────────────────────────┘
+                                 ↓
+┌──────────────────────────────────────────────────────────────────┐
+│ STEP 2: apply_content.py                                          │
+│ ────────────────────────────────────────────────────────────────  │
+│ • Reads the template from instructions/plantilla_individual.htm   │
+│ • Inserts the content of docs/*.md into each chapter              │
+│ • Generates APA-formatted tables and figures with references      │
+│ • Saves to thesis_output/plantilla_individual.htm                 │
+└──────────────────────────────────────────────────────────────────┘
+                                 ↓
+┌──────────────────────────────────────────────────────────────────┐
+│ STEP 3: Open in Microsoft Word                                    │
+│ ────────────────────────────────────────────────────────────────  │
+│ • Open thesis_output/plantilla_individual.htm                     │
+│ • Ctrl+A → F9 to update the indices (contents/figures/tables)     │
+│ • Save as TFM_Sergio_Jimenez.docx                                 │
+└──────────────────────────────────────────────────────────────────┘
+```
+
+### Generation Commands
+
+```bash
+# From the project root directory:
+
+# STEP 1: Generate PNG figures from the Mermaid diagrams
+python3 generate_mermaid_figures.py
+# Output: thesis_output/figures/figura_1.png ... figura_8.png
+
+# STEP 2: Apply the docs/ content to the UNIR template
+python3 apply_content.py
+# Output: thesis_output/plantilla_individual.htm
+
+# STEP 3: Open in Word and finish the document
+# - Open thesis_output/plantilla_individual.htm in Microsoft Word
+# - Ctrl+A → F9 to update all indices
+# - IMPORTANT: manually resize the images for legibility
+#   (select image → right click → Size and Position → fit to page width)
+# - Save as .docx
+```
+
+### Important Notes for Editing in Word
+
+1. **Image sizing**: Mermaid figures may need manual resizing to be legible. Select each image and fit it to the text width (~16 cm).
+
+2. **Index updates**: after any change, use Ctrl+A → F9 to regenerate the indices.
+
+3. **Code formatting**: code blocks use Consolas 9 pt. Check that long lines are not cut off.
+
+### Input and Output Files
+
+| Script | Input | Output |
+|--------|-------|--------|
+| `generate_mermaid_figures.py` | `docs/*.md` (```mermaid``` blocks) | `thesis_output/figures/figura_*.png`, `figures_manifest.json` |
+| `apply_content.py` | `instructions/plantilla_individual.htm`, `docs/*.md`, `thesis_output/figures/*.png` | `thesis_output/plantilla_individual.htm` |
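The following sketch only illustrates the kind of extract-and-render loop that `generate_mermaid_figures.py` performs; it assumes mermaid-cli's `mmdc` binary is on the PATH and is not the actual script.

```python
# Illustrative extract-and-render loop for fenced mermaid blocks (not the real script).
# Assumes the mermaid-cli binary `mmdc` is available on the PATH.
import glob
import pathlib
import re
import subprocess

out_dir = pathlib.Path("thesis_output/figures")
out_dir.mkdir(parents=True, exist_ok=True)

counter = 0
for md_file in sorted(glob.glob("docs/*.md")):
    text = pathlib.Path(md_file).read_text(encoding="utf-8")
    for block in re.findall(r"```mermaid\n(.*?)```", text, flags=re.DOTALL):
        counter += 1
        mmd_path = out_dir / f"figura_{counter}.mmd"
        mmd_path.write_text(block, encoding="utf-8")
        png_path = out_dir / f"figura_{counter}.png"
        subprocess.run(["mmdc", "-i", str(mmd_path), "-o", str(png_path)], check=True)
```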
+### Automatically Generated Content
+
+- **30 tables** in APA format (Tabla X. *Title* + Fuente: ...)
+- **8 figures** from Mermaid (Figura X. *Title* + Fuente: Elaboración propia)
+- **25 references** in APA format with hanging indent
+- **Resumen/Abstract** with keywords
+- Updatable **indices** (contents, figures, tables)
+- Automatic removal of the template's instruction text
+
+---
+
+## Remaining Work to Complete the TFM
+
+### Context: Hardware Limitations
+
+This work adopted a **hyperparameter optimization** strategy instead of **fine-tuning** because of:
+
+- **No dedicated GPU**: CPU-only execution
+- **High inference time**: ~69 seconds/page on CPU
+- **Fine-tuning not viable**: training deep learning models without a GPU would take prohibitively long
+
+Hyperparameter optimization proved to be an **effective alternative** to fine-tuning, achieving an 80.9% reduction in CER without retraining the model.
+
+### Completed Tasks
+
+- [x] **docs/ structure following the UNIR template**: all chapters follow the exact numbering (1.1, 1.2, etc.)
+- [x] **Mermaid diagrams added**: 7 diagrams (OCR pipeline, Ray Tune architecture, comparison charts)
+- [x] **Unified TFM document generated**: the `apply_content.py` script builds the complete document from docs/
+- [x] **Mermaid converted to PNG**: the `generate_mermaid_figures.py` script generates the figures automatically
+
+### Pending Tasks
+
+#### 1. Validation of the Approach (High Priority)
+
+- [ ] **Cross-validation on other documents**: evaluate the optimal configuration on other kinds of Spanish documents (invoices, forms, contracts) to verify generalization
+- [ ] **Expand the dataset**: the current dataset has only 24 pages; build a larger, more diverse corpus (at least 100 pages)
+- [ ] **Ground-truth validation**: manually review the automatically extracted reference text to ensure its accuracy
+
+#### 2. Additional Experimentation (Medium Priority)
+
+- [ ] **Explore `text_det_unclip_ratio`**: this parameter was fixed at 0.0; including it in the search space could improve results
+- [ ] **Comparison with fine-tuning** (if GPU access is obtained): quantify the performance gap between hyperparameter optimization and real fine-tuning
+- [ ] **GPU evaluation**: measure inference times with GPU acceleration for production scenarios
+
+#### 3. Documentation and Presentation (High Priority)
+
+- [ ] **Create the presentation**: prepare slides for the TFM defense
+- [ ] **Final document review**: check formatting, indices, and content in Word
+
+#### 4. Future Extensions (Optional)
+
+- [ ] **Automatic configuration tool**: build a tool that automatically determines the optimal configuration for a new document type
+- [ ] **Public benchmark for Spanish**: publish a Spanish-document OCR benchmark that makes it easy to compare solutions
+- [ ] **Multi-objective optimization**: consider CER, WER, and inference time simultaneously
+
+### Recommended Next Steps
+
+1. **Immediately**: open the generated document in Word, update the indices (Ctrl+A, F9), save as .docx
+2. **Short term**: validate on 2-3 additional document types to demonstrate generalization
+3. **For the defense**: create a presentation with result visualizations
+
+---
+
+## License
+
+This project is part of an academic Master's Thesis (TFM).
+
+---
+
+## References
+
+- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
+- [Ray Tune](https://docs.ray.io/en/latest/tune/index.html)
+- [Optuna](https://optuna.org/)
+- [jiwer](https://github.com/jitsi/jiwer)
TFM_Sergio_Jimenez_OCR.docx (new binary file)
TFM_Sergio_Jimenez_OCR.pdf (new binary file)

apply_content.py (new file, 609 lines)
@@ -0,0 +1,609 @@
#!/usr/bin/env python3
"""Replace template content with thesis content from docs/ folder using BeautifulSoup."""

import re
import os
from bs4 import BeautifulSoup, NavigableString

BASE_DIR = '/Users/sergio/Desktop/MastersThesis'
TEMPLATE = os.path.join(BASE_DIR, 'thesis_output/plantilla_individual.htm')
DOCS_DIR = os.path.join(BASE_DIR, 'docs')

# Global counters for tables and figures
table_counter = 0
figure_counter = 0

def read_file(path):
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, 'r', encoding='latin-1') as f:
            return f.read()

def write_file(path, content):
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)

def md_to_html_para(text):
    """Convert markdown inline formatting to HTML."""
    # Bold
    text = re.sub(r'\*\*([^*]+)\*\*', r'<b>\1</b>', text)
    # Italic
    text = re.sub(r'\*([^*]+)\*', r'<i>\1</i>', text)
    # Inline code
    text = re.sub(r'`([^`]+)`', r'<span style="font-family:Consolas;font-size:10pt">\1</span>', text)
    return text

def extract_table_title(lines, current_index):
    """Look for table title in preceding lines (e.g., **Tabla 1.** *Title*)."""
    # Check previous non-empty lines for table title
    for i in range(current_index - 1, max(0, current_index - 5), -1):
        line = lines[i].strip()
        if line.startswith('**Tabla') or line.startswith('*Tabla'):
            return line
        if line and not line.startswith('|'):
            break
    return None

def extract_figure_title_from_mermaid(lines, current_index):
    """Extract title from mermaid diagram or preceding text."""
    # Look for title in mermaid content
    for i in range(current_index + 1, min(len(lines), current_index + 20)):
        line = lines[i].strip()
        if line.startswith('```'):
            break
        if 'title' in line.lower():
            # Extract title from: title "Some Title"
            match = re.search(r'title\s+["\']([^"\']+)["\']', line)
            if match:
                return match.group(1)

    # Check preceding lines for figure reference
    for i in range(current_index - 1, max(0, current_index - 3), -1):
        line = lines[i].strip()
        if line.startswith('**Figura') or 'Figura' in line:
            return line

    return None

def parse_md_to_html_blocks(md_content):
    """Convert markdown content to HTML blocks with template styles."""
    global table_counter, figure_counter

    html_blocks = []
    lines = md_content.split('\n')
    i = 0

    while i < len(lines):
        line = lines[i]

        # Skip empty lines
        if not line.strip():
            i += 1
            continue

        # Mermaid diagram - convert to figure with actual image
        if line.strip().startswith('```mermaid'):
            figure_counter += 1
            mermaid_lines = []
            i += 1
            while i < len(lines) and not lines[i].strip() == '```':
                mermaid_lines.append(lines[i])
                i += 1

            # Try to extract title from mermaid content (YAML format: title: "...")
            mermaid_content = '\n'.join(mermaid_lines)
            # Match YAML format: title: "Title" or title: 'Title'
            title_match = re.search(r'title:\s*["\']([^"\']+)["\']', mermaid_content)
            if not title_match:
                # Fallback to non-YAML format: title "Title"
                title_match = re.search(r'title\s+["\']?([^"\'\n]+)["\']?', mermaid_content)
            if title_match:
                fig_title = title_match.group(1).strip()
            else:
                fig_title = f"Diagrama {figure_counter}"

            # Check if the generated PNG exists
            fig_file = f'figures/figura_{figure_counter}.png'
            fig_path = os.path.join(BASE_DIR, 'thesis_output', fig_file)

            # Create figure with MsoCaption class and proper Word SEQ field for cross-reference
            # Format: "Figura X." in bold, title in italic (per UNIR guidelines)
            # Word TOC looks for text with Caption style - anchor must be outside main caption text
            bookmark_id = f"_Ref_Fig{figure_counter}"
            html_blocks.append(f'''<a name="{bookmark_id}"></a><p class=MsoCaption style="text-align:center"><b><span lang=ES style="font-size:12.0pt;line-height:150%">Figura <!--[if supportFields]><span style='mso-element:field-begin'></span> SEQ Figura \\* ARABIC <span style='mso-element:field-separator'></span><![endif]-->{figure_counter}<!--[if supportFields]><span style='mso-element:field-end'></span><![endif]-->.</span></b><span lang=ES style="font-size:12.0pt;line-height:150%"> </span><i><span lang=ES style="font-size:12.0pt;line-height:150%">{fig_title}</span></i></p>''')

            if os.path.exists(fig_path):
                # Use Word-compatible width in cm (A4 text area is ~16cm wide, use ~12cm max)
                html_blocks.append(f'''<p class=MsoNormal style="text-align:center"><span lang=ES><img style="width:12cm;max-width:100%" src="{fig_file}" alt="{fig_title}"/></span></p>''')
            else:
                # Fallback to placeholder
                html_blocks.append(f'''<p class=MsoNormal style="text-align:center;border:1px dashed #999;padding:20px;margin:10px 40px;background:#f9f9f9"><span lang=ES style="color:#666">[Insertar diagrama Mermaid aquí]</span></p>''')

            html_blocks.append(f'''<p class=Piedefoto-tabla style="margin-left:0cm;text-align:center"><span lang=ES>Fuente: Elaboración propia.</span></p>''')
            html_blocks.append('<p class=MsoNormal><span lang=ES><o:p> </o:p></span></p>')
            i += 1
            continue

        # Code block (non-mermaid)
        if line.strip().startswith('```'):
            code_lang = line.strip()[3:]
            code_lines = []
            i += 1
            while i < len(lines) and not lines[i].strip().startswith('```'):
                code_lines.append(lines[i])
                i += 1
            code = '\n'.join(code_lines)
            # Escape HTML entities in code
            code = code.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
            html_blocks.append(f'<p class=MsoNormal style="margin-left:1cm"><span style="font-family:Consolas;font-size:9pt"><pre>{code}</pre></span></p>')
            i += 1
            continue

        # Headers - ## becomes h2, ### becomes h3
        if line.startswith('####'):
            text = line.lstrip('#').strip()
            html_blocks.append(f'<h4><span lang=ES>{text}</span></h4>')
            i += 1
            continue
        elif line.startswith('###'):
            text = line.lstrip('#').strip()
            html_blocks.append(f'<h3 style="mso-list:l22 level3 lfo18"><span lang=ES style="text-transform:none">{text}</span></h3>')
            i += 1
            continue
        elif line.startswith('##'):
            text = line.lstrip('#').strip()
            html_blocks.append(f'<h2 style="mso-list:l22 level2 lfo18"><span lang=ES style="text-transform:none">{text}</span></h2>')
            i += 1
            continue
        elif line.startswith('#'):
            # Skip h1 - we keep the original
            i += 1
            continue

        # Table - check for table title pattern first
        if '|' in line and i + 1 < len(lines) and '---' in lines[i + 1]:
            table_counter += 1

            # Check if previous line has table title (e.g., **Tabla 1.** *Title*)
            table_title = None
            table_source = "Elaboración propia"

            # Look back for table title
            for j in range(i - 1, max(0, i - 5), -1):
                prev_line = lines[j].strip()
                if prev_line.startswith('**Tabla') or prev_line.startswith('*Tabla'):
                    # Extract title text
                    table_title = re.sub(r'\*+', '', prev_line).strip()
                    break
                elif prev_line and not prev_line.startswith('|'):
                    break

            # Parse table
            table_lines = []
            while i < len(lines) and '|' in lines[i]:
                if '---' not in lines[i]:
                    table_lines.append(lines[i])
                i += 1

            # Look ahead for source
            if i < len(lines) and 'Fuente:' in lines[i]:
                table_source = lines[i].replace('*', '').replace('Fuente:', '').strip()
                i += 1

            # Add table title with MsoCaption class and proper Word SEQ field for cross-reference
            # Format: "Tabla X." in bold, title in italic (per UNIR guidelines)
            # Word TOC looks for text with Caption style - anchor must be outside main caption text
            bookmark_id = f"_Ref_Tab{table_counter}"
            if table_title:
                clean_title = table_title.replace(f"Tabla {table_counter}.", "").strip()
            else:
                clean_title = "Tabla de datos."
            html_blocks.append(f'''<a name="{bookmark_id}"></a><p class=MsoCaption><b><span lang=ES style="font-size:12.0pt;line-height:150%">Tabla <!--[if supportFields]><span style='mso-element:field-begin'></span> SEQ Tabla \\* ARABIC <span style='mso-element:field-separator'></span><![endif]-->{table_counter}<!--[if supportFields]><span style='mso-element:field-end'></span><![endif]-->.</span></b><span lang=ES style="font-size:12.0pt;line-height:150%"> </span><i><span lang=ES style="font-size:12.0pt;line-height:150%">{clean_title}</span></i></p>''')

            # Build table HTML with APA style (horizontal lines only, no vertical)
            table_html = '<table class=MsoTableGrid border=0 cellspacing=0 cellpadding=0 style="border-collapse:collapse;border:none">'
            for j, tline in enumerate(table_lines):
                cells = [c.strip() for c in tline.split('|')[1:-1]]
                table_html += '<tr>'
                for cell in cells:
                    if j == 0:
                        # Header row: top and bottom border, bold text
                        table_html += f'<td style="border-top:solid windowtext 1.0pt;border-bottom:solid windowtext 1.0pt;border-left:none;border-right:none;padding:5px"><p class=MsoNormal style="margin:0"><b><span lang=ES>{md_to_html_para(cell)}</span></b></p></td>'
                    elif j == len(table_lines) - 1:
                        # Last row: bottom border only
                        table_html += f'<td style="border-top:none;border-bottom:solid windowtext 1.0pt;border-left:none;border-right:none;padding:5px"><p class=MsoNormal style="margin:0"><span lang=ES>{md_to_html_para(cell)}</span></p></td>'
                    else:
                        # Middle rows: no borders
                        table_html += f'<td style="border:none;padding:5px"><p class=MsoNormal style="margin:0"><span lang=ES>{md_to_html_para(cell)}</span></p></td>'
                table_html += '</tr>'
            table_html += '</table>'
            html_blocks.append(table_html)

            # Add source with proper template format
            html_blocks.append(f'<p class=Piedefoto-tabla style="margin-left:0cm"><span lang=ES>Fuente: {table_source}.</span></p>')
            html_blocks.append('<p class=MsoNormal><span lang=ES><o:p> </o:p></span></p>')
            continue

        # Blockquote
        if line.startswith('>'):
            quote_text = line[1:].strip()
            i += 1
            while i < len(lines) and lines[i].startswith('>'):
                quote_text += ' ' + lines[i][1:].strip()
                i += 1
            html_blocks.append(f'<p class=MsoNormal style="margin-left:2cm;margin-right:1cm"><i><span lang=ES>{md_to_html_para(quote_text)}</span></i></p>')
            continue

        # Bullet list
        if re.match(r'^[\-\*\+]\s', line):
            while i < len(lines) and re.match(r'^[\-\*\+]\s', lines[i]):
                item_text = lines[i][2:].strip()
                html_blocks.append(f'<p class=MsoListParagraphCxSpMiddle style="margin-left:36pt;text-indent:-18pt"><span lang=ES style="font-family:Symbol">·</span><span lang=ES style="font-size:7pt"> </span><span lang=ES>{md_to_html_para(item_text)}</span></p>')
                i += 1
            continue

        # Numbered list
        if re.match(r'^\d+\.\s', line):
            num = 1
            while i < len(lines) and re.match(r'^\d+\.\s', lines[i]):
                item_text = re.sub(r'^\d+\.\s*', '', lines[i]).strip()
                html_blocks.append(f'<p class=MsoListParagraphCxSpMiddle style="margin-left:36pt;text-indent:-18pt"><span lang=ES>{num}.<span style="font-size:7pt"> </span>{md_to_html_para(item_text)}</span></p>')
                num += 1
                i += 1
            continue

        # Skip lines that are just table/figure titles (they'll be handled with the table/figure)
        if line.strip().startswith('**Tabla') or line.strip().startswith('*Tabla'):
            i += 1
            continue
        if line.strip().startswith('**Figura') or line.strip().startswith('*Figura'):
            i += 1
            continue
        if line.strip().startswith('*Fuente:') or line.strip().startswith('Fuente:'):
            i += 1
            continue

        # Regular paragraph
        para_lines = [line]
        i += 1
        while i < len(lines) and lines[i].strip() and not lines[i].startswith('#') and not lines[i].startswith('```') and not lines[i].startswith('>') and not re.match(r'^[\-\*\+]\s', lines[i]) and not re.match(r'^\d+\.\s', lines[i]) and '|' not in lines[i]:
            para_lines.append(lines[i])
            i += 1

        para_text = ' '.join(para_lines)
        html_blocks.append(f'<p class=MsoNormal><span lang=ES>{md_to_html_para(para_text)}</span></p>')

    return '\n\n'.join(html_blocks)

def extract_section_content(md_content):
    """Extract content from markdown, skipping the first # header."""
    md_content = re.sub(r'^#\s+[^\n]+\n+', '', md_content, count=1)
    return parse_md_to_html_blocks(md_content)

def find_section_element(soup, keyword):
    """Find element containing keyword (h1 or special paragraph classes)."""
    # First try h1
    for h1 in soup.find_all('h1'):
        text = h1.get_text()
        if keyword.lower() in text.lower():
            return h1

    # Try special paragraph classes for unnumbered sections
    for p in soup.find_all('p', class_=['Ttulo1sinnumerar', 'Anexo', 'MsoNormal']):
        text = p.get_text()
        if keyword.lower() in text.lower():
            classes = p.get('class', [])
            if 'Ttulo1sinnumerar' in classes or 'Anexo' in classes:
                return p
            if re.match(r'^\d+\.?\s', text.strip()):
                return p
    return None

def remove_elements_between(start_elem, end_elem):
    """Remove all elements between start and end (exclusive)."""
    current = start_elem.next_sibling
    elements_to_remove = []
    while current and current != end_elem:
        elements_to_remove.append(current)
        current = current.next_sibling
    for elem in elements_to_remove:
        if hasattr(elem, 'decompose'):
            elem.decompose()
        elif isinstance(elem, NavigableString):
            elem.extract()

def format_references(refs_content):
    """Format references with proper MsoBibliography style."""
    refs_content = refs_content.replace('# Referencias bibliográficas {.unnumbered}', '').strip()
    refs_html = ''

    for line in refs_content.split('\n\n'):
        line = line.strip()
        if not line:
            continue

        # Apply markdown formatting
        formatted = md_to_html_para(line)

        # Use MsoBibliography style with hanging indent (36pt indent, -36pt text-indent)
        refs_html += f'''<p class=MsoBibliography style="margin-left:36.0pt;text-indent:-36.0pt"><span lang=ES>{formatted}</span></p>\n'''

    return refs_html

def extract_resumen_parts(resumen_content):
    """Extract Spanish resumen and English abstract from 00_resumen.md"""
    parts = resumen_content.split('---')

    spanish_part = parts[0] if len(parts) > 0 else ''
    english_part = parts[1] if len(parts) > 1 else ''

    # Extract Spanish content
    spanish_text = ''
    spanish_keywords = ''
    if '**Palabras clave:**' in spanish_part:
        text_part, kw_part = spanish_part.split('**Palabras clave:**')
        spanish_text = text_part.replace('# Resumen', '').strip()
        spanish_keywords = kw_part.strip()
    else:
        spanish_text = spanish_part.replace('# Resumen', '').strip()

    # Extract English content
    english_text = ''
    english_keywords = ''
    if '**Keywords:**' in english_part:
        text_part, kw_part = english_part.split('**Keywords:**')
        english_text = text_part.replace('# Abstract', '').strip()
        english_keywords = kw_part.strip()
    else:
        english_text = english_part.replace('# Abstract', '').strip()

    return spanish_text, spanish_keywords, english_text, english_keywords

def main():
    global table_counter, figure_counter

    print("Reading template...")
    html_content = read_file(TEMPLATE)
    soup = BeautifulSoup(html_content, 'html.parser')

    print("Reading docs content...")
    docs = {
        'resumen': read_file(os.path.join(DOCS_DIR, '00_resumen.md')),
        'intro': read_file(os.path.join(DOCS_DIR, '01_introduccion.md')),
        'contexto': read_file(os.path.join(DOCS_DIR, '02_contexto_estado_arte.md')),
        'objetivos': read_file(os.path.join(DOCS_DIR, '03_objetivos_metodologia.md')),
        'desarrollo': read_file(os.path.join(DOCS_DIR, '04_desarrollo_especifico.md')),
        'conclusiones': read_file(os.path.join(DOCS_DIR, '05_conclusiones_trabajo_futuro.md')),
        'referencias': read_file(os.path.join(DOCS_DIR, '06_referencias_bibliograficas.md')),
        'anexo': read_file(os.path.join(DOCS_DIR, '07_anexo_a.md')),
    }

    # Extract resumen and abstract
    spanish_text, spanish_kw, english_text, english_kw = extract_resumen_parts(docs['resumen'])

    # Replace title
    print("Replacing title...")
    for elem in soup.find_all(string=re.compile(r'Título del TFE', re.IGNORECASE)):
        elem.replace_with(elem.replace('Título del TFE', 'Optimización de Hiperparámetros OCR con Ray Tune para Documentos Académicos en Español'))

    # Replace Resumen section
    print("Replacing Resumen...")
    resumen_title = soup.find('p', class_='Ttulondices', string=re.compile(r'Resumen'))
    if resumen_title:
        # Find and replace content after Resumen title until Abstract
        current = resumen_title.find_next_sibling()
        elements_to_remove = []
        while current:
            text = current.get_text() if hasattr(current, 'get_text') else str(current)
            if 'Abstract' in text and current.name == 'p' and 'Ttulondices' in str(current.get('class', [])):
                break
            elements_to_remove.append(current)
            current = current.find_next_sibling()

        for elem in elements_to_remove:
            if hasattr(elem, 'decompose'):
                elem.decompose()

        # Insert new resumen content
        resumen_html = f'''<p class=MsoNormal><span lang=ES>{spanish_text}</span></p>
<p class=MsoNormal><span lang=ES><o:p> </o:p></span></p>
<p class=MsoNormal><b><span lang=ES>Palabras clave:</span></b><span lang=ES> {spanish_kw}</span></p>
<p class=MsoNormal><span lang=ES><o:p> </o:p></span></p>'''
        resumen_soup = BeautifulSoup(resumen_html, 'html.parser')
        insert_point = resumen_title
        for new_elem in reversed(list(resumen_soup.children)):
            insert_point.insert_after(new_elem)
        print(" ✓ Replaced Resumen")

    # Replace Abstract section
    print("Replacing Abstract...")
    abstract_title = soup.find('p', class_='Ttulondices', string=re.compile(r'Abstract'))
    if abstract_title:
        # Find and replace content after Abstract title until next major section
        current = abstract_title.find_next_sibling()
        elements_to_remove = []
        while current:
            # Stop at page break or next title
            if current.name == 'span' and 'page-break' in str(current):
                break
            text = current.get_text() if hasattr(current, 'get_text') else str(current)
            if current.name == 'p' and ('Ttulondices' in str(current.get('class', [])) or 'MsoToc' in str(current.get('class', []))):
                break
            elements_to_remove.append(current)
            current = current.find_next_sibling()

        for elem in elements_to_remove:
            if hasattr(elem, 'decompose'):
                elem.decompose()

        # Insert new abstract content
        abstract_html = f'''<p class=MsoNormal><span lang=EN-US>{english_text}</span></p>
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>
<p class=MsoNormal><b><span lang=EN-US>Keywords:</span></b><span lang=EN-US> {english_kw}</span></p>
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>'''
        abstract_soup = BeautifulSoup(abstract_html, 'html.parser')
        insert_point = abstract_title
        for new_elem in reversed(list(abstract_soup.children)):
            insert_point.insert_after(new_elem)
        print(" ✓ Replaced Abstract")

    # Remove "Importante" callout boxes (template instructions)
    print("Removing template instructions...")
    for div in soup.find_all('div'):
        text = div.get_text()
        if 'Importante:' in text and 'extensión mínima' in text:
            div.decompose()
            print(" ✓ Removed 'Importante' box")

    # Remove "Ejemplo de nota al pie" footnote
    for elem in soup.find_all(string=re.compile(r'Ejemplo de nota al pie')):
        parent = elem.parent
        if parent:
            # Find the footnote container and remove it
            while parent and parent.name != 'p':
                parent = parent.parent
            if parent:
                parent.decompose()
                print(" ✓ Removed footnote example")

    # Clear old figure/table index entries (they need to be regenerated in Word)
    print("Clearing old index entries...")

    # Remove ALL content from MsoTof paragraphs that reference template examples
    # The indices will be regenerated when user opens in Word and presses Ctrl+A, F9
    for p in soup.find_all('p', class_='MsoTof'):
        text = p.get_text()
        # Check for figure index entries with template examples
        if 'Figura' in text and 'Ejemplo' in text:
            # Remove all <a> tags (the actual index entry links)
            for a in p.find_all('a'):
                a.decompose()
            # Also remove any remaining text content that shows the example
            for span in p.find_all('span', style=lambda x: x and 'mso-no-proof' in str(x)):
                if 'Ejemplo' in span.get_text():
                    span.decompose()
            print(" ✓ Cleared figure index example entry")
        # Check for table index entries with template examples
        if 'Tabla' in text and 'Ejemplo' in text:
            for a in p.find_all('a'):
                a.decompose()
            for span in p.find_all('span', style=lambda x: x and 'mso-no-proof' in str(x)):
                if 'Ejemplo' in span.get_text():
                    span.decompose()
            print(" ✓ Cleared table index example entry")

    # Remove old figure index entries that reference template examples
    for p in soup.find_all('p', class_='MsoToc3'):
        text = p.get_text()
        if 'Figura 1. Ejemplo' in text or 'Tabla 1. Ejemplo' in text:
            p.decompose()
            print(" ✓ Removed template index entry")

    # Also clear the specific figure/table from template
    for p in soup.find_all('p', class_='Imagencentrada'):
        p.decompose()
        print(" ✓ Removed template figure placeholder")

    # Remove template table example
    for table in soup.find_all('table', class_='MsoTableGrid'):
        # Check if this is the template example table
        text = table.get_text()
        if 'Celda 1' in text or 'Encabezado 1' in text:
            # Also remove surrounding caption and source
            prev_sib = table.find_previous_sibling()
            next_sib = table.find_next_sibling()
            if prev_sib and 'Tabla 1. Ejemplo' in prev_sib.get_text():
                prev_sib.decompose()
            if next_sib and 'Fuente:' in next_sib.get_text():
                next_sib.decompose()
            table.decompose()
            print(" ✓ Removed template table example")
            break

    # Define chapters with their keywords and next chapter keywords
    chapters = [
        ('Introducción', 'intro', 'Contexto'),
        ('Contexto', 'contexto', 'Objetivos'),
        ('Objetivos', 'objetivos', 'Desarrollo'),
        ('Desarrollo', 'desarrollo', 'Conclusiones'),
        ('Conclusiones', 'conclusiones', 'Referencias'),
    ]

    print("Replacing chapter contents...")
    for chapter_keyword, doc_key, next_keyword in chapters:
        print(f" Processing: {chapter_keyword}")

        # Reset counters for consistent numbering per chapter (optional - remove if you want global numbering)
        # table_counter = 0
        # figure_counter = 0

        start_elem = find_section_element(soup, chapter_keyword)
        end_elem = find_section_element(soup, next_keyword)

        if start_elem and end_elem:
            remove_elements_between(start_elem, end_elem)
            new_content_html = extract_section_content(docs[doc_key])
            new_soup = BeautifulSoup(new_content_html, 'html.parser')
            insert_point = start_elem
            for new_elem in reversed(list(new_soup.children)):
                insert_point.insert_after(new_elem)
            print(f" ✓ Replaced content")
        else:
            if not start_elem:
                print(f" Warning: Could not find start element for {chapter_keyword}")
            if not end_elem:
                print(f" Warning: Could not find end element for {next_keyword}")

    # Handle Referencias
    print(" Processing: Referencias bibliográficas")
    refs_start = find_section_element(soup, 'Referencias')
    anexo_elem = find_section_element(soup, 'Anexo')

    if refs_start and anexo_elem:
        remove_elements_between(refs_start, anexo_elem)
        refs_html = format_references(docs['referencias'])
        refs_soup = BeautifulSoup(refs_html, 'html.parser')
        insert_point = refs_start
        for new_elem in reversed(list(refs_soup.children)):
            insert_point.insert_after(new_elem)
        print(f" ✓ Replaced content")

    # Handle Anexo (last section)
    print(" Processing: Anexo")
    if anexo_elem:
        body = soup.find('body')
        if body:
            current = anexo_elem.next_sibling
            while current:
                next_elem = current.next_sibling
                if hasattr(current, 'decompose'):
                    current.decompose()
                elif isinstance(current, NavigableString):
                    current.extract()
                current = next_elem

        anexo_content = extract_section_content(docs['anexo'])
        anexo_soup = BeautifulSoup(anexo_content, 'html.parser')
        insert_point = anexo_elem
        for new_elem in reversed(list(anexo_soup.children)):
            insert_point.insert_after(new_elem)
        print(f" ✓ Replaced content")

    print(f"\nSummary: {table_counter} tables, {figure_counter} figures processed")

    print("Saving modified template...")
    output_html = str(soup)
    write_file(TEMPLATE, output_html)

    print(f"✓ Done! Modified: {TEMPLATE}")
    print("\nTo convert to DOCX:")
    print("1. Open the .htm file in Microsoft Word")
    print("2. Replace [Insertar diagrama Mermaid aquí] placeholders with actual diagrams")
    print("3. Update indices: Select all (Ctrl+A) then press F9 to update fields")
    print("   - This will regenerate: Índice de contenidos, Índice de figuras, Índice de tablas")
    print("4. Save as .docx")

if __name__ == '__main__':
    main()
claude.md (new file, 543 lines)
@@ -0,0 +1,543 @@
# Claude Code Context - Masters Thesis OCR Project

## Project Overview

This is a **Master's Thesis (TFM)** for UNIR's Master in Artificial Intelligence. The project focuses on **OCR hyperparameter optimization** using Ray Tune with Optuna for Spanish academic documents.

**Author:** Sergio Jiménez Jiménez
**University:** UNIR (Universidad Internacional de La Rioja)
**Year:** 2025

## Key Context

### Why Hyperparameter Optimization Instead of Fine-tuning

Due to **hardware limitations** (no dedicated GPU, CPU-only execution), the project pivoted from fine-tuning to hyperparameter optimization:

- Fine-tuning deep learning models without GPU is prohibitively slow
- Inference time is ~69 seconds/page on CPU
- Hyperparameter optimization proved to be an effective alternative, achieving 80.9% CER reduction

### Main Results

| Model | CER | Character Accuracy |
|-------|-----|--------------------|
| PaddleOCR Baseline | 7.78% | 92.22% |
| PaddleOCR-HyperAdjust | **1.49%** | **98.51%** |

**Goal achieved:** CER < 2% (target was < 2%, result is 1.49%)

### Optimal Configuration Found

```python
config_optimizada = {
    "textline_orientation": True,   # CRITICAL - reduces CER ~70%
    "use_doc_orientation_classify": False,
    "use_doc_unwarping": False,
    "text_det_thresh": 0.4690,
    "text_det_box_thresh": 0.5412,
    "text_det_unclip_ratio": 0.0,
    "text_rec_score_thresh": 0.6350,
}
```

### Key Findings

1. `textline_orientation=True` is the most impactful parameter (reduces CER by 69.7%)
2. `text_det_thresh` has -0.52 correlation with CER; values < 0.1 cause catastrophic failures
3. Document correction modules (`use_doc_orientation_classify`, `use_doc_unwarping`) are unnecessary for digital PDFs

## Repository Structure

```
MastersThesis/
├── docs/                                  # Thesis chapters in Markdown (UNIR template structure)
│   ├── 00_resumen.md                      # Resumen + Abstract + Keywords
│   ├── 01_introduccion.md                 # 1. Introducción (1.1, 1.2, 1.3)
│   ├── 02_contexto_estado_arte.md         # 2. Contexto y estado del arte (2.1, 2.2, 2.3)
│   ├── 03_objetivos_metodologia.md        # 3. Objetivos y metodología (3.1, 3.2, 3.3, 3.4)
│   ├── 04_desarrollo_especifico.md        # 4. Desarrollo específico (4.1, 4.2, 4.3)
│   ├── 05_conclusiones_trabajo_futuro.md  # 5. Conclusiones (5.1, 5.2)
│   ├── 06_referencias_bibliograficas.md   # Referencias bibliográficas (APA format)
│   └── 07_anexo_a.md                      # Anexo A: Código fuente y datos
├── thesis_output/                         # Generated thesis document
│   ├── plantilla_individual.htm           # Complete TFM (open in Word)
│   └── figures/                           # PNG figures from Mermaid diagrams
│       ├── figura_1.png ... figura_7.png
│       └── figures_manifest.json
├── src/
│   ├── paddle_ocr_fine_tune_unir_raytune.ipynb              # Main experiment (64 trials)
│   ├── paddle_ocr_tuning.py                                 # CLI evaluation script
│   ├── dataset_manager.py                                   # ImageTextDataset class
│   ├── prepare_dataset.ipynb                                # Dataset preparation
│   └── raytune_paddle_subproc_results_20251207_192320.csv  # 64 trial results
├── results/                               # Benchmark results CSVs
├── instructions/                          # UNIR instructions and template
│   ├── instrucciones.pdf                  # TFE writing guidelines
│   ├── plantilla_individual.pdf           # Word template (PDF version)
│   └── plantilla_individual.htm           # Word template (HTML version, source)
├── apply_content.py                       # Generates TFM document from docs/ + template
├── generate_mermaid_figures.py            # Converts Mermaid diagrams to PNG
├── ocr_benchmark_notebook.ipynb           # Initial OCR benchmark
└── README.md
```

### docs/ to Template Mapping

The template (`plantilla_individual.pdf`) requires **5 chapters**. The docs/ files now match this structure exactly:

| Template Section | docs/ File | Notes |
|-----------------|------------|-------|
| Resumen | `00_resumen.md` (Spanish part) | 150-300 words + Palabras clave |
| Abstract | `00_resumen.md` (English part) | 150-300 words + Keywords |
| 1. Introducción | `01_introduccion.md` | Subsections 1.1, 1.2, 1.3 |
| 2. Contexto y estado del arte | `02_contexto_estado_arte.md` | Subsections 2.1, 2.2, 2.3 + Mermaid diagrams |
| 3. Objetivos y metodología | `03_objetivos_metodologia.md` | Subsections 3.1, 3.2, 3.3, 3.4 + Mermaid diagrams |
| 4. Desarrollo específico | `04_desarrollo_especifico.md` | Subsections 4.1, 4.2, 4.3 + Mermaid charts |
| 5. Conclusiones y trabajo futuro | `05_conclusiones_trabajo_futuro.md` | Subsections 5.1, 5.2 |
| Referencias bibliográficas | `06_referencias_bibliograficas.md` | APA, alphabetical |
| Anexo A | `07_anexo_a.md` | Repository URL + structure |

## Important Data Files

### Results CSV Files

- `src/raytune_paddle_subproc_results_20251207_192320.csv` - 64 Ray Tune trials with configs and metrics (PRIMARY DATA SOURCE)
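A quick way to pull the best trial out of this CSV, sketched with pandas (the `cer` column name and the `config/` prefix are assumptions about the log format):

```python
# Load the primary results CSV and show the best trial's configuration.
# The "cer" column and "config/" prefix are assumed; check the real header first.
import pandas as pd

df = pd.read_csv("src/raytune_paddle_subproc_results_20251207_192320.csv")
best = df.loc[df["cer"].idxmin()]
print("Best CER:", best["cer"])
print(best.filter(like="config/"))  # hyperparameters of the best trial
```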
### Key Notebooks

- `src/paddle_ocr_fine_tune_unir_raytune.ipynb` - Main Ray Tune experiment
- `src/prepare_dataset.ipynb` - PDF to image/text conversion
- `ocr_benchmark_notebook.ipynb` - EasyOCR vs PaddleOCR vs DocTR comparison

## Technical Stack

| Component | Version |
|-----------|---------|
| Python | 3.11.9 |
| PaddlePaddle | 3.2.2 |
| PaddleOCR | 3.3.2 |
| Ray | 2.52.1 |
| Optuna | 4.6.0 |

## Pending Work

### Completed Tasks

- [x] **Structure docs/ to match UNIR template** - All chapters now follow exact numbering (1.1, 1.2, etc.)
- [x] **Add Mermaid diagrams** - 7 diagrams added (OCR pipeline, Ray Tune architecture, methodology flowcharts, CER comparison charts)
- [x] **Generate unified thesis document** - `apply_content.py` generates complete document from docs/
- [x] **Convert Mermaid to PNG** - `generate_mermaid_figures.py` generates figures automatically
- [x] **Proper template formatting** - Tables/figures use `Piedefoto-tabla` class, references use `MsoBibliography`

### Priority Tasks

1. **Validate on other document types** - Test optimal config on invoices, forms, contracts
2. **Expand dataset** - Current dataset has only 24 pages
3. **Create presentation slides** - For thesis defense
4. **Final document review** - Open in Word, update indices (Ctrl+A, F9), verify formatting

### Optional Extensions

- Explore `text_det_unclip_ratio` parameter (was fixed at 0.0)
- Compare with actual fine-tuning (if GPU access obtained)
- Multi-objective optimization (CER + WER + inference time)

## Thesis Document Generation

To regenerate the thesis document:

```bash
# 1. Generate PNG figures from Mermaid diagrams
python3 generate_mermaid_figures.py

# 2. Apply docs/ content to UNIR template
python3 apply_content.py

# 3. Open in Word and finalize
# - Open thesis_output/plantilla_individual.htm in Microsoft Word
# - Press Ctrl+A then F9 to update all indices
# - Save as .docx
```

**What `apply_content.py` does:**

- Replaces Resumen and Abstract with actual content + keywords
- Replaces all 5 chapters with content from docs/
- Replaces Referencias with APA-formatted bibliography
- Replaces Anexo with repository information
- Converts Mermaid diagrams to embedded PNG images
- Formats tables with `Piedefoto-tabla` captions and sources
- Removes template instruction text ("Importante:", "Ejemplo de nota al pie", etc.)

---

## UNIR TFE Document Guidelines

**CRITICAL:** The thesis MUST follow UNIR's official template (`instructions/plantilla_individual.pdf`) and guidelines (`instructions/instrucciones.pdf`).

### Work Type Classification

This thesis is a **hybrid of Type 1 (Piloto experimental) and Type 3 (Comparativa de soluciones)**:

- Comparative study of OCR solutions (EasyOCR, PaddleOCR, DocTR)
- Experimental pilot with Ray Tune hyperparameter optimization
- 64 trials executed, results analyzed statistically

### Document Structure (from plantilla_individual.pdf - MANDATORY)

The TFE must follow this EXACT structure from the official template:

| Section | Subsections | Notes |
|---------|-------------|-------|
| **Portada** | Title, Author, Type, Director, Date | Use template format exactly |
| **Resumen** | 150-300 words + 3-5 Palabras clave | Spanish summary |
| **Abstract** | 150-300 words + 3-5 Keywords | English summary |
| **Índice de contenidos** | Auto-generated | New page |
| **Índice de figuras** | Auto-generated | New page |
| **Índice de tablas** | Auto-generated | New page |
| **1. Introducción** | 1.1 Motivación, 1.2 Planteamiento del trabajo, 1.3 Estructura del trabajo | 3-5 pages |
| **2. Contexto y estado del arte** | 2.1 Contexto del problema, 2.2 Estado del arte, 2.3 Conclusiones | 10-15 pages |
| **3. Objetivos concretos y metodología** | 3.1 Objetivo general, 3.2 Objetivos específicos, 3.3 Metodología del trabajo | Variable |
| **4. Desarrollo específico** | Varies by work type (see below) | Main content |
| **5. Conclusiones y trabajo futuro** | 5.1 Conclusiones, 5.2 Líneas de trabajo futuro | Variable |
| **Referencias bibliográficas** | APA format, alphabetical, hanging indent | Variable |
| **Anexo A** | Código fuente y datos analizados | Repository URL |

**Total length:** 50-90 pages (excluding cover, resumen, abstract, indices, annexes)

### Chapter-Specific Requirements (from plantilla_individual.pdf)

#### 1. Introducción

The introduction must give a clear first idea of what was intended, the conclusions reached, and the procedure followed. Key ideas: problem identification, justification of importance, general objectives, preview of contribution.

**1.1 Motivación:**

- Present the problem to solve
- Justify importance to educational/scientific community
- Answer: What problem? What are the causes? Why is it relevant?
- Must include references to prior research

**1.2 Planteamiento del trabajo:**

- Briefly state the problem/need detected
- Describe the proposal and purpose
- Answer: How to solve? What is proposed?

**1.3 Estructura del trabajo:**

- Briefly describe what each subsequent chapter contains

#### 2. Contexto y estado del arte

Study the application domain in depth, citing numerous references. Must consult different sources (not just online - also technical manuals, books).

**2.1 Contexto del problema:**

- Deep study of the application domain

**2.2 Estado del arte:**

- Antecedents, current studies, comparison of existing tools
- Must reference key authors in the field (justify exclusions)

**2.3 Conclusiones:**
||||||
|
- Summary linking research to the work to be done
|
||||||
|
- How findings affect the specific development
|
||||||
|
|
||||||
|
#### 3. Objetivos concretos y metodología de trabajo
|
||||||
|
Bridge between domain study and contribution. Three required elements: (1) general objective, (2) specific objectives, (3) methodology.
|
||||||
|
|
||||||
|
**3.1 Objetivo general:**
|
||||||
|
- Must be SMART (Doran, 1981)
|
||||||
|
- Focus on achieving an observable effect, not just "create a tool"
|
||||||
|
- Example: "Mejorar el servicio X logrando Y valorado positivamente (mínimo 4/5) por Z"
|
||||||
|
|
||||||
|
**3.2 Objetivos específicos:**
|
||||||
|
- Divide general objective into analyzable sub-objectives
|
||||||
|
- Must be SMART
|
||||||
|
- Use infinitive verbs: Analizar, Calcular, Clasificar, Comparar, Conocer, Cuantificar, Desarrollar, Describir, Descubrir, Determinar, Establecer, Explorar, Identificar, Indagar, Medir, Sintetizar, Verificar
|
||||||
|
- Typically ~5 objectives: 1-2 about state of art, 2-3 about development
|
||||||
|
|
||||||
|
**3.3 Metodología del trabajo:**
|
||||||
|
- Describe steps to achieve objectives
|
||||||
|
- Explain WHY each step
|
||||||
|
- What instruments will be used
|
||||||
|
- How results will be analyzed
|
||||||
|
|
||||||
|
#### 4. Desarrollo específico de la contribución
|
||||||
|
Structure depends on work type. Organize by methodology phases/activities.
|
||||||
|
|
||||||
|
**For Type 1 (Piloto experimental):**
|
||||||
|
- 4.1 Descripción detallada del experimento
|
||||||
|
- Technologies used (with justification)
|
||||||
|
- How pilot was organized
|
||||||
|
- Participants (demographics)
|
||||||
|
- Automatic evaluation techniques
|
||||||
|
- How experiment proceeded
|
||||||
|
- Monitoring/evaluation instruments
|
||||||
|
- Statistical analysis types
|
||||||
|
- 4.2 Descripción de los resultados (objective, no interpretation)
|
||||||
|
- Summary tables, result graphs, relevant data identification
|
||||||
|
- 4.3 Discusión
|
||||||
|
- Relevance of results, explanations for anomalies, highlight key findings
|
||||||
|
|
||||||
|
**For Type 3 (Comparativa de soluciones):**
|
||||||
|
- 4.1 Planteamiento de la comparativa
|
||||||
|
- Problem identification, alternative solutions to evaluate
|
||||||
|
- Success criteria, measures to take
|
||||||
|
- 4.2 Desarrollo de la comparativa
|
||||||
|
- All results and measurements obtained
|
||||||
|
- Graphs, tables, data visualization
|
||||||
|
- 4.3 Discusión y análisis de resultados
|
||||||
|
- Discussion of meaning, advantages/disadvantages of solutions
|
||||||
|
|
||||||
|
#### 5. Conclusiones y trabajo futuro
|
||||||
|
|
||||||
|
**5.1 Conclusiones:**
|
||||||
|
- Summary of problem, approach, and why solution is valid
|
||||||
|
- Summary of contributions
|
||||||
|
- **Relate contributions and results to objectives** - discuss degree of achievement
|
||||||
|
|
||||||
|
**5.2 Líneas de trabajo futuro:**
|
||||||
|
- Future work that would add value
|
||||||
|
- Justify how contribution can be used and in what fields
|
||||||
|
|
||||||
|
### SMART Objectives Requirements
|
||||||
|
|
||||||
|
ALL objectives (general and specific) MUST be SMART:
|
||||||
|
|
||||||
|
| Criterion | Requirement | Example from this thesis |
|
||||||
|
|-----------|-------------|-------------------------|
|
||||||
|
| **S**pecific | Clearly define what to achieve | "Optimizar PaddleOCR para documentos en español" |
|
||||||
|
| **M**easurable | Quantifiable success metric | "CER < 2%" |
|
||||||
|
| **A**ttainable | Feasible with available resources | "Sin GPU, usando optimización de hiperparámetros" |
|
||||||
|
| **R**elevant | Demonstrable impact | "Mejora extracción de texto en documentos académicos" |
|
||||||
|
| **T**ime-bound | Achievable in timeframe | "Un cuatrimestre" |
|
||||||
|
|
||||||
|
### Citation and Reference Rules
|
||||||
|
|
||||||
|
#### APA Format is MANDATORY
|
||||||
|
|
||||||
|
Reference guide: https://bibliografiaycitas.unir.net/
|
||||||
|
|
||||||
|
**In-text citations:**
|
||||||
|
- Single author: (Du, 2020) or Du (2020)
|
||||||
|
- Two authors: (Du & Li, 2020)
|
||||||
|
- Three+ authors: (Du et al., 2020)
|
||||||
|
|
||||||
|
**Reference list examples:**
|
||||||
|
```
# Journal article with DOI
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network
    for image-based sequence recognition. IEEE Transactions on Pattern
    Analysis and Machine Intelligence, 39(11), 2298-2304.
    https://doi.org/10.1109/TPAMI.2016.2646371

# Conference paper
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna:
    A next-generation hyperparameter optimization framework. Proceedings
    of the 25th ACM SIGKDD, 2623-2631.
    https://doi.org/10.1145/3292500.3330701

# arXiv preprint
Du, Y., Li, C., Guo, R., ... & Wang, H. (2020). PP-OCR: A practical ultra
    lightweight OCR system. arXiv preprint arXiv:2009.09941.
    https://arxiv.org/abs/2009.09941

# Software/GitHub repository
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based
    on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR

# Book
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
    (2nd ed.). Lawrence Erlbaum Associates.
```
|
||||||
|
|
||||||
|
#### Reference Rules
|
||||||
|
- **NO Wikipedia citations**
|
||||||
|
- Include variety: books, conferences, journal articles (not just URLs)
|
||||||
|
- All cited references must appear in reference list
|
||||||
|
- All references in list must be cited in text
|
||||||
|
- Order alphabetically by first author's surname
|
||||||
|
- Include DOI or URL when available
|
||||||
|
|
||||||
|
### Document Formatting Rules
|
||||||
|
|
||||||
|
#### Page Setup
|
||||||
|
| Element | Specification |
|
||||||
|
|---------|--------------|
|
||||||
|
| Page size | A4 |
|
||||||
|
| Left margin | 3.0 cm |
|
||||||
|
| Right margin | 2.0 cm |
|
||||||
|
| Top/Bottom margins | 2.5 cm |
|
||||||
|
| Header | Student name + TFE title |
|
||||||
|
| Footer | Page number |
|
||||||
|
|
||||||
|
#### Typography
|
||||||
|
| Element | Format |
|
||||||
|
|---------|--------|
|
||||||
|
| Body text | Calibri 12, justified, 1.5 line spacing, 6pt before/after |
|
||||||
|
| Título 1 | Calibri Light 18, blue, justified, 1.5 spacing |
|
||||||
|
| Título 2 | Calibri Light 14, blue, justified, 1.5 spacing |
|
||||||
|
| Título 3 | Calibri Light 12, justified, 1.5 spacing |
|
||||||
|
| Footnotes | Calibri 10, justified, single spacing |
|
||||||
|
| Code | Can reduce to 9pt if needed |
|
||||||
|
|
||||||
|
#### Tables and Figures (from plantilla_individual.pdf)
|
||||||
|
|
||||||
|
**Table format example:**
|
||||||
|
```
Tabla 1. Ejemplo de tabla con sus principales elementos.

[TABLE CONTENT]

Fuente: American Psychological Association, 2020a.
```
|
||||||
|
|
||||||
|
**Figure format example:**
|
||||||
|
```
Figura 1. Ejemplo de figura realizada para nuestro trabajo.

[FIGURE]

Fuente: American Psychological Association, 2020b.
```
|
||||||
|
|
||||||
|
**Rules:**
|
||||||
|
- **Title position**: Above the table/figure
|
||||||
|
- **Numbering format**: "**Tabla 1.**" / "**Figura 1.**" (Calibri 12, bold)
|
||||||
|
- **Title text**: Calibri 12, italic (after the number)
|
||||||
|
- **Source**: Below, centered, format "Fuente: Author, Year."
|
||||||
|
- Can reduce font to 9pt for dense tables
|
||||||
|
- Can use landscape orientation for large tables
|
||||||
|
- Tables should have horizontal lines only (no vertical lines) per APA style
|
||||||
|
|
||||||
|
### Writing Style Rules
|
||||||
|
|
||||||
|
#### MUST DO:
|
||||||
|
- Each chapter starts with introductory paragraph explaining content
|
||||||
|
- Each paragraph has at least 3 sentences
|
||||||
|
- Verify originality (cite all sources)
|
||||||
|
- Check spelling with Word corrector
|
||||||
|
- Ensure logical flow between paragraphs
|
||||||
|
- Define concepts and include pertinent citations
|
||||||
|
|
||||||
|
#### MUST NOT DO:
|
||||||
|
- Two consecutive headings without text between them
|
||||||
|
- Superfluous phrases and repetition of ideas
|
||||||
|
- Short paragraphs (less than 3 sentences)
|
||||||
|
- Missing figure/table numbers or titles
|
||||||
|
- Broken index generation
|
||||||
|
|
||||||
|
### Annexes Requirements
|
||||||
|
|
||||||
|
**Anexo A - Código fuente y datos:**
|
||||||
|
- Include repository URL where code is hosted
|
||||||
|
- Student must be sole author and owner of repository
|
||||||
|
- No commits from other users
|
||||||
|
- Data used should also be in repository
|
||||||
|
- If confidential (company project), justify why not shared
|
||||||
|
|
||||||
|
### Final Submission
|
||||||
|
|
||||||
|
- **Drafts**: Submit in Word format
|
||||||
|
- **Final deposit**: Submit in PDF format
|
||||||
|
- Verify all indices generate correctly before final submission
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Guidelines for Claude
|
||||||
|
|
||||||
|
### CRITICAL: Academic Rigor Requirements
|
||||||
|
|
||||||
|
**This is a Master's Thesis. Academic rigor is NON-NEGOTIABLE.**
|
||||||
|
|
||||||
|
#### DO NOT:
|
||||||
|
- **NEVER fabricate data or statistics** - Every number must come from an actual file in this repository
|
||||||
|
- **NEVER invent comparison results** - If we don't have data for EasyOCR or DocTR comparisons, don't make up numbers
|
||||||
|
- **NEVER assume or estimate values** - If a metric isn't in the CSV/notebook, don't include it
|
||||||
|
- **NEVER extrapolate beyond what the data shows** - 24 pages is a limited dataset, acknowledge this
|
||||||
|
- **NEVER claim results that weren't measured** - Only report what was actually computed
|
||||||
|
|
||||||
|
#### ALWAYS:
|
||||||
|
- **Read the source file first** before citing any result
|
||||||
|
- **Quote exact values** from CSV files (e.g., CER 0.011535 not "approximately 1%")
|
||||||
|
- **Reference the specific file and location** for every data point
|
||||||
|
- **Acknowledge limitations** explicitly (dataset size, CPU-only, single document type)
|
||||||
|
- **Distinguish between measured results and interpretations**
|
||||||
|
|
||||||
|
#### Data Sources (ONLY use these):
|
||||||
|
| Data Type | Source File |
|
||||||
|
|-----------|-------------|
|
||||||
|
| Ray Tune 64 trials | `src/raytune_paddle_subproc_results_20251207_192320.csv` |
|
||||||
|
| Experiment code | `src/paddle_ocr_fine_tune_unir_raytune.ipynb` |
|
||||||
|
| Final comparison | Output cells in the notebook (baseline vs optimized) |
|
||||||
|
|
||||||
|
#### Example of WRONG vs RIGHT:
|
||||||
|
|
||||||
|
**WRONG:** "EasyOCR achieved 8.5% CER while PaddleOCR achieved 5.2% CER"
|
||||||
|
(We don't have this comparison data in our results files)
|
||||||
|
|
||||||
|
**RIGHT:** "The optimization reduced CER from 7.78% to 1.49%, a reduction of 80.9% (source: final comparison in `paddle_ocr_fine_tune_unir_raytune.ipynb`)"
|
||||||
|
|
||||||
|
**WRONG:** "The optimization improved results by approximately 80%"
|
||||||
|
|
||||||
|
**RIGHT:** "From the 64 trials in `raytune_paddle_subproc_results_20251207_192320.csv`, minimum CER achieved was 1.15%"
|
||||||
|
|
||||||
|
### When Working on Documentation
|
||||||
|
|
||||||
|
1. **Read UNIR guidelines first**: Check `instructions/instrucciones.pdf` for structure requirements
|
||||||
|
|
||||||
|
2. **Follow chapter structure**: Each chapter has specific content requirements per UNIR guidelines
|
||||||
|
|
||||||
|
3. **References are UNIFIED**: All references go in `docs/06_referencias_bibliograficas.md`, NOT per-chapter
|
||||||
|
|
||||||
|
4. **Use APA format**: All citations must follow APA style
|
||||||
|
|
||||||
|
5. **Include "Fuentes de datos"**: Each chapter should list which repository files the data came from
|
||||||
|
|
||||||
|
6. **Language**: Documentation is in Spanish (thesis requirement), code comments in English
|
||||||
|
|
||||||
|
7. **Hardware context**: Remember this is CPU-only execution. Any suggestions about GPU training should acknowledge this limitation
|
||||||
|
|
||||||
|
8. **When in doubt, ask**: If the user requests data that doesn't exist, ask rather than inventing numbers
|
||||||
|
|
||||||
|
9. **DIAGRAMS MUST BE IN MERMAID FORMAT**: All diagrams, flowcharts, and visualizations in the documentation MUST use Mermaid syntax. This ensures:
|
||||||
|
- Version control friendly (text-based)
|
||||||
|
- Consistent styling across all chapters
|
||||||
|
- Easy to edit and maintain
|
||||||
|
- Renders properly in GitHub and most Markdown viewers
|
||||||
|
|
||||||
|
**Supported Mermaid diagram types:**
|
||||||
|
- `flowchart` / `graph` - For pipelines, workflows, architectures
|
||||||
|
- `xychart-beta` - For bar charts, comparisons
|
||||||
|
- `sequenceDiagram` - For process interactions
|
||||||
|
- `classDiagram` - For class structures
|
||||||
|
- `stateDiagram` - For state machines
|
||||||
|
- `pie` - For proportional data
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
```mermaid
flowchart LR
    A[Input] --> B[Process] --> C[Output]
```
|
||||||
|
|
||||||
|
### Common Tasks
|
||||||
|
|
||||||
|
- **Adding new experiments**: Update `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
|
||||||
|
- **Updating documentation**: Edit files in `docs/`
|
||||||
|
- **Adding references**: Add to `docs/06_referencias_bibliograficas.md` (unified list)
|
||||||
|
- **Dataset expansion**: Use `src/prepare_dataset.ipynb` as template
|
||||||
|
- **Running evaluations**: Use `src/paddle_ocr_tuning.py` CLI
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Experiment Details
|
||||||
|
|
||||||
|
### Ray Tune Configuration
|
||||||
|
```python
tuner = tune.Tuner(
    trainable_paddle_ocr,
    tune_config=tune.TuneConfig(
        metric="CER",
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,
        max_concurrent_trials=2
    )
)
```
|
||||||
|
|
||||||
|
### Dataset
|
||||||
|
- Source: UNIR TFE instructions PDF
|
||||||
|
- Pages: 24
|
||||||
|
- Resolution: 300 DPI
|
||||||
|
- Ground truth: Extracted via PyMuPDF
|
||||||
|
|
||||||
|
### Metrics
|
||||||
|
- CER (Character Error Rate) - Primary metric
|
||||||
|
- WER (Word Error Rate) - Secondary metric
|
||||||
|
- Calculated using the `jiwer` library (see the sketch below)
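A minimal sketch of how both metrics are computed with `jiwer` (illustrative strings, not project data):

```python
from jiwer import cer, wer

reference = "Texto de referencia extraído del PDF"
prediction = "Texto de referencia extraido del PDF"  # OCR output missing one accent

print(cer(reference, prediction))  # Character Error Rate
print(wer(reference, prediction))  # Word Error Rate
```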
25
docs/00_resumen.md
Normal file
@@ -0,0 +1,25 @@
|
|||||||
|
# Resumen
|
||||||
|
|
||||||
|
El presente Trabajo Fin de Máster aborda la optimización de sistemas de Reconocimiento Óptico de Caracteres (OCR) basados en inteligencia artificial para documentos en español, específicamente en un entorno con recursos computacionales limitados donde el fine-tuning de modelos no es viable. El objetivo principal es identificar la configuración óptima de hiperparámetros que maximice la precisión del reconocimiento de texto sin requerir entrenamiento adicional de los modelos.
|
||||||
|
|
||||||
|
Se realizó un estudio comparativo de tres soluciones OCR de código abierto: EasyOCR, PaddleOCR (PP-OCRv5) y DocTR, evaluando su rendimiento mediante las métricas estándar CER (Character Error Rate) y WER (Word Error Rate) sobre un corpus de documentos académicos en español. Tras identificar PaddleOCR como la solución más prometedora, se procedió a una optimización sistemática de hiperparámetros utilizando Ray Tune con el algoritmo de búsqueda Optuna, ejecutando 64 configuraciones diferentes.
|
||||||
|
|
||||||
|
Los resultados demuestran que la optimización de hiperparámetros logró una mejora significativa del rendimiento: el CER se redujo de 7.78% a 1.49% (mejora del 80.9% en reducción de errores), alcanzando una precisión de caracteres del 98.51%. El hallazgo más relevante fue que el parámetro `textline_orientation` (clasificación de orientación de línea de texto) tiene un impacto crítico, reduciendo el CER en un 69.7% cuando está habilitado. Adicionalmente, se identificó que el umbral de detección de píxeles (`text_det_thresh`) presenta una correlación negativa fuerte (-0.52) con el error, siendo el parámetro continuo más influyente.
|
||||||
|
|
||||||
|
Este trabajo demuestra que es posible obtener mejoras sustanciales en sistemas OCR mediante optimización de hiperparámetros, ofreciendo una alternativa práctica al fine-tuning cuando los recursos computacionales son limitados.
|
||||||
|
|
||||||
|
**Palabras clave:** OCR, Reconocimiento Óptico de Caracteres, PaddleOCR, Optimización de Hiperparámetros, Ray Tune, Procesamiento de Documentos, Inteligencia Artificial
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Abstract
|
||||||
|
|
||||||
|
This Master's Thesis addresses the optimization of Artificial Intelligence-based Optical Character Recognition (OCR) systems for Spanish documents, specifically in a resource-constrained environment where model fine-tuning is not feasible. The main objective is to identify the optimal hyperparameter configuration that maximizes text recognition accuracy without requiring additional model training.
|
||||||
|
|
||||||
|
A comparative study of three open-source OCR solutions was conducted: EasyOCR, PaddleOCR (PP-OCRv5), and DocTR, evaluating their performance using standard CER (Character Error Rate) and WER (Word Error Rate) metrics on a corpus of academic documents in Spanish. After identifying PaddleOCR as the most promising solution, systematic hyperparameter optimization was performed using Ray Tune with the Optuna search algorithm, executing 64 different configurations.
|
||||||
|
|
||||||
|
Results demonstrate that hyperparameter optimization achieved significant performance improvement: CER was reduced from 7.78% to 1.49% (80.9% error reduction), achieving 98.51% character accuracy. The most relevant finding was that the `textline_orientation` parameter (text line orientation classification) has a critical impact, reducing CER by 69.7% when enabled. Additionally, the pixel detection threshold (`text_det_thresh`) was found to have a strong negative correlation (-0.52) with error, being the most influential continuous parameter.
|
||||||
|
|
||||||
|
This work demonstrates that substantial improvements in OCR systems can be obtained through hyperparameter optimization, offering a practical alternative to fine-tuning when computational resources are limited.
|
||||||
|
|
||||||
|
**Keywords:** OCR, Optical Character Recognition, PaddleOCR, Hyperparameter Optimization, Ray Tune, Document Processing, Artificial Intelligence
|
||||||
51
docs/01_introduccion.md
Normal file
@@ -0,0 +1,51 @@
|
|||||||
|
# Introducción
|
||||||
|
|
||||||
|
Este capítulo presenta la motivación del trabajo, identificando el problema a resolver y justificando su relevancia. Se plantea la pregunta de investigación central y se describe la estructura del documento.
|
||||||
|
|
||||||
|
## Motivación
|
||||||
|
|
||||||
|
El Reconocimiento Óptico de Caracteres (OCR) es una tecnología fundamental en la era de la digitalización documental. Su capacidad para convertir imágenes de texto en datos editables y procesables ha transformado sectores como la administración pública, el ámbito legal, la banca y la educación. Sin embargo, a pesar de los avances significativos impulsados por el aprendizaje profundo, la implementación práctica de sistemas OCR de alta precisión sigue presentando desafíos considerables.
|
||||||
|
|
||||||
|
El procesamiento de documentos en español presenta particularidades que complican el reconocimiento automático de texto. Los caracteres especiales (ñ, acentos), las variaciones tipográficas en documentos académicos y administrativos, y la presencia de elementos gráficos como tablas, encabezados y marcas de agua generan errores que pueden propagarse en aplicaciones downstream como la extracción de entidades nombradas o el análisis semántico.
|
||||||
|
|
||||||
|
Los modelos OCR basados en redes neuronales profundas, como los empleados en PaddleOCR, EasyOCR o DocTR, ofrecen un rendimiento impresionante en benchmarks estándar. No obstante, su adaptación a dominios específicos típicamente requiere fine-tuning con datos etiquetados del dominio objetivo y recursos computacionales significativos (GPUs de alta capacidad). Esta barrera técnica y económica excluye a muchos investigadores y organizaciones de beneficiarse plenamente de estas tecnologías.
|
||||||
|
|
||||||
|
La presente investigación surge de una necesidad práctica: optimizar un sistema OCR para documentos académicos en español sin disponer de recursos GPU para realizar fine-tuning. Esta restricción, lejos de ser una limitación excepcional, representa la realidad de muchos entornos académicos y empresariales donde el acceso a infraestructura de cómputo avanzada es limitado.
|
||||||
|
|
||||||
|
## Planteamiento del trabajo
|
||||||
|
|
||||||
|
El problema central que aborda este trabajo puede formularse de la siguiente manera:
|
||||||
|
|
||||||
|
> ¿Es posible mejorar significativamente el rendimiento de modelos OCR preentrenados para documentos en español mediante la optimización sistemática de hiperparámetros, sin requerir fine-tuning ni recursos GPU?
|
||||||
|
|
||||||
|
Este planteamiento se descompone en las siguientes cuestiones específicas:
|
||||||
|
|
||||||
|
1. **Selección de modelo base**: ¿Cuál de las soluciones OCR de código abierto disponibles (EasyOCR, PaddleOCR, DocTR) ofrece el mejor rendimiento base para documentos en español?
|
||||||
|
|
||||||
|
2. **Impacto de hiperparámetros**: ¿Qué hiperparámetros del pipeline OCR tienen mayor influencia en las métricas de error (CER, WER)?
|
||||||
|
|
||||||
|
3. **Optimización automatizada**: ¿Puede un proceso de búsqueda automatizada de hiperparámetros (mediante Ray Tune/Optuna) encontrar configuraciones que superen significativamente los valores por defecto?
|
||||||
|
|
||||||
|
4. **Viabilidad práctica**: ¿Son los tiempos de inferencia y los recursos requeridos compatibles con un despliegue en entornos con recursos limitados?
|
||||||
|
|
||||||
|
La relevancia de este problema radica en su aplicabilidad inmediata. Una metodología reproducible para optimizar OCR sin fine-tuning beneficiaría a:
|
||||||
|
|
||||||
|
- Investigadores que procesan grandes volúmenes de documentos académicos
|
||||||
|
- Instituciones educativas que digitalizan archivos históricos
|
||||||
|
- Pequeñas y medianas empresas que automatizan flujos documentales
|
||||||
|
- Desarrolladores que integran OCR en aplicaciones con restricciones de recursos
|
||||||
|
|
||||||
|
## Estructura del trabajo
|
||||||
|
|
||||||
|
El presente documento se organiza en los siguientes capítulos:
|
||||||
|
|
||||||
|
**Capítulo 2 - Contexto y Estado del Arte**: Se presenta una revisión de las tecnologías OCR basadas en aprendizaje profundo, incluyendo las arquitecturas de detección y reconocimiento de texto, así como los trabajos previos en optimización de estos sistemas.
|
||||||
|
|
||||||
|
**Capítulo 3 - Objetivos y Metodología**: Se definen los objetivos SMART del trabajo y se describe la metodología experimental seguida, incluyendo la preparación del dataset, las métricas de evaluación y el proceso de optimización con Ray Tune.
|
||||||
|
|
||||||
|
**Capítulo 4 - Desarrollo Específico de la Contribución**: Este capítulo presenta el desarrollo completo del estudio comparativo y la optimización de hiperparámetros de sistemas OCR, estructurado en tres secciones: (4.1) planteamiento de la comparativa con la evaluación de EasyOCR, PaddleOCR y DocTR; (4.2) desarrollo de la comparativa con la optimización de hiperparámetros mediante Ray Tune; y (4.3) discusión y análisis de resultados.
|
||||||
|
|
||||||
|
**Capítulo 5 - Conclusiones y Trabajo Futuro**: Se resumen las contribuciones del trabajo, se discute el grado de cumplimiento de los objetivos y se proponen líneas de trabajo futuro.
|
||||||
|
|
||||||
|
**Anexos**: Se incluye el enlace al repositorio de código fuente y datos, así como tablas completas de resultados experimentales.
|
||||||
|
|
||||||
218
docs/02_contexto_estado_arte.md
Normal file
@@ -0,0 +1,218 @@
|
|||||||
|
# Contexto y estado del arte
|
||||||
|
|
||||||
|
Este capítulo presenta el marco teórico y tecnológico en el que se desarrolla el presente trabajo. Se revisan los fundamentos del Reconocimiento Óptico de Caracteres (OCR), la evolución de las técnicas basadas en aprendizaje profundo, las principales soluciones de código abierto disponibles y los trabajos previos relacionados con la optimización de sistemas OCR.
|
||||||
|
|
||||||
|
## Contexto del problema
|
||||||
|
|
||||||
|
### Definición y Evolución Histórica del OCR
|
||||||
|
|
||||||
|
El Reconocimiento Óptico de Caracteres (OCR) es el proceso de conversión de imágenes de texto manuscrito, mecanografiado o impreso en texto codificado digitalmente. La tecnología OCR ha evolucionado significativamente desde sus orígenes en la década de 1950:
|
||||||
|
|
||||||
|
- **Primera generación (1950-1970)**: Sistemas basados en plantillas que requerían fuentes específicas.
|
||||||
|
- **Segunda generación (1970-1990)**: Introducción de técnicas de extracción de características y clasificadores estadísticos.
|
||||||
|
- **Tercera generación (1990-2010)**: Modelos basados en Redes Neuronales Artificiales y Modelos Ocultos de Markov (HMM).
|
||||||
|
- **Cuarta generación (2010-presente)**: Arquitecturas de aprendizaje profundo que dominan el estado del arte.
|
||||||
|
|
||||||
|
### Pipeline Moderno de OCR
|
||||||
|
|
||||||
|
Los sistemas OCR modernos siguen típicamente un pipeline de dos etapas:
|
||||||
|
|
||||||
|
```mermaid
---
title: "Pipeline de un sistema OCR moderno"
---
flowchart LR
    subgraph Input
        A["Imagen de<br/>documento"]
    end

    subgraph "Etapa 1: Detección"
        B["Text Detection<br/>(DB, EAST, CRAFT)"]
    end

    subgraph "Etapa 2: Reconocimiento"
        C["Text Recognition<br/>(CRNN, SVTR, Transformer)"]
    end

    subgraph Output
        D["Texto<br/>extraído"]
    end

    A --> B
    B -->|"Regiones de texto<br/>(bounding boxes)"| C
    C --> D

    style A fill:#e1f5fe
    style D fill:#c8e6c9
```
|
||||||
|
|
||||||
|
1. **Detección de texto (Text Detection)**: Localización de regiones que contienen texto en la imagen. Las arquitecturas más utilizadas incluyen:
|
||||||
|
- EAST (Efficient and Accurate Scene Text Detector)
|
||||||
|
- CRAFT (Character Region Awareness for Text Detection)
|
||||||
|
- DB (Differentiable Binarization)
|
||||||
|
|
||||||
|
2. **Reconocimiento de texto (Text Recognition)**: Transcripción del contenido textual de las regiones detectadas. Las arquitecturas predominantes son:
|
||||||
|
- CRNN (Convolutional Recurrent Neural Network) con CTC loss
|
||||||
|
- Arquitecturas encoder-decoder con atención
|
||||||
|
- Transformers (ViTSTR, TrOCR)
|
||||||
|
|
||||||
|
### Métricas de Evaluación
|
||||||
|
|
||||||
|
Las métricas estándar para evaluar sistemas OCR son:
|
||||||
|
|
||||||
|
**Character Error Rate (CER)**: Se calcula como CER = (S + D + I) / N, donde S = sustituciones, D = eliminaciones, I = inserciones, N = caracteres de referencia.
|
||||||
|
|
||||||
|
**Word Error Rate (WER)**: Se calcula de forma análoga pero a nivel de palabras en lugar de caracteres.
|
||||||
|
|
||||||
|
Un CER del 1% significa que 1 de cada 100 caracteres es erróneo. Para aplicaciones críticas como extracción de datos financieros o médicos, se requieren CER inferiores al 1%.
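A modo de ejemplo ilustrativo (con cadenas inventadas, no con datos del proyecto), ambas métricas pueden calcularse con la biblioteca `jiwer`, empleada en este trabajo:

```python
from jiwer import cer, wer

referencia = "La optimización de hiperparámetros"
prediccion = "La optimizacion de hiperparametros"  # two missing accents (character-level substitutions)

print(cer(referencia, prediccion))  # proportion of character errors
print(wer(referencia, prediccion))  # proportion of word errors (2 of 4 words differ)
```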
### Particularidades del OCR para el Idioma Español
|
||||||
|
|
||||||
|
El español presenta características específicas que impactan el OCR:
|
||||||
|
|
||||||
|
- **Caracteres especiales**: ñ, á, é, í, ó, ú, ü, ¿, ¡
|
||||||
|
- **Diacríticos**: Los acentos pueden confundirse con ruido o artefactos
|
||||||
|
- **Longitud de palabras**: Palabras generalmente más largas que en inglés
|
||||||
|
- **Puntuación**: Signos de interrogación y exclamación invertidos
|
||||||
|
|
||||||
|
## Estado del arte
|
||||||
|
|
||||||
|
### Soluciones OCR de Código Abierto
|
||||||
|
|
||||||
|
#### EasyOCR
|
||||||
|
|
||||||
|
EasyOCR es una biblioteca de OCR desarrollada por Jaided AI (2020) que soporta más de 80 idiomas. Sus características principales incluyen:
|
||||||
|
|
||||||
|
- **Arquitectura**: Detector CRAFT + Reconocedor CRNN/Transformer
|
||||||
|
- **Fortalezas**: Facilidad de uso, soporte multilingüe amplio, bajo consumo de memoria
|
||||||
|
- **Limitaciones**: Menor precisión en documentos complejos, opciones de configuración limitadas
|
||||||
|
- **Caso de uso ideal**: Prototipado rápido y aplicaciones con restricciones de memoria
|
||||||
|
|
||||||
|
#### PaddleOCR
|
||||||
|
|
||||||
|
PaddleOCR es el sistema OCR desarrollado por Baidu como parte del ecosistema PaddlePaddle (2024). La versión PP-OCRv5, utilizada en este trabajo, representa el estado del arte en OCR industrial:
|
||||||
|
|
||||||
|
- **Arquitectura**:
|
||||||
|
- Detector: DB (Differentiable Binarization) con backbone ResNet (Liao et al., 2020)
|
||||||
|
- Reconocedor: SVTR (Scene-Text Visual Transformer Recognition)
|
||||||
|
- Clasificador de orientación opcional
|
||||||
|
|
||||||
|
- **Hiperparámetros configurables**:
|
||||||
|
|
||||||
|
**Tabla 1.** *Hiperparámetros configurables de PaddleOCR.*
|
||||||
|
|
||||||
|
| Parámetro | Descripción | Valor por defecto |
|-----------|-------------|-------------------|
| `text_det_thresh` | Umbral de detección de píxeles | 0.3 |
| `text_det_box_thresh` | Umbral de caja de detección | 0.6 |
| `text_det_unclip_ratio` | Coeficiente de expansión | 1.5 |
| `text_rec_score_thresh` | Umbral de confianza de reconocimiento | 0.5 |
| `use_textline_orientation` | Clasificación de orientación | False |
| `use_doc_orientation_classify` | Clasificación de orientación de documento | False |
| `use_doc_unwarping` | Corrección de deformación | False |

*Fuente: Documentación oficial de PaddleOCR (PaddlePaddle, 2024).*
- **Fortalezas**: Alta precisión, pipeline altamente configurable, modelos específicos para servidor
|
||||||
|
- **Limitaciones**: Mayor complejidad de configuración, dependencia del framework PaddlePaddle
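Como referencia orientativa, estos hiperparámetros se fijan al instanciar el pipeline; los nombres proceden de la Tabla 1, aunque la firma exacta del constructor puede variar entre versiones de PaddleOCR:

```python
from paddleocr import PaddleOCR

# Illustrative sketch: inference hyperparameters from Tabla 1, set at construction time
ocr = PaddleOCR(
    text_det_thresh=0.3,
    text_det_box_thresh=0.6,
    text_det_unclip_ratio=1.5,
    text_rec_score_thresh=0.5,
    use_textline_orientation=False,
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
)
```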
#### DocTR
|
||||||
|
|
||||||
|
DocTR (Document Text Recognition) es una biblioteca desarrollada por Mindee (2021) orientada a la investigación:
|
||||||
|
|
||||||
|
- **Arquitectura**:
|
||||||
|
- Detectores: DB, LinkNet
|
||||||
|
- Reconocedores: CRNN, SAR, ViTSTR
|
||||||
|
|
||||||
|
- **Fortalezas**: API limpia, orientación académica, salida estructurada de alto nivel
|
||||||
|
- **Limitaciones**: Menor rendimiento en español comparado con PaddleOCR
|
||||||
|
|
||||||
|
#### Comparativa de Arquitecturas
|
||||||
|
|
||||||
|
**Tabla 2.** *Comparativa de soluciones OCR de código abierto.*
|
||||||
|
|
||||||
|
| Modelo | Tipo | Componentes | Fortalezas Clave |
|
||||||
|
|--------|------|-------------|------------------|
|
||||||
|
| **EasyOCR** | End-to-end (det + rec) | CRAFT + CRNN/Transformer | Ligero, fácil de usar, multilingüe |
|
||||||
|
| **PaddleOCR** | End-to-end (det + rec + cls) | DB + SVTR/CRNN | Soporte multilingüe robusto, configurable |
|
||||||
|
| **DocTR** | End-to-end (det + rec) | DB/LinkNet + CRNN/SAR/ViTSTR | Orientado a investigación, API limpia |
|
||||||
|
|
||||||
|
*Fuente: Documentación oficial de cada herramienta (JaidedAI, 2020; PaddlePaddle, 2024; Mindee, 2021).*
|
||||||
|
|
||||||
|
### Optimización de Hiperparámetros
|
||||||
|
|
||||||
|
#### Fundamentos
|
||||||
|
|
||||||
|
La optimización de hiperparámetros (HPO) busca encontrar la configuración de parámetros que maximiza (o minimiza) una métrica objetivo (Feurer & Hutter, 2019). A diferencia de los parámetros del modelo (pesos), los hiperparámetros no se aprenden durante el entrenamiento.
|
||||||
|
|
||||||
|
Los métodos de HPO incluyen:
|
||||||
|
- **Grid Search**: Búsqueda exhaustiva en una rejilla predefinida
|
||||||
|
- **Random Search**: Muestreo aleatorio del espacio de búsqueda (Bergstra & Bengio, 2012)
|
||||||
|
- **Bayesian Optimization**: Modelado probabilístico de la función objetivo (Bergstra et al., 2011)
|
||||||
|
- **Algoritmos evolutivos**: Optimización inspirada en evolución biológica
|
||||||
|
|
||||||
|
#### Ray Tune y Optuna
|
||||||
|
|
||||||
|
**Ray Tune** es un framework de optimización de hiperparámetros escalable (Liaw et al., 2018) que permite:
|
||||||
|
- Ejecución paralela de experimentos
|
||||||
|
- Early stopping de configuraciones poco prometedoras
|
||||||
|
- Integración con múltiples algoritmos de búsqueda
|
||||||
|
|
||||||
|
**Optuna** es una biblioteca de optimización bayesiana (Akiba et al., 2019) que implementa:
|
||||||
|
- Tree-structured Parzen Estimator (TPE)
|
||||||
|
- Pruning de trials no prometedores
|
||||||
|
- Visualización de resultados
|
||||||
|
|
||||||
|
La combinación Ray Tune + Optuna permite búsquedas eficientes en espacios de alta dimensionalidad.
|
||||||
|
|
||||||
|
```mermaid
---
title: "Ciclo de optimización con Ray Tune y Optuna"
---
flowchart LR
    A["Espacio de<br/>búsqueda"] --> B["Ray Tune<br/>Scheduler"]
    B --> C["Trials<br/>paralelos"]
    C --> D["Evaluación<br/>OCR"]
    D --> E["Métricas<br/>CER/WER"]
    E --> F["Optuna<br/>TPE"]
    F -->|"Nueva config"| B
```
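Como boceto mínimo del funcionamiento de Optuna de forma aislada (función objetivo de juguete, no la evaluación OCR real), una búsqueda con TPE se plantea así:

```python
import optuna

def objective(trial):
    # Toy objective: TPE proposes values of x and tries to minimize the returned score
    x = trial.suggest_float("x", 0.0, 0.7)
    return (x - 0.3) ** 2

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=20)
print(study.best_params)
```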
#### HPO en Sistemas OCR
|
||||||
|
|
||||||
|
La aplicación de HPO a sistemas OCR ha sido explorada principalmente en el contexto de:
|
||||||
|
|
||||||
|
1. **Preprocesamiento de imagen**: Optimización de parámetros de binarización, filtrado y escalado (Liang et al., 2005)
|
||||||
|
|
||||||
|
2. **Arquitecturas de detección**: Ajuste de umbrales de confianza y NMS (Non-Maximum Suppression)
|
||||||
|
|
||||||
|
3. **Post-procesamiento**: Optimización de corrección ortográfica y modelos de lenguaje
|
||||||
|
|
||||||
|
Sin embargo, existe un vacío en la literatura respecto a la optimización sistemática de los hiperparámetros de inferencia en pipelines OCR modernos como PaddleOCR, especialmente para idiomas diferentes del inglés y chino.
|
||||||
|
|
||||||
|
### Datasets y Benchmarks para Español
|
||||||
|
|
||||||
|
Los principales recursos para evaluación de OCR en español incluyen:
|
||||||
|
|
||||||
|
- **FUNSD-ES**: Versión en español del dataset de formularios
|
||||||
|
- **MLT (ICDAR)**: Multi-Language Text dataset con muestras en español
|
||||||
|
- **Documentos académicos**: Utilizados en este trabajo (instrucciones TFE de UNIR)
|
||||||
|
|
||||||
|
Los trabajos previos en OCR para español se han centrado principalmente en:
|
||||||
|
|
||||||
|
1. Digitalización de archivos históricos (manuscritos coloniales)
|
||||||
|
2. Procesamiento de documentos de identidad
|
||||||
|
3. Reconocimiento de texto en escenas naturales
|
||||||
|
|
||||||
|
La optimización de hiperparámetros para documentos académicos en español representa una contribución original de este trabajo.
|
||||||
|
|
||||||
|
## Conclusiones del capítulo
|
||||||
|
|
||||||
|
Este capítulo ha presentado:
|
||||||
|
|
||||||
|
1. Los fundamentos del OCR moderno y su pipeline de detección-reconocimiento
|
||||||
|
2. Las tres principales soluciones de código abierto: EasyOCR, PaddleOCR y DocTR
|
||||||
|
3. Los métodos de optimización de hiperparámetros, con énfasis en Ray Tune y Optuna
|
||||||
|
4. Las particularidades del OCR para el idioma español
|
||||||
|
|
||||||
|
El estado del arte revela que, si bien existen soluciones OCR de alta calidad, su optimización para dominios específicos mediante ajuste de hiperparámetros (sin fine-tuning) ha recibido poca atención. Este trabajo contribuye a llenar ese vacío proponiendo una metodología reproducible para la optimización de PaddleOCR en documentos académicos en español.
|
||||||
277
docs/03_objetivos_metodologia.md
Normal file
@@ -0,0 +1,277 @@
|
|||||||
|
# Objetivos concretos y metodología de trabajo
|
||||||
|
|
||||||
|
Este capítulo establece los objetivos del trabajo siguiendo la metodología SMART (Doran, 1981) y describe la metodología experimental empleada para alcanzarlos. Se define un objetivo general y cinco objetivos específicos, todos ellos medibles y verificables.
|
||||||
|
|
||||||
|
## Objetivo general
|
||||||
|
|
||||||
|
> **Optimizar el rendimiento de PaddleOCR para documentos académicos en español mediante ajuste de hiperparámetros, alcanzando un CER inferior al 2% sin requerir fine-tuning del modelo ni recursos GPU dedicados.**
|
||||||
|
|
||||||
|
### Justificación SMART del Objetivo General
|
||||||
|
|
||||||
|
| Criterio | Cumplimiento |
|
||||||
|
|----------|--------------|
|
||||||
|
| **Específico (S)** | Se define claramente qué se quiere lograr: optimizar PaddleOCR mediante ajuste de hiperparámetros para documentos en español |
|
||||||
|
| **Medible (M)** | Se establece una métrica cuantificable: CER < 2% |
|
||||||
|
| **Alcanzable (A)** | Es viable dado que: (1) PaddleOCR permite configuración de hiperparámetros, (2) Ray Tune posibilita búsqueda automatizada, (3) No se requiere GPU |
|
||||||
|
| **Relevante (R)** | El impacto es demostrable: mejora la extracción de texto en documentos académicos sin costes adicionales de infraestructura |
|
||||||
|
| **Temporal (T)** | El plazo es un cuatrimestre, correspondiente al TFM |
|
||||||
|
|
||||||
|
## Objetivos específicos
|
||||||
|
|
||||||
|
### OE1: Comparar soluciones OCR de código abierto
|
||||||
|
> **Evaluar el rendimiento base de EasyOCR, PaddleOCR y DocTR en documentos académicos en español, utilizando CER y WER como métricas, para seleccionar el modelo más prometedor.**
|
||||||
|
|
||||||
|
### OE2: Preparar un dataset de evaluación
|
||||||
|
> **Construir un dataset estructurado de imágenes de documentos académicos en español con su texto de referencia (ground truth) extraído del PDF original.**
|
||||||
|
|
||||||
|
### OE3: Identificar hiperparámetros críticos
|
||||||
|
> **Analizar la correlación entre los hiperparámetros de PaddleOCR y las métricas de error para identificar los parámetros con mayor impacto en el rendimiento.**
|
||||||
|
|
||||||
|
### OE4: Optimizar hiperparámetros con Ray Tune
|
||||||
|
> **Ejecutar una búsqueda automatizada de hiperparámetros utilizando Ray Tune con Optuna, evaluando al menos 50 configuraciones diferentes.**
|
||||||
|
|
||||||
|
### OE5: Validar la configuración optimizada
|
||||||
|
> **Comparar el rendimiento de la configuración baseline versus la configuración optimizada sobre el dataset completo, documentando la mejora obtenida.**
|
||||||
|
|
||||||
|
## Metodología del trabajo
|
||||||
|
|
||||||
|
### Visión General
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
```mermaid
---
title: "Fases de la metodología experimental"
---
flowchart LR
    A["Fase 1<br/>Dataset"] --> B["Fase 2<br/>Benchmark"] --> C["Fase 3<br/>Espacio"] --> D["Fase 4<br/>Optimización"] --> E["Fase 5<br/>Validación"]
```
|
||||||
|
|
||||||
|
**Descripción de las fases:**
|
||||||
|
|
||||||
|
- **Fase 1 - Preparación del Dataset**: Conversión PDF a imágenes (300 DPI), extracción de ground truth con PyMuPDF
|
||||||
|
- **Fase 2 - Benchmark Comparativo**: Evaluación de EasyOCR, PaddleOCR, DocTR con métricas CER/WER
|
||||||
|
- **Fase 3 - Espacio de Búsqueda**: Identificación de hiperparámetros y configuración de Ray Tune + Optuna
|
||||||
|
- **Fase 4 - Optimización**: Ejecución de 64 trials con paralelización (2 concurrentes)
|
||||||
|
- **Fase 5 - Validación**: Comparación baseline vs optimizado, análisis de correlaciones
|
||||||
|
|
||||||
|
### Fase 1: Preparación del Dataset
|
||||||
|
|
||||||
|
#### Fuente de Datos
|
||||||
|
Se utilizaron documentos PDF académicos de UNIR (Universidad Internacional de La Rioja), específicamente las instrucciones para la elaboración del TFE del Máster en Inteligencia Artificial.
|
||||||
|
|
||||||
|
#### Proceso de Conversión
|
||||||
|
El script `prepare_dataset.ipynb` implementa los siguientes pasos (ilustrados con un boceto orientativo al final de este apartado):
|
||||||
|
|
||||||
|
1. **Conversión PDF a imágenes**:
|
||||||
|
- Biblioteca: PyMuPDF (fitz)
|
||||||
|
- Resolución: 300 DPI
|
||||||
|
- Formato de salida: PNG
|
||||||
|
|
||||||
|
2. **Extracción de texto de referencia**:
|
||||||
|
- Método: `page.get_text("dict")` de PyMuPDF
|
||||||
|
- Preservación de estructura de líneas
|
||||||
|
- Tratamiento de texto vertical/marginal
|
||||||
|
- Normalización de espacios y saltos de línea
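El siguiente fragmento es un boceto orientativo de ambos pasos con PyMuPDF; la implementación real del cuaderno emplea `page.get_text("dict")` con un tratamiento adicional de líneas, y aquí se muestra la variante más simple:

```python
import fitz  # PyMuPDF

doc = fitz.open("dataset/0/instrucciones.pdf")
for i, page in enumerate(doc, start=1):
    # 1. Render the page at 300 DPI (72 DPI is the PDF default) and save it as PNG
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
    pix.save(f"dataset/0/img/page_{i:04d}.png")

    # 2. Extract the reference text (ground truth) for the same page
    with open(f"dataset/0/txt/page_{i:04d}.txt", "w", encoding="utf-8") as f:
        f.write(page.get_text("text"))
```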
#### Estructura del Dataset
|
||||||
|
|
||||||
|
```mermaid
---
title: "Estructura del dataset de evaluación"
---
flowchart LR
    dataset["dataset/"] --> d0["0/"]

    d0 --> pdf["instrucciones.pdf"]

    d0 --> img["img/"]
    img --> img1["page_0001.png"]
    img --> img2["page_0002.png"]
    img --> imgN["..."]

    d0 --> txt["txt/"]
    txt --> txt1["page_0001.txt"]
    txt --> txt2["page_0002.txt"]
    txt --> txtN["..."]

    dataset --> dots["..."]
```
|
||||||
|
|
||||||
|
#### Clase ImageTextDataset
|
||||||
|
|
||||||
|
Se implementó una clase Python para cargar pares imagen-texto:
|
||||||
|
|
||||||
|
```python
from pathlib import Path
from PIL import Image

class ImageTextDataset:
    def __init__(self, root):
        # Load paired (image, text) file paths from the img/ and txt/ subfolders
        self.images = sorted(Path(root).glob("*/img/*.png"))
        self.texts = [p.parent.parent / "txt" / f"{p.stem}.txt" for p in self.images]

    def __getitem__(self, idx):
        # Returns (PIL.Image, str)
        return Image.open(self.images[idx]), self.texts[idx].read_text(encoding="utf-8")
```
|
||||||
|
|
||||||
|
### Fase 2: Benchmark Comparativo
|
||||||
|
|
||||||
|
#### Modelos Evaluados
|
||||||
|
|
||||||
|
| Modelo | Versión | Configuración |
|
||||||
|
|--------|---------|---------------|
|
||||||
|
| EasyOCR | - | Idiomas: ['es', 'en'] |
|
||||||
|
| PaddleOCR | PP-OCRv5 | Modelos server_det + server_rec |
|
||||||
|
| DocTR | - | db_resnet50 + sar_resnet31 |
|
||||||
|
|
||||||
|
#### Métricas de Evaluación
|
||||||
|
|
||||||
|
Se utilizó la biblioteca `jiwer` para calcular:
|
||||||
|
|
||||||
|
```python
from jiwer import wer, cer

def evaluate_text(reference, prediction):
    return {
        'WER': wer(reference, prediction),
        'CER': cer(reference, prediction)
    }
```
|
||||||
|
|
||||||
|
### Fase 3: Espacio de Búsqueda
|
||||||
|
|
||||||
|
#### Hiperparámetros Seleccionados
|
||||||
|
|
||||||
|
| Parámetro | Tipo | Rango/Valores | Descripción |
|-----------|------|---------------|-------------|
| `use_doc_orientation_classify` | Booleano | [True, False] | Clasificación de orientación del documento |
| `use_doc_unwarping` | Booleano | [True, False] | Corrección de deformación del documento |
| `textline_orientation` | Booleano | [True, False] | Clasificación de orientación de línea de texto |
| `text_det_thresh` | Continuo | [0.0, 0.7] | Umbral de detección de píxeles de texto |
| `text_det_box_thresh` | Continuo | [0.0, 0.7] | Umbral de caja de detección |
| `text_det_unclip_ratio` | Fijo | 0.0 | Coeficiente de expansión (fijado) |
| `text_rec_score_thresh` | Continuo | [0.0, 0.7] | Umbral de confianza de reconocimiento |

#### Configuración de Ray Tune
|
||||||
|
|
||||||
|
```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch

search_space = {
    "use_doc_orientation_classify": tune.choice([True, False]),
    "use_doc_unwarping": tune.choice([True, False]),
    "textline_orientation": tune.choice([True, False]),
    "text_det_thresh": tune.uniform(0.0, 0.7),
    "text_det_box_thresh": tune.uniform(0.0, 0.7),
    "text_det_unclip_ratio": tune.choice([0.0]),
    "text_rec_score_thresh": tune.uniform(0.0, 0.7),
}

tuner = tune.Tuner(
    trainable_paddle_ocr,
    param_space=search_space,  # search space defined above
    tune_config=tune.TuneConfig(
        metric="CER",
        mode="min",
        search_alg=OptunaSearch(),
        num_samples=64,
        max_concurrent_trials=2
    )
)
```
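De forma orientativa, la búsqueda se lanza con `tuner.fit()` y el mejor resultado puede recuperarse del `ResultGrid` devuelto:

```python
results = tuner.fit()  # runs the 64 trials defined above

best = results.get_best_result(metric="CER", mode="min")
print(best.config)   # best hyperparameter configuration found
print(best.metrics)  # metrics reported by that trial (CER, WER, ...)
```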
### Fase 4: Ejecución de Optimización
|
||||||
|
|
||||||
|
#### Arquitectura de Ejecución
|
||||||
|
|
||||||
|
Debido a incompatibilidades entre Ray y PaddleOCR en el mismo proceso, se implementó una arquitectura basada en subprocesos:
|
||||||
|
|
||||||
|
```mermaid
---
title: "Arquitectura de ejecución con subprocesos"
---
flowchart LR
    A["Ray Tune (proceso principal)"]

    A --> B["Subprocess 1: paddle_ocr_tuning.py --config"]
    B --> B_out["Retorna JSON con métricas"]

    A --> C["Subprocess 2: paddle_ocr_tuning.py --config"]
    C --> C_out["Retorna JSON con métricas"]
```
|
||||||
|
|
||||||
|
#### Script de Evaluación (paddle_ocr_tuning.py)
|
||||||
|
|
||||||
|
El script recibe hiperparámetros por línea de comandos:
|
||||||
|
|
||||||
|
```bash
python paddle_ocr_tuning.py \
    --pdf-folder ./dataset \
    --textline-orientation True \
    --text-det-box-thresh 0.5 \
    --text-det-thresh 0.4 \
    --text-rec-score-thresh 0.6
```
|
||||||
|
|
||||||
|
Y retorna métricas en formato JSON:
|
||||||
|
|
||||||
|
```json
{
    "CER": 0.0125,
    "WER": 0.1040,
    "TIME": 331.09,
    "PAGES": 5,
    "TIME_PER_PAGE": 66.12
}
```
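A modo ilustrativo, y asumiendo los argumentos de línea de comandos y el JSON mostrados arriba, la función evaluable (`trainable_paddle_ocr`) puede envolver la llamada al script en un subproceso de forma similar a este boceto (no es el código exacto del repositorio):

```python
import json
import subprocess
import sys

def trainable_paddle_ocr(config):
    # Launch paddle_ocr_tuning.py in a separate process so that PaddleOCR
    # never shares the Ray worker process (avoids the incompatibility above)
    cmd = [
        sys.executable, "paddle_ocr_tuning.py",
        "--pdf-folder", "./dataset",
        "--textline-orientation", str(config["textline_orientation"]),
        "--text-det-thresh", str(config["text_det_thresh"]),
        "--text-det-box-thresh", str(config["text_det_box_thresh"]),
        "--text-rec-score-thresh", str(config["text_rec_score_thresh"]),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    metrics = json.loads(out.stdout)  # expects {"CER": ..., "WER": ..., ...}
    return metrics  # Ray Tune records the returned dict as the trial's result
```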
### Fase 5: Validación
|
||||||
|
|
||||||
|
#### Protocolo de Validación
|
||||||
|
|
||||||
|
1. **Baseline**: Ejecución con configuración por defecto de PaddleOCR
|
||||||
|
2. **Optimizado**: Ejecución con mejor configuración encontrada
|
||||||
|
3. **Comparación**: Evaluación sobre las 24 páginas del dataset completo
|
||||||
|
4. **Métricas reportadas**: CER, WER, tiempo de procesamiento
|
||||||
|
|
||||||
|
### Entorno de Ejecución
|
||||||
|
|
||||||
|
#### Hardware
|
||||||
|
|
||||||
|
| Componente | Especificación |
|
||||||
|
|------------|----------------|
|
||||||
|
| CPU | Intel Core (especificar modelo) |
|
||||||
|
| RAM | 16 GB |
|
||||||
|
| GPU | No disponible (ejecución en CPU) |
|
||||||
|
| Almacenamiento | SSD |
|
||||||
|
|
||||||
|
#### Software
|
||||||
|
|
||||||
|
| Componente | Versión |
|
||||||
|
|------------|---------|
|
||||||
|
| Sistema Operativo | Windows 10/11 |
|
||||||
|
| Python | 3.11.9 |
|
||||||
|
| PaddleOCR | 3.3.2 |
|
||||||
|
| PaddlePaddle | 3.2.2 |
|
||||||
|
| Ray | 2.52.1 |
|
||||||
|
| Optuna | 4.6.0 |
|
||||||
|
|
||||||
|
### Limitaciones Metodológicas
|
||||||
|
|
||||||
|
1. **Tamaño del dataset**: El dataset contiene 24 páginas de un único tipo de documento. Los resultados pueden no generalizarse a otros formatos.
|
||||||
|
|
||||||
|
2. **Ejecución en CPU**: Los tiempos de procesamiento (~70s/página) serían significativamente menores con GPU.
|
||||||
|
|
||||||
|
3. **Ground truth imperfecto**: El texto de referencia extraído de PDF puede contener errores en documentos con layouts complejos.
|
||||||
|
|
||||||
|
4. **Parámetro fijo**: `text_det_unclip_ratio` quedó fijado en 0.0 durante todo el experimento por decisión de diseño inicial.
|
||||||
|
|
||||||
|
## Resumen del capítulo
|
||||||
|
|
||||||
|
Este capítulo ha establecido:
|
||||||
|
|
||||||
|
1. Un objetivo general SMART: alcanzar CER < 2% mediante optimización de hiperparámetros
|
||||||
|
2. Cinco objetivos específicos medibles y alcanzables
|
||||||
|
3. Una metodología experimental en cinco fases claramente definidas
|
||||||
|
4. El espacio de búsqueda de hiperparámetros y la configuración de Ray Tune
|
||||||
|
5. Las limitaciones reconocidas del enfoque
|
||||||
|
|
||||||
|
El siguiente capítulo presenta el desarrollo específico de la contribución, incluyendo el benchmark comparativo de soluciones OCR, la optimización de hiperparámetros y el análisis de resultados.
|
||||||
|
|
||||||
566
docs/04_desarrollo_especifico.md
Normal file
@@ -0,0 +1,566 @@
|
|||||||
|
# Desarrollo específico de la contribución
|
||||||
|
|
||||||
|
Este capítulo presenta el desarrollo completo del estudio comparativo y la optimización de hiperparámetros de sistemas OCR. Se estructura según el tipo de trabajo "Comparativa de soluciones" establecido por las instrucciones de UNIR: planteamiento de la comparativa, desarrollo de la comparativa, y discusión y análisis de resultados.
|
||||||
|
|
||||||
|
## Planteamiento de la comparativa
|
||||||
|
|
||||||
|
### Introducción
|
||||||
|
|
||||||
|
Esta sección presenta los resultados del estudio comparativo realizado entre tres soluciones OCR de código abierto: EasyOCR, PaddleOCR y DocTR. Los experimentos fueron documentados en el notebook `ocr_benchmark_notebook.ipynb` del repositorio. El objetivo es identificar el modelo base más prometedor para la posterior fase de optimización de hiperparámetros.
|
||||||
|
|
||||||
|
### Configuración del Experimento
|
||||||
|
|
||||||
|
#### Dataset de Evaluación
|
||||||
|
|
||||||
|
Se utilizó el documento "Instrucciones para la redacción y elaboración del TFE" del Máster Universitario en Inteligencia Artificial de UNIR, ubicado en la carpeta `instructions/`.
|
||||||
|
|
||||||
|
**Tabla 3.** *Características del dataset de evaluación.*
|
||||||
|
|
||||||
|
| Característica | Valor |
|
||||||
|
|----------------|-------|
|
||||||
|
| Número de páginas evaluadas | 5 (páginas 1-5 en benchmark inicial) |
|
||||||
|
| Formato | PDF digital (no escaneado) |
|
||||||
|
| Idioma | Español |
|
||||||
|
| Resolución de conversión | 300 DPI |
|
||||||
|
|
||||||
|
*Fuente: Elaboración propia.*
|
||||||
|
|
||||||
|
#### Configuración de los Modelos
|
||||||
|
|
||||||
|
Según el código en `ocr_benchmark_notebook.ipynb`:
|
||||||
|
|
||||||
|
**EasyOCR**:
|
||||||
|
```python
easyocr_reader = easyocr.Reader(['es', 'en'])  # Spanish and English
```
|
||||||
|
|
||||||
|
**PaddleOCR (PP-OCRv5)**:
|
||||||
|
```python
paddleocr_model = PaddleOCR(
    text_detection_model_name="PP-OCRv5_server_det",
    text_recognition_model_name="PP-OCRv5_server_rec",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=True,
)
```
|
||||||
|
Versión utilizada: PaddleOCR 3.2.0 (según output del notebook)
|
||||||
|
|
||||||
|
**DocTR**:
|
||||||
|
```python
doctr_model = ocr_predictor(det_arch="db_resnet50", reco_arch="sar_resnet31", pretrained=True)
```
|
||||||
|
|
||||||
|
#### Métricas de Evaluación
|
||||||
|
|
||||||
|
Se utilizó la biblioteca `jiwer` para calcular CER y WER:
|
||||||
|
```python
from jiwer import wer, cer

def evaluate_text(reference, prediction):
    return {'WER': wer(reference, prediction), 'CER': cer(reference, prediction)}
```
|
||||||
|
|
||||||
|
### Resultados del Benchmark
|
||||||
|
|
||||||
|
#### Resultados de PaddleOCR (Configuración Baseline)
|
||||||
|
|
||||||
|
Durante el benchmark inicial se evaluó PaddleOCR con configuración por defecto en un subconjunto del dataset. Los resultados preliminares mostraron variabilidad significativa entre páginas, con CER entre 1.54% y 6.40% dependiendo de la complejidad del layout.
|
||||||
|
|
||||||
|
**Observaciones del benchmark inicial:**
|
||||||
|
- Las páginas con tablas y layouts complejos presentaron mayor error
|
||||||
|
- La página 8 (texto corrido) obtuvo el mejor resultado (CER ~1.5%)
|
||||||
|
- El promedio general se situó en CER ~5-6%
|
||||||
|
|
||||||
|
#### Comparativa de Modelos
|
||||||
|
|
||||||
|
Según la documentación del notebook `ocr_benchmark_notebook.ipynb`, los tres modelos evaluados representan diferentes paradigmas de OCR:
|
||||||
|
|
||||||
|
**Tabla 5.** *Comparativa de arquitecturas OCR evaluadas.*
|
||||||
|
|
||||||
|
| Modelo | Tipo | Componentes | Fortalezas Clave |
|--------|------|-------------|------------------|
| **EasyOCR** | End-to-end (det + rec) | CRAFT + CRNN/Transformer | Ligero, fácil de usar, multilingüe |
| **PaddleOCR (PP-OCR)** | End-to-end (det + rec + cls) | DB + SVTR/CRNN | Soporte multilingüe robusto, pipeline configurable |
| **DocTR** | End-to-end (det + rec) | DB/LinkNet + CRNN/SAR/ViTSTR | Orientado a investigación, API limpia |
|
||||||
|
|
||||||
|
*Fuente: Documentación oficial de cada herramienta (JaidedAI, 2020; PaddlePaddle, 2024; Mindee, 2021).*
|
||||||
|
|
||||||
|
#### Ejemplo de Salida OCR
|
||||||
|
|
||||||
|
A modo de ejemplo, la siguiente predicción de PaddleOCR para la página 8, extraída del archivo CSV de resultados:
|
||||||
|
|
||||||
|
> "Escribe siempre al menos un párrafo de introducción en cada capítulo o apartado, explicando de qué vas a tratar en esa sección. Evita que aparezcan dos encabezados de nivel consecutivos sin ningún texto entre medias. [...] En esta titulacióon se cita de acuerdo con la normativa Apa."
|
||||||
|
|
||||||
|
**Errores observados en este ejemplo:**
|
||||||
|
- `titulacióon` en lugar de `titulación` (carácter duplicado)
|
||||||
|
- `Apa` en lugar de `APA` (capitalización)
|
||||||
|
|
||||||
|
### Justificación de la Selección de PaddleOCR
|
||||||
|
|
||||||
|
#### Criterios de Selección
|
||||||
|
|
||||||
|
Basándose en los resultados obtenidos y la documentación del benchmark:
|
||||||
|
|
||||||
|
1. **Rendimiento**: PaddleOCR obtuvo CER entre 1.54% y 6.40% en las páginas evaluadas
|
||||||
|
2. **Configurabilidad**: PaddleOCR ofrece múltiples hiperparámetros ajustables:
|
||||||
|
- Umbrales de detección (`text_det_thresh`, `text_det_box_thresh`)
|
||||||
|
- Umbral de reconocimiento (`text_rec_score_thresh`)
|
||||||
|
- Componentes opcionales (`use_textline_orientation`, `use_doc_orientation_classify`, `use_doc_unwarping`)
|
||||||
|
|
||||||
|
3. **Documentación oficial**: [PaddleOCR Documentation](https://www.paddleocr.ai/v3.0.0/en/version3.x/pipeline_usage/OCR.html)
|
||||||
|
|
||||||
|
#### Decisión
|
||||||
|
|
||||||
|
**Se selecciona PaddleOCR (PP-OCRv5)** para la fase de optimización debido a:
|
||||||
|
- Resultados iniciales prometedores (CER ~5%)
|
||||||
|
- Alta configurabilidad de hiperparámetros de inferencia
|
||||||
|
- Pipeline modular que permite experimentación
|
||||||
|
|
||||||
|
### Limitaciones del Benchmark
|
||||||
|
|
||||||
|
1. **Tamaño reducido**: Solo 5 páginas evaluadas en el benchmark comparativo inicial
|
||||||
|
2. **Único tipo de documento**: Documentos académicos de UNIR únicamente
|
||||||
|
3. **Ground truth**: El texto de referencia se extrajo automáticamente del PDF, lo cual puede introducir errores en layouts complejos
|
||||||
|
|
||||||
|
### Resumen de la Sección
|
||||||
|
|
||||||
|
Esta sección ha presentado:
|
||||||
|
|
||||||
|
1. La configuración del benchmark según `ocr_benchmark_notebook.ipynb`
|
||||||
|
2. Los resultados cuantitativos de PaddleOCR del archivo CSV de resultados
|
||||||
|
3. La justificación de la selección de PaddleOCR para optimización
|
||||||
|
|
||||||
|
**Fuentes de datos utilizadas:**
|
||||||
|
- `ocr_benchmark_notebook.ipynb`: Código del benchmark
|
||||||
|
- Documentación oficial de PaddleOCR
|
||||||
|
|
||||||
|
## Desarrollo de la comparativa: Optimización de hiperparámetros
|
||||||
|
|
||||||
|
### Introducción
|
||||||
|
|
||||||
|
Esta sección describe el proceso de optimización de hiperparámetros de PaddleOCR utilizando Ray Tune con el algoritmo de búsqueda Optuna. Los experimentos fueron implementados en el notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb` y los resultados se almacenaron en `src/raytune_paddle_subproc_results_20251207_192320.csv`.
|
||||||
|
|
||||||
|
### Configuración del Experimento
|
||||||
|
|
||||||
|
#### Entorno de Ejecución
|
||||||
|
|
||||||
|
Según los outputs del notebook:
|
||||||
|
|
||||||
|
**Tabla 6.** *Entorno de ejecución del experimento.*
|
||||||
|
|
||||||
|
| Componente | Versión/Especificación |
|
||||||
|
|------------|------------------------|
|
||||||
|
| Python | 3.11.9 |
|
||||||
|
| PaddlePaddle | 3.2.2 |
|
||||||
|
| PaddleOCR | 3.3.2 |
|
||||||
|
| Ray | 2.52.1 |
|
||||||
|
| GPU | No disponible (CPU only) |
|
||||||
|
|
||||||
|
*Fuente: Outputs del notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`.*
|
||||||
|
|
||||||
|
#### Dataset
|
||||||
|
|
||||||
|
Se utilizó un dataset estructurado en `src/dataset/` creado mediante el notebook `src/prepare_dataset.ipynb`:
|
||||||
|
|
||||||
|
- **Estructura**: Carpetas con subcarpetas `img/` y `txt/` pareadas
|
||||||
|
- **Páginas evaluadas por trial**: 5 (páginas 5-10 del documento)
|
||||||
|
- **Gestión de datos**: Clase `ImageTextDataset` en `src/dataset_manager.py` (véase el esquema ilustrativo tras esta lista)
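
El siguiente esquema es una reconstrucción hipotética de cómo puede organizarse una clase de este tipo sobre carpetas `img/` y `txt/` pareadas; la interfaz real de `ImageTextDataset` puede diferir:

```python
# Esquema HIPOTÉTICO: ilustra la idea de pares imagen-texto emparejados por nombre.
# La interfaz real de ImageTextDataset en src/dataset_manager.py puede diferir.
from pathlib import Path


class ImageTextDataset:
    def __init__(self, root: str):
        self.img_dir = Path(root) / "img"
        self.txt_dir = Path(root) / "txt"
        # Se asume que cada imagen tiene un .txt con el mismo nombre base
        self.stems = sorted(p.stem for p in self.img_dir.glob("*.png"))

    def __len__(self) -> int:
        return len(self.stems)

    def __getitem__(self, idx: int) -> tuple[str, str]:
        stem = self.stems[idx]
        image_path = str(self.img_dir / f"{stem}.png")
        ground_truth = (self.txt_dir / f"{stem}.txt").read_text(encoding="utf-8")
        return image_path, ground_truth
```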
|
||||||
|
|
||||||
|
#### Espacio de Búsqueda
|
||||||
|
|
||||||
|
Según el código del notebook, se definió el siguiente espacio de búsqueda:
|
||||||
|
|
||||||
|
```python
|
||||||
|
search_space = {
|
||||||
|
"use_doc_orientation_classify": tune.choice([True, False]),
|
||||||
|
"use_doc_unwarping": tune.choice([True, False]),
|
||||||
|
"textline_orientation": tune.choice([True, False]),
|
||||||
|
"text_det_thresh": tune.uniform(0.0, 0.7),
|
||||||
|
"text_det_box_thresh": tune.uniform(0.0, 0.7),
|
||||||
|
"text_det_unclip_ratio": tune.choice([0.0]), # Fijado
|
||||||
|
"text_rec_score_thresh": tune.uniform(0.0, 0.7),
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Descripción de parámetros** (según documentación de PaddleOCR):
|
||||||
|
|
||||||
|
| Parámetro | Descripción |
|
||||||
|
|-----------|-------------|
|
||||||
|
| `use_doc_orientation_classify` | Clasificación de orientación del documento |
|
||||||
|
| `use_doc_unwarping` | Corrección de deformación del documento |
|
||||||
|
| `textline_orientation` | Clasificación de orientación de línea de texto |
|
||||||
|
| `text_det_thresh` | Umbral de detección de píxeles de texto |
|
||||||
|
| `text_det_box_thresh` | Umbral de caja de detección |
|
||||||
|
| `text_det_unclip_ratio` | Coeficiente de expansión (fijado en 0.0) |
|
||||||
|
| `text_rec_score_thresh` | Umbral de confianza de reconocimiento |
|
||||||
|
|
||||||
|
#### Configuración de Ray Tune
|
||||||
|
|
||||||
|
```python
|
||||||
|
tuner = tune.Tuner(
|
||||||
|
trainable_paddle_ocr,
|
||||||
|
tune_config=tune.TuneConfig(
|
||||||
|
metric="CER",
|
||||||
|
mode="min",
|
||||||
|
search_alg=OptunaSearch(),
|
||||||
|
num_samples=64,
|
||||||
|
max_concurrent_trials=2
|
||||||
|
),
|
||||||
|
run_config=air.RunConfig(verbose=2, log_to_file=False),
|
||||||
|
param_space=search_space
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
- **Métrica objetivo**: CER (minimizar)
|
||||||
|
- **Algoritmo de búsqueda**: Optuna (TPE - Tree-structured Parzen Estimator)
|
||||||
|
- **Número de trials**: 64
|
||||||
|
- **Trials concurrentes**: 2
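
La función `trainable_paddle_ocr` referenciada en la configuración anterior no se reproduce aquí; el siguiente esquema ilustra únicamente el patrón de lanzar la evaluación en un subproceso (a través de `src/paddle_ocr_tuning.py`) para aislar PaddleOCR del worker de Ray y devolver las métricas. Los flags de línea de comandos y el formato JSON de salida son supuestos ilustrativos; en las funciones entrenables de Ray Tune, el diccionario devuelto se registra como resultado final del trial.

```python
# Esquema orientativo: los flags CLI de paddle_ocr_tuning.py son SUPUESTOS,
# al igual que el formato JSON de su salida; solo se ilustra el patrón subproceso + métricas.
import json
import subprocess
import sys


def trainable_paddle_ocr_sketch(config: dict) -> dict:
    cmd = [sys.executable, "src/paddle_ocr_tuning.py",
           "--config-json", json.dumps(config)]          # argumento hipotético
    salida = subprocess.run(cmd, capture_output=True, text=True, check=True)
    metricas = json.loads(salida.stdout)                 # se asume {"CER": ..., "WER": ...}
    return {"CER": metricas["CER"], "WER": metricas["WER"]}
```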
|
||||||
|
|
||||||
|
### Resultados de la Optimización
|
||||||
|
|
||||||
|
#### Estadísticas Descriptivas
|
||||||
|
|
||||||
|
Del archivo CSV de resultados (`raytune_paddle_subproc_results_20251207_192320.csv`):
|
||||||
|
|
||||||
|
**Tabla 7.** *Estadísticas descriptivas de los 64 trials de Ray Tune.*
|
||||||
|
|
||||||
|
| Estadística | CER | WER | Tiempo (s) | Tiempo/Página (s) |
|
||||||
|
|-------------|-----|-----|------------|-------------------|
|
||||||
|
| **count** | 64 | 64 | 64 | 64 |
|
||||||
|
| **mean** | 5.25% | 14.28% | 347.61 | 69.42 |
|
||||||
|
| **std** | 11.03% | 10.75% | 7.88 | 1.57 |
|
||||||
|
| **min** | 1.15% | 9.89% | 320.97 | 64.10 |
|
||||||
|
| **25%** | 1.20% | 10.04% | 344.24 | 68.76 |
|
||||||
|
| **50%** | 1.23% | 10.20% | 346.42 | 69.19 |
|
||||||
|
| **75%** | 4.03% | 13.20% | 350.14 | 69.93 |
|
||||||
|
| **max** | 51.61% | 59.45% | 368.57 | 73.63 |
|
||||||
|
|
||||||
|
*Fuente: `src/raytune_paddle_subproc_results_20251207_192320.csv`.*
|
||||||
|
|
||||||
|
#### Mejor Configuración Encontrada
|
||||||
|
|
||||||
|
Según el análisis del notebook:
|
||||||
|
|
||||||
|
```
|
||||||
|
Best CER: 0.011535 (1.15%)
|
||||||
|
Best WER: 0.098902 (9.89%)
|
||||||
|
|
||||||
|
Configuración óptima:
|
||||||
|
textline_orientation: True
|
||||||
|
use_doc_orientation_classify: False
|
||||||
|
use_doc_unwarping: False
|
||||||
|
text_det_thresh: 0.4690
|
||||||
|
text_det_box_thresh: 0.5412
|
||||||
|
text_det_unclip_ratio: 0.0
|
||||||
|
text_rec_score_thresh: 0.6350
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Análisis de Correlación
|
||||||
|
|
||||||
|
Correlación de Pearson entre parámetros y métricas de error (del notebook):
|
||||||
|
|
||||||
|
**Correlación con CER:**
|
||||||
|
| Parámetro | Correlación |
|
||||||
|
|-----------|-------------|
|
||||||
|
| CER | 1.000 |
|
||||||
|
| config/text_det_box_thresh | 0.226 |
|
||||||
|
| config/text_rec_score_thresh | -0.161 |
|
||||||
|
| **config/text_det_thresh** | **-0.523** |
|
||||||
|
| config/text_det_unclip_ratio | NaN |
|
||||||
|
|
||||||
|
**Correlación con WER:**
|
||||||
|
| Parámetro | Correlación |
|
||||||
|
|-----------|-------------|
|
||||||
|
| WER | 1.000 |
|
||||||
|
| config/text_det_box_thresh | 0.227 |
|
||||||
|
| config/text_rec_score_thresh | -0.173 |
|
||||||
|
| **config/text_det_thresh** | **-0.521** |
|
||||||
|
| config/text_det_unclip_ratio | NaN |
|
||||||
|
|
||||||
|
**Hallazgo clave**: El parámetro `text_det_thresh` muestra la correlación más fuerte (-0.52), indicando que valores más altos de este umbral tienden a reducir el error.
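
Estas correlaciones pueden reproducirse, de forma orientativa, cargando el CSV de resultados con pandas; se asume que las columnas conservan los nombres mostrados en las tablas anteriores:

```python
# Esquema orientativo: se asume que el CSV conserva las columnas mostradas arriba.
import pandas as pd

df = pd.read_csv("src/raytune_paddle_subproc_results_20251207_192320.csv")
columnas = ["CER", "WER", "config/text_det_thresh",
            "config/text_det_box_thresh", "config/text_rec_score_thresh"]
print(df[columnas].corr(method="pearson")[["CER", "WER"]])
```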
|
||||||
|
|
||||||
|
#### Impacto del Parámetro textline_orientation
|
||||||
|
|
||||||
|
Según el análisis del notebook, este parámetro booleano tiene el mayor impacto:
|
||||||
|
|
||||||
|
**Tabla 8.** *Impacto del parámetro textline_orientation en las métricas de error.*
|
||||||
|
|
||||||
|
| textline_orientation | CER Medio | WER Medio |
|
||||||
|
|---------------------|-----------|-----------|
|
||||||
|
| True | ~3.76% | ~12.73% |
|
||||||
|
| False | ~12.40% | ~21.71% |
|
||||||
|
|
||||||
|
*Fuente: Análisis del notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`.*
|
||||||
|
|
||||||
|
**Interpretación**:
|
||||||
|
El CER medio es ~3.3x menor con `textline_orientation=True` (3.76% vs 12.40%). Además, la varianza es mucho menor, lo que indica resultados más consistentes. Para documentos en español con layouts mixtos (tablas, encabezados, direcciones), la clasificación de orientación ayuda a PaddleOCR a ordenar correctamente las líneas de texto.
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#0098CD'}}}%%
|
||||||
|
xychart-beta
|
||||||
|
title "Impacto de textline_orientation en CER"
|
||||||
|
x-axis ["textline_orientation=False", "textline_orientation=True"]
|
||||||
|
y-axis "CER (%)" 0 --> 15
|
||||||
|
bar [12.40, 3.76]
|
||||||
|
```
|
||||||
|
|
||||||
|
*Figura 3. Comparación del CER medio según el valor del parámetro textline_orientation.*
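
Las medias de la Tabla 8 pueden obtenerse, a modo orientativo, agregando el CSV de resultados por el valor del parámetro; el nombre de columna `config/textline_orientation` es un supuesto coherente con el prefijo `config/` usado en las tablas de correlación:

```python
# Esquema orientativo; el nombre exacto de la columna booleana es un supuesto.
import pandas as pd

df = pd.read_csv("src/raytune_paddle_subproc_results_20251207_192320.csv")
resumen = df.groupby("config/textline_orientation")[["CER", "WER"]].agg(["mean", "std"])
print(resumen)
```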
|
||||||
|
|
||||||
|
#### Análisis de Fallos
|
||||||
|
|
||||||
|
Los trials con CER muy alto (>40%) se produjeron cuando se daba al menos una de estas condiciones:
|
||||||
|
- `text_det_thresh` < 0.1 (valores muy bajos)
|
||||||
|
- `textline_orientation = False`
|
||||||
|
|
||||||
|
Ejemplo de trial con fallo catastrófico:
|
||||||
|
- CER: 51.61%
|
||||||
|
- WER: 59.45%
|
||||||
|
- Configuración: `text_det_thresh=0.017`, `textline_orientation=True`
|
||||||
|
|
||||||
|
### Comparación Baseline vs Optimizado
|
||||||
|
|
||||||
|
#### Resultados sobre Dataset Completo (24 páginas)
|
||||||
|
|
||||||
|
Del análisis final del notebook ejecutando sobre las 24 páginas:
|
||||||
|
|
||||||
|
**Tabla 9.** *Comparación baseline vs configuración optimizada (24 páginas).*
|
||||||
|
|
||||||
|
| Modelo | CER | WER |
|
||||||
|
|--------|-----|-----|
|
||||||
|
| PaddleOCR (Baseline) | 7.78% | 14.94% |
|
||||||
|
| PaddleOCR-HyperAdjust | 1.49% | 7.62% |
|
||||||
|
|
||||||
|
*Fuente: Ejecución final en notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`.*
|
||||||
|
|
||||||
|
#### Métricas de Mejora
|
||||||
|
|
||||||
|
**Tabla 10.** *Análisis de la mejora obtenida.*
|
||||||
|
|
||||||
|
| Métrica | Baseline | Optimizado | Mejora Absoluta | Reducción Error |
|
||||||
|
|---------|----------|------------|-----------------|-----------------|
|
||||||
|
| CER | 7.78% | 1.49% | -6.29 pp | 80.9% |
|
||||||
|
| WER | 14.94% | 7.62% | -7.32 pp | 49.0% |
|
||||||
|
|
||||||
|
*Fuente: Elaboración propia a partir de los resultados experimentales.*
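
Las reducciones relativas de la Tabla 10 se comprueban con una operación directa sobre los porcentajes; con los valores redondeados a dos decimales el CER da ≈80.8%, mientras que el 80.9% del texto procede de los valores sin redondear:

```python
# Verificación aritmética de la reducción relativa del error.
cer_base, cer_opt = 0.0778, 0.0149
wer_base, wer_opt = 0.1494, 0.0762

print(f"Reducción CER: {100 * (cer_base - cer_opt) / cer_base:.1f}%")  # ≈ 80.8% (80.9% sin redondear)
print(f"Reducción WER: {100 * (wer_base - wer_opt) / wer_base:.1f}%")  # ≈ 49.0%
```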
|
||||||
|
|
||||||
|
#### Interpretación (del notebook)
|
||||||
|
|
||||||
|
> "La optimización de hiperparámetros mejoró la precisión de caracteres de 92.2% a 98.5%, una ganancia de 6.3 puntos porcentuales. Aunque el baseline ya ofrecía resultados aceptables, la configuración optimizada reduce los errores residuales en un 80.9%."
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
%%{init: {'theme': 'base'}}%%
|
||||||
|
xychart-beta
|
||||||
|
title "Comparación Baseline vs Optimizado (24 páginas)"
|
||||||
|
x-axis ["CER", "WER"]
|
||||||
|
y-axis "Tasa de error (%)" 0 --> 16
|
||||||
|
bar "Baseline" [7.78, 14.94]
|
||||||
|
bar "Optimizado" [1.49, 7.62]
|
||||||
|
```
|
||||||
|
|
||||||
|
*Figura 4. Comparación de métricas de error entre configuración baseline y optimizada.*
|
||||||
|
|
||||||
|
**Impacto práctico**: En un documento de 10,000 caracteres:
|
||||||
|
- Baseline: ~778 caracteres con error
|
||||||
|
- Optimizado: ~149 caracteres con error
|
||||||
|
- Diferencia: ~629 caracteres menos con errores
|
||||||
|
|
||||||
|
### Tiempo de Ejecución
|
||||||
|
|
||||||
|
| Métrica | Valor |
|
||||||
|
|---------|-------|
|
||||||
|
| Tiempo total del experimento | ~6 horas (64 trials × ~6 min/trial) |
|
||||||
|
| Tiempo medio por trial | 367.72 segundos |
|
||||||
|
| Tiempo medio por página | 69.42 segundos |
|
||||||
|
| Total páginas procesadas | 64 trials × 5 páginas = 320 evaluaciones |
|
||||||
|
|
||||||
|
### Resumen de la Sección
|
||||||
|
|
||||||
|
Esta sección ha presentado:
|
||||||
|
|
||||||
|
1. **Configuración del experimento**: 64 trials con Ray Tune + Optuna sobre 7 hiperparámetros
|
||||||
|
2. **Resultados estadísticos**: CER medio 5.25%, CER mínimo 1.15%
|
||||||
|
3. **Hallazgos clave**:
|
||||||
|
- `textline_orientation=True` es crítico (reduce CER ~70%)
|
||||||
|
- `text_det_thresh` tiene correlación -0.52 con CER
|
||||||
|
- Valores bajos de `text_det_thresh` (<0.1) causan fallos catastróficos
|
||||||
|
4. **Mejora final**: CER reducido de 7.78% a 1.49% (reducción del 80.9%)
|
||||||
|
|
||||||
|
**Fuentes de datos:**
|
||||||
|
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb`: Código del experimento
|
||||||
|
- `src/raytune_paddle_subproc_results_20251207_192320.csv`: Resultados de 64 trials
|
||||||
|
- `src/paddle_ocr_tuning.py`: Script de evaluación
|
||||||
|
|
||||||
|
## Discusión y análisis de resultados
|
||||||
|
|
||||||
|
### Introducción
|
||||||
|
|
||||||
|
Esta sección presenta un análisis consolidado de los resultados obtenidos en las fases de benchmark comparativo y optimización de hiperparámetros. Se discuten las implicaciones prácticas y se evalúa el cumplimiento de los objetivos planteados.
|
||||||
|
|
||||||
|
### Resumen de Resultados
|
||||||
|
|
||||||
|
#### Resultados del Benchmark Comparativo
|
||||||
|
|
||||||
|
En el benchmark inicial, PaddleOCR con configuración por defecto mostró variabilidad en el rendimiento según la complejidad de cada página, con CER promedio en torno al 5-6% y variaciones significativas entre páginas con layouts simples (~1.5%) y complejos (~6.4%).
|
||||||
|
|
||||||
|
#### Resultados de la Optimización con Ray Tune
|
||||||
|
|
||||||
|
Del archivo `src/raytune_paddle_subproc_results_20251207_192320.csv` (64 trials):
|
||||||
|
|
||||||
|
| Métrica | Valor |
|
||||||
|
|---------|-------|
|
||||||
|
| CER mínimo | 1.15% |
|
||||||
|
| CER medio | 5.25% |
|
||||||
|
| CER máximo | 51.61% |
|
||||||
|
| WER mínimo | 9.89% |
|
||||||
|
| WER medio | 14.28% |
|
||||||
|
| WER máximo | 59.45% |
|
||||||
|
|
||||||
|
#### Comparación Final (Dataset Completo - 24 páginas)
|
||||||
|
|
||||||
|
Resultados del notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`:
|
||||||
|
|
||||||
|
| Modelo | CER | Precisión Caracteres | WER | Precisión Palabras |
|
||||||
|
|--------|-----|---------------------|-----|-------------------|
|
||||||
|
| PaddleOCR (Baseline) | 7.78% | 92.22% | 14.94% | 85.06% |
|
||||||
|
| PaddleOCR-HyperAdjust | 1.49% | 98.51% | 7.62% | 92.38% |
|
||||||
|
|
||||||
|
### Análisis de Resultados
|
||||||
|
|
||||||
|
#### Mejora Obtenida
|
||||||
|
|
||||||
|
| Forma de Medición | Valor |
|
||||||
|
|-------------------|-------|
|
||||||
|
| Mejora en precisión de caracteres (absoluta) | +6.29 puntos porcentuales |
|
||||||
|
| Reducción del CER (relativa) | 80.9% |
|
||||||
|
| Mejora en precisión de palabras (absoluta) | +7.32 puntos porcentuales |
|
||||||
|
| Reducción del WER (relativa) | 49.0% |
|
||||||
|
| Precisión final de caracteres | 98.51% |
|
||||||
|
|
||||||
|
#### Impacto de Hiperparámetros Individuales
|
||||||
|
|
||||||
|
**Parámetro `textline_orientation`**
|
||||||
|
|
||||||
|
Este parámetro booleano demostró ser el más influyente:
|
||||||
|
|
||||||
|
| Valor | CER Medio | Impacto |
|
||||||
|
|-------|-----------|---------|
|
||||||
|
| True | ~3.76% | Rendimiento óptimo |
|
||||||
|
| False | ~12.40% | 3.3x peor |
|
||||||
|
|
||||||
|
**Reducción del CER**: 69.7% cuando se habilita la clasificación de orientación de línea.
|
||||||
|
|
||||||
|
**Parámetro `text_det_thresh`**
|
||||||
|
|
||||||
|
Correlación con CER: **-0.523** (la más fuerte de los parámetros continuos)
|
||||||
|
|
||||||
|
| Rango | Comportamiento |
|
||||||
|
|-------|----------------|
|
||||||
|
| < 0.1 | Fallos catastróficos (CER 40-50%) |
|
||||||
|
| 0.3 - 0.6 | Rendimiento óptimo |
|
||||||
|
| Valor óptimo | 0.4690 |
|
||||||
|
|
||||||
|
**Parámetros con menor impacto**
|
||||||
|
|
||||||
|
| Parámetro | Correlación con CER | Valor óptimo |
|
||||||
|
|-----------|---------------------|--------------|
|
||||||
|
| text_det_box_thresh | +0.226 | 0.5412 |
|
||||||
|
| text_rec_score_thresh | -0.161 | 0.6350 |
|
||||||
|
| use_doc_orientation_classify | - | False |
|
||||||
|
| use_doc_unwarping | - | False |
|
||||||
|
|
||||||
|
#### Configuración Óptima Final
|
||||||
|
|
||||||
|
```python
|
||||||
|
config_optimizada = {
|
||||||
|
"textline_orientation": True, # CRÍTICO
|
||||||
|
"use_doc_orientation_classify": False,
|
||||||
|
"use_doc_unwarping": False,
|
||||||
|
"text_det_thresh": 0.4690, # Correlación -0.52
|
||||||
|
"text_det_box_thresh": 0.5412,
|
||||||
|
"text_det_unclip_ratio": 0.0,
|
||||||
|
"text_rec_score_thresh": 0.6350,
|
||||||
|
}
|
||||||
|
```
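
De forma orientativa, esta configuración se trasladaría al constructor del pipeline como se muestra a continuación; obsérvese que el parámetro del constructor es `use_textline_orientation`, mientras que en el espacio de búsqueda se denominó `textline_orientation`:

```python
# Esquema orientativo de instanciación del pipeline con la configuración óptima.
from paddleocr import PaddleOCR

ocr_optimizado = PaddleOCR(
    text_detection_model_name="PP-OCRv5_server_det",
    text_recognition_model_name="PP-OCRv5_server_rec",
    use_textline_orientation=True,        # equivale a textline_orientation=True
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    text_det_thresh=0.4690,
    text_det_box_thresh=0.5412,
    text_det_unclip_ratio=0.0,
    text_rec_score_thresh=0.6350,
)
```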
|
||||||
|
|
||||||
|
### Discusión
|
||||||
|
|
||||||
|
#### Hallazgos Principales
|
||||||
|
|
||||||
|
1. **Importancia de la clasificación de orientación de línea**: El parámetro `textline_orientation=True` es el factor más determinante. Esto tiene sentido para documentos con layouts mixtos (tablas, encabezados, direcciones) donde el orden correcto de las líneas de texto es crucial.
|
||||||
|
|
||||||
|
2. **Umbral de detección crítico**: El parámetro `text_det_thresh` presenta un umbral mínimo efectivo (~0.1). Valores inferiores generan demasiados falsos positivos en la detección, corrompiendo el reconocimiento posterior.
|
||||||
|
|
||||||
|
3. **Componentes opcionales innecesarios**: Para documentos académicos digitales (no escaneados), los módulos de corrección de orientación de documento (`use_doc_orientation_classify`) y corrección de deformación (`use_doc_unwarping`) no aportan mejora e incluso pueden introducir overhead.
|
||||||
|
|
||||||
|
#### Interpretación de la Correlación Negativa
|
||||||
|
|
||||||
|
La correlación negativa de `text_det_thresh` (-0.52) con el CER indica que:
|
||||||
|
- Umbrales más altos filtran detecciones de baja confianza
|
||||||
|
- Esto reduce falsos positivos que generan texto erróneo
|
||||||
|
- El reconocimiento es más preciso con menos regiones pero más confiables
|
||||||
|
|
||||||
|
#### Limitaciones de los Resultados
|
||||||
|
|
||||||
|
1. **Generalización**: Los resultados se obtuvieron sobre documentos de un único tipo (instrucciones académicas UNIR). La configuración óptima puede variar para otros tipos de documentos.
|
||||||
|
|
||||||
|
2. **Ground truth automático**: El texto de referencia se extrajo programáticamente del PDF. En layouts complejos, esto puede introducir errores en la evaluación.
|
||||||
|
|
||||||
|
3. **Ejecución en CPU**: Los tiempos reportados (~69s/página) corresponden a ejecución en CPU. Con GPU, los tiempos serían significativamente menores.
|
||||||
|
|
||||||
|
4. **Parámetro fijo**: `text_det_unclip_ratio` permaneció fijo en 0.0 durante todo el experimento por decisión de diseño.
|
||||||
|
|
||||||
|
#### Comparación con Objetivos
|
||||||
|
|
||||||
|
| Objetivo | Meta | Resultado | Cumplimiento |
|
||||||
|
|----------|------|-----------|--------------|
|
||||||
|
| OE1: Comparar soluciones OCR | Evaluar EasyOCR, PaddleOCR, DocTR | PaddleOCR seleccionado | ✓ |
|
||||||
|
| OE2: Preparar dataset | Construir dataset estructurado | Dataset de 24 páginas | ✓ |
|
||||||
|
| OE3: Identificar hiperparámetros críticos | Analizar correlaciones | `textline_orientation` y `text_det_thresh` identificados | ✓ |
|
||||||
|
| OE4: Optimizar con Ray Tune | Mínimo 50 configuraciones | 64 trials ejecutados | ✓ |
|
||||||
|
| OE5: Validar configuración | Documentar mejora | CER 7.78% → 1.49% | ✓ |
|
||||||
|
| **Objetivo General** | CER < 2% | CER = 1.49% | ✓ |
|
||||||
|
|
||||||
|
### Implicaciones Prácticas
|
||||||
|
|
||||||
|
#### Recomendaciones de Configuración
|
||||||
|
|
||||||
|
Para documentos académicos en español similares a los evaluados:
|
||||||
|
|
||||||
|
1. **Obligatorio**: `use_textline_orientation=True`
|
||||||
|
2. **Recomendado**: `text_det_thresh` entre 0.4 y 0.5
|
||||||
|
3. **Opcional**: `text_det_box_thresh` ~0.5, `text_rec_score_thresh` >0.6
|
||||||
|
4. **No recomendado**: Habilitar `use_doc_orientation_classify` o `use_doc_unwarping` para documentos digitales
|
||||||
|
|
||||||
|
#### Impacto Cuantitativo
|
||||||
|
|
||||||
|
En un documento típico de 10,000 caracteres:
|
||||||
|
|
||||||
|
| Configuración | Errores estimados |
|
||||||
|
|---------------|-------------------|
|
||||||
|
| Baseline | ~778 caracteres |
|
||||||
|
| Optimizada | ~149 caracteres |
|
||||||
|
| **Reducción** | **629 caracteres menos con errores** |
|
||||||
|
|
||||||
|
#### Aplicabilidad
|
||||||
|
|
||||||
|
Esta metodología de optimización es aplicable cuando:
|
||||||
|
- No se dispone de recursos GPU para fine-tuning
|
||||||
|
- El modelo preentrenado ya tiene soporte para el idioma objetivo
|
||||||
|
- Se busca mejorar rendimiento sin reentrenar
|
||||||
|
|
||||||
|
### Resumen de la Sección
|
||||||
|
|
||||||
|
Esta sección ha presentado:
|
||||||
|
|
||||||
|
1. Los resultados consolidados del benchmark y la optimización
|
||||||
|
2. El análisis del impacto de cada hiperparámetro
|
||||||
|
3. La configuración óptima identificada
|
||||||
|
4. La discusión de limitaciones y aplicabilidad
|
||||||
|
5. El cumplimiento de los objetivos planteados
|
||||||
|
|
||||||
|
**Resultado principal**: Se logró reducir el CER del 7.78% al 1.49% (mejora del 80.9%) mediante optimización de hiperparámetros, cumpliendo el objetivo de alcanzar CER < 2%.
|
||||||
|
|
||||||
|
**Fuentes de datos:**
|
||||||
|
- `src/raytune_paddle_subproc_results_20251207_192320.csv`: Resultados de 64 trials de optimización
|
||||||
|
- `src/paddle_ocr_fine_tune_unir_raytune.ipynb`: Notebook principal del experimento
|
||||||
113
docs/05_conclusiones_trabajo_futuro.md
Normal file
@@ -0,0 +1,113 @@
|
|||||||
|
# Conclusiones y trabajo futuro
|
||||||
|
|
||||||
|
Este capítulo resume las principales conclusiones del trabajo, evalúa el grado de cumplimiento de los objetivos planteados y propone líneas de trabajo futuro que permitirían ampliar y profundizar los resultados obtenidos.
|
||||||
|
|
||||||
|
## Conclusiones
|
||||||
|
|
||||||
|
### Conclusiones Generales
|
||||||
|
|
||||||
|
Este Trabajo Fin de Máster ha demostrado que es posible mejorar significativamente el rendimiento de sistemas OCR preentrenados mediante optimización sistemática de hiperparámetros, sin requerir fine-tuning ni recursos GPU dedicados.
|
||||||
|
|
||||||
|
El objetivo principal del trabajo era alcanzar un CER inferior al 2% en documentos académicos en español. Los resultados obtenidos confirman el cumplimiento de este objetivo:
|
||||||
|
|
||||||
|
| Métrica | Objetivo | Resultado |
|
||||||
|
|---------|----------|-----------|
|
||||||
|
| CER | < 2% | **1.49%** |
|
||||||
|
|
||||||
|
### Conclusiones Específicas
|
||||||
|
|
||||||
|
**Respecto a OE1 (Comparativa de soluciones OCR)**:
|
||||||
|
- Se evaluaron tres soluciones OCR de código abierto: EasyOCR, PaddleOCR (PP-OCRv5) y DocTR
|
||||||
|
- PaddleOCR demostró el mejor rendimiento base para documentos en español
|
||||||
|
- La configurabilidad del pipeline de PaddleOCR lo hace idóneo para optimización
|
||||||
|
|
||||||
|
**Respecto a OE2 (Preparación del dataset)**:
|
||||||
|
- Se construyó un dataset estructurado con 24 páginas de documentos académicos
|
||||||
|
- La clase `ImageTextDataset` facilita la carga de pares imagen-texto
|
||||||
|
- El ground truth se extrajo automáticamente del PDF mediante PyMuPDF
|
||||||
|
|
||||||
|
**Respecto a OE3 (Identificación de hiperparámetros críticos)**:
|
||||||
|
- El parámetro `textline_orientation` es el más influyente: reduce el CER en un 69.7% cuando está habilitado
|
||||||
|
- El umbral `text_det_thresh` presenta la correlación más fuerte (-0.52) con el CER
|
||||||
|
- Los parámetros de corrección de documento (`use_doc_orientation_classify`, `use_doc_unwarping`) no aportan mejora en documentos digitales
|
||||||
|
|
||||||
|
**Respecto a OE4 (Optimización con Ray Tune)**:
|
||||||
|
- Se ejecutaron 64 trials con el algoritmo OptunaSearch
|
||||||
|
- El tiempo total del experimento fue aproximadamente 6 horas (en CPU)
|
||||||
|
- La arquitectura basada en subprocesos permitió superar incompatibilidades entre Ray y PaddleOCR
|
||||||
|
|
||||||
|
**Respecto a OE5 (Validación de la configuración)**:
|
||||||
|
- Se validó la configuración óptima sobre el dataset completo de 24 páginas
|
||||||
|
- La mejora obtenida fue del 80.9% en reducción del CER (7.78% → 1.49%)
|
||||||
|
- La precisión de caracteres alcanzó el 98.51%
|
||||||
|
|
||||||
|
### Hallazgos Clave
|
||||||
|
|
||||||
|
1. **Arquitectura sobre umbrales**: Un único parámetro booleano (`textline_orientation`) tiene más impacto que todos los umbrales continuos combinados.
|
||||||
|
|
||||||
|
2. **Umbrales mínimos efectivos**: Valores de `text_det_thresh` < 0.1 causan fallos catastróficos (CER >40%).
|
||||||
|
|
||||||
|
3. **Simplicidad para documentos digitales**: Para documentos PDF digitales (no escaneados), los módulos de corrección de orientación y deformación son innecesarios.
|
||||||
|
|
||||||
|
4. **Optimización sin fine-tuning**: Se puede mejorar significativamente el rendimiento de modelos preentrenados mediante ajuste de hiperparámetros de inferencia.
|
||||||
|
|
||||||
|
### Contribuciones del Trabajo
|
||||||
|
|
||||||
|
1. **Metodología reproducible**: Se documenta un proceso completo de optimización de hiperparámetros OCR con Ray Tune + Optuna.
|
||||||
|
|
||||||
|
2. **Análisis de hiperparámetros de PaddleOCR**: Se cuantifica el impacto de cada parámetro configurable mediante correlaciones y análisis comparativo.
|
||||||
|
|
||||||
|
3. **Configuración óptima para español**: Se proporciona una configuración validada para documentos académicos en español.
|
||||||
|
|
||||||
|
4. **Código fuente**: Todo el código está disponible en el repositorio GitHub para reproducción y extensión.
|
||||||
|
|
||||||
|
### Limitaciones del Trabajo
|
||||||
|
|
||||||
|
1. **Tipo de documento único**: Los experimentos se realizaron únicamente sobre documentos académicos de UNIR. La generalización a otros tipos de documentos requiere validación adicional.
|
||||||
|
|
||||||
|
2. **Tamaño del dataset**: 24 páginas es un corpus limitado para conclusiones estadísticamente robustas.
|
||||||
|
|
||||||
|
3. **Ground truth automático**: La extracción automática del texto de referencia puede introducir errores en layouts complejos.
|
||||||
|
|
||||||
|
4. **Ejecución en CPU**: Los tiempos de procesamiento (~69s/página) limitan la aplicabilidad en escenarios de alto volumen.
|
||||||
|
|
||||||
|
5. **Parámetro no explorado**: `text_det_unclip_ratio` permaneció fijo en 0.0 durante todo el experimento.
|
||||||
|
|
||||||
|
## Líneas de trabajo futuro
|
||||||
|
|
||||||
|
### Extensiones Inmediatas
|
||||||
|
|
||||||
|
1. **Validación cruzada**: Evaluar la configuración óptima en otros tipos de documentos en español (facturas, formularios, textos manuscritos).
|
||||||
|
|
||||||
|
2. **Exploración de `text_det_unclip_ratio`**: Incluir este parámetro en el espacio de búsqueda.
|
||||||
|
|
||||||
|
3. **Dataset ampliado**: Construir un corpus más amplio y diverso de documentos en español.
|
||||||
|
|
||||||
|
4. **Evaluación con GPU**: Medir tiempos de inferencia con aceleración GPU.
|
||||||
|
|
||||||
|
### Líneas de Investigación
|
||||||
|
|
||||||
|
1. **Transfer learning de hiperparámetros**: Investigar si las configuraciones óptimas para un tipo de documento transfieren a otros dominios.
|
||||||
|
|
||||||
|
2. **Optimización multi-objetivo**: Considerar simultáneamente CER, WER y tiempo de inferencia como objetivos.
|
||||||
|
|
||||||
|
3. **AutoML para OCR**: Aplicar técnicas de AutoML más avanzadas (Neural Architecture Search, meta-learning).
|
||||||
|
|
||||||
|
4. **Comparación con fine-tuning**: Cuantificar la brecha de rendimiento entre optimización de hiperparámetros y fine-tuning real.
|
||||||
|
|
||||||
|
### Aplicaciones Prácticas
|
||||||
|
|
||||||
|
1. **Herramienta de configuración automática**: Desarrollar una herramienta que determine automáticamente la configuración óptima para un nuevo tipo de documento.
|
||||||
|
|
||||||
|
2. **Integración en pipelines de producción**: Implementar la configuración optimizada en sistemas reales de procesamiento documental.
|
||||||
|
|
||||||
|
3. **Benchmark público**: Publicar un benchmark de OCR para documentos en español que facilite la comparación de soluciones.
|
||||||
|
|
||||||
|
### Reflexión Final
|
||||||
|
|
||||||
|
Este trabajo demuestra que, en un contexto de recursos limitados donde el fine-tuning de modelos de deep learning no es viable, la optimización de hiperparámetros representa una alternativa práctica y efectiva para mejorar sistemas OCR.
|
||||||
|
|
||||||
|
La metodología propuesta es reproducible, los resultados son cuantificables, y las conclusiones son aplicables a escenarios reales de procesamiento documental. La reducción del CER del 7.78% al 1.49% representa una mejora sustancial que puede tener impacto directo en aplicaciones downstream como extracción de información, análisis semántico y búsqueda de documentos.
|
||||||
|
|
||||||
|
El código fuente y los datos experimentales están disponibles públicamente para facilitar la reproducción y extensión de este trabajo.
|
||||||
|
|
||||||
50
docs/06_referencias_bibliograficas.md
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
# Referencias bibliográficas {.unnumbered}
|
||||||
|
|
||||||
|
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, 2623-2631. https://doi.org/10.1145/3292500.3330701
|
||||||
|
|
||||||
|
Baek, Y., Lee, B., Han, D., Yun, S., & Lee, H. (2019). Character region awareness for text detection. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 9365-9374. https://doi.org/10.1109/CVPR.2019.00959
|
||||||
|
|
||||||
|
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. *Journal of Machine Learning Research*, 13(1), 281-305. https://jmlr.org/papers/v13/bergstra12a.html
|
||||||
|
|
||||||
|
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. *Advances in Neural Information Processing Systems*, 24, 2546-2554. https://papers.nips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html
|
||||||
|
|
||||||
|
Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Lawrence Erlbaum Associates.
|
||||||
|
|
||||||
|
Doran, G. T. (1981). There's a S.M.A.R.T. way to write management's goals and objectives. *Management Review*, 70(11), 35-36.
|
||||||
|
|
||||||
|
Du, Y., Li, C., Guo, R., Yin, X., Liu, W., Zhou, J., Bai, Y., Yu, Z., Yang, Y., Dang, Q., & Wang, H. (2020). PP-OCR: A practical ultra lightweight OCR system. *arXiv preprint arXiv:2009.09941*. https://arxiv.org/abs/2009.09941
|
||||||
|
|
||||||
|
Du, Y., Li, C., Guo, R., Cui, C., Liu, W., Zhou, J., Lu, B., Yang, Y., Liu, Q., Hu, X., Yu, D., & Wang, H. (2023). PP-OCRv4: Mobile scene text detection and recognition. *arXiv preprint arXiv:2310.05930*. https://arxiv.org/abs/2310.05930
|
||||||
|
|
||||||
|
Feurer, M., & Hutter, F. (2019). Hyperparameter optimization. In F. Hutter, L. Kotthoff, & J. Vanschoren (Eds.), *Automated machine learning: Methods, systems, challenges* (pp. 3-33). Springer. https://doi.org/10.1007/978-3-030-05318-5_1
|
||||||
|
|
||||||
|
He, P., Huang, W., Qiao, Y., Loy, C. C., & Tang, X. (2016). Reading scene text in deep convolutional sequences. *Proceedings of the AAAI Conference on Artificial Intelligence*, 30(1), 3501-3508. https://doi.org/10.1609/aaai.v30i1.10291
|
||||||
|
|
||||||
|
JaidedAI. (2020). EasyOCR: Ready-to-use OCR with 80+ supported languages. GitHub. https://github.com/JaidedAI/EasyOCR
|
||||||
|
|
||||||
|
Liang, J., Doermann, D., & Li, H. (2005). Camera-based analysis of text and documents: A survey. *International Journal of Document Analysis and Recognition*, 7(2), 84-104. https://doi.org/10.1007/s10032-004-0138-z
|
||||||
|
|
||||||
|
Liao, M., Wan, Z., Yao, C., Chen, K., & Bai, X. (2020). Real-time scene text detection with differentiable binarization. *Proceedings of the AAAI Conference on Artificial Intelligence*, 34(07), 11474-11481. https://doi.org/10.1609/aaai.v34i07.6812
|
||||||
|
|
||||||
|
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., & Stoica, I. (2018). Tune: A research platform for distributed model selection and training. *arXiv preprint arXiv:1807.05118*. https://arxiv.org/abs/1807.05118
|
||||||
|
|
||||||
|
Mindee. (2021). DocTR: Document Text Recognition. GitHub. https://github.com/mindee/doctr
|
||||||
|
|
||||||
|
Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., & Stoica, I. (2018). Ray: A distributed framework for emerging AI applications. *13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)*, 561-577. https://www.usenix.org/conference/osdi18/presentation/moritz
|
||||||
|
|
||||||
|
Morris, A. C., Maier, V., & Green, P. D. (2004). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. *Eighth International Conference on Spoken Language Processing*. https://doi.org/10.21437/Interspeech.2004-668
|
||||||
|
|
||||||
|
PaddlePaddle. (2024). PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle. GitHub. https://github.com/PaddlePaddle/PaddleOCR
|
||||||
|
|
||||||
|
Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. *Proceedings of the Royal Society of London*, 58, 240-242. https://doi.org/10.1098/rspl.1895.0041
|
||||||
|
|
||||||
|
PyMuPDF. (2024). PyMuPDF documentation. https://pymupdf.readthedocs.io/
|
||||||
|
|
||||||
|
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 39(11), 2298-2304. https://doi.org/10.1109/TPAMI.2016.2646371
|
||||||
|
|
||||||
|
Smith, R. (2007). An overview of the Tesseract OCR engine. *Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)*, 2, 629-633. https://doi.org/10.1109/ICDAR.2007.4376991
|
||||||
|
|
||||||
|
Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J. (2017). EAST: An efficient and accurate scene text detector. *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 5551-5560. https://doi.org/10.1109/CVPR.2017.283
|
||||||
|
|
||||||
|
Zoph, B., & Le, Q. V. (2017). Neural architecture search with reinforcement learning. *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1611.01578
|
||||||
|
|
||||||
68
docs/07_anexo_a.md
Normal file
@@ -0,0 +1,68 @@
|
|||||||
|
# Anexo A. Código fuente y datos analizados {.unnumbered}
|
||||||
|
|
||||||
|
## A.1 Repositorio del Proyecto
|
||||||
|
|
||||||
|
El código fuente completo y los datos utilizados en este trabajo están disponibles en el siguiente repositorio:
|
||||||
|
|
||||||
|
**URL del repositorio:** https://github.com/seryus/MastersThesis
|
||||||
|
|
||||||
|
El repositorio incluye:
|
||||||
|
|
||||||
|
- **Notebooks de experimentación**: Código completo de los experimentos realizados
|
||||||
|
- **Scripts de evaluación**: Herramientas para evaluar modelos OCR
|
||||||
|
- **Dataset**: Imágenes y textos de referencia utilizados
|
||||||
|
- **Resultados**: Archivos CSV con los resultados de los 64 trials de Ray Tune
|
||||||
|
|
||||||
|
## A.2 Estructura del Repositorio
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
---
|
||||||
|
title: "Estructura del repositorio del proyecto"
|
||||||
|
---
|
||||||
|
flowchart LR
|
||||||
|
root["MastersThesis/"] --> docs["docs/"]
|
||||||
|
root --> src["src/"]
|
||||||
|
root --> instructions["instructions/"]
|
||||||
|
root --> scripts["Scripts generación"]
|
||||||
|
|
||||||
|
src --> nb1["paddle_ocr_fine_tune_unir_raytune.ipynb"]
|
||||||
|
src --> py1["paddle_ocr_tuning.py"]
|
||||||
|
src --> csv["raytune_paddle_subproc_results_*.csv"]
|
||||||
|
|
||||||
|
scripts --> gen1["generate_mermaid_figures.py"]
|
||||||
|
scripts --> gen2["apply_content.py"]
|
||||||
|
```
|
||||||
|
|
||||||
|
**Descripción de componentes:**
|
||||||
|
|
||||||
|
- **docs/**: Capítulos de la tesis en Markdown (estructura UNIR)
|
||||||
|
- **src/**: Código fuente de experimentación
|
||||||
|
- `paddle_ocr_fine_tune_unir_raytune.ipynb`: Notebook principal con 64 trials Ray Tune
|
||||||
|
- `paddle_ocr_tuning.py`: Script CLI para evaluación OCR
|
||||||
|
- `raytune_paddle_subproc_results_20251207_192320.csv`: Resultados de optimización
|
||||||
|
- **instructions/**: Plantilla e instrucciones UNIR
|
||||||
|
- **Scripts de generación**: `generate_mermaid_figures.py` y `apply_content.py` para generar el documento TFM
|
||||||
|
|
||||||
|
## A.3 Requisitos de Software
|
||||||
|
|
||||||
|
Para reproducir los experimentos se requieren las siguientes dependencias:
|
||||||
|
|
||||||
|
| Componente | Versión |
|
||||||
|
|------------|---------|
|
||||||
|
| Python | 3.11.9 |
|
||||||
|
| PaddlePaddle | 3.2.2 |
|
||||||
|
| PaddleOCR | 3.3.2 |
|
||||||
|
| Ray | 2.52.1 |
|
||||||
|
| Optuna | 4.6.0 |
|
||||||
|
| jiwer | (última versión) |
|
||||||
|
| PyMuPDF | (última versión) |
|
||||||
|
|
||||||
|
## A.4 Instrucciones de Ejecución
|
||||||
|
|
||||||
|
1. Clonar el repositorio
|
||||||
|
2. Instalar dependencias: `pip install -r requirements.txt`
|
||||||
|
3. Ejecutar el notebook `src/paddle_ocr_fine_tune_unir_raytune.ipynb`
|
||||||
|
|
||||||
|
## A.5 Licencia
|
||||||
|
|
||||||
|
El código se distribuye bajo licencia MIT.
|
||||||
113
generate_mermaid_figures.py
Normal file
@@ -0,0 +1,113 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Extract Mermaid diagrams from markdown files and convert to PNG images."""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import json
|
||||||
|
|
||||||
|
BASE_DIR = '/Users/sergio/Desktop/MastersThesis'
|
||||||
|
DOCS_DIR = os.path.join(BASE_DIR, 'docs')
|
||||||
|
OUTPUT_DIR = os.path.join(BASE_DIR, 'thesis_output/figures')
|
||||||
|
MMDC = os.path.join(BASE_DIR, 'node_modules/.bin/mmdc')
|
||||||
|
|
||||||
|
def extract_mermaid_diagrams():
|
||||||
|
"""Extract all mermaid diagrams from markdown files."""
|
||||||
|
diagrams = []
|
||||||
|
|
||||||
|
md_files = [
|
||||||
|
'02_contexto_estado_arte.md',
|
||||||
|
'03_objetivos_metodologia.md',
|
||||||
|
'04_desarrollo_especifico.md',
|
||||||
|
'07_anexo_a.md',
|
||||||
|
]
|
||||||
|
|
||||||
|
for md_file in md_files:
|
||||||
|
filepath = os.path.join(DOCS_DIR, md_file)
|
||||||
|
if not os.path.exists(filepath):
|
||||||
|
continue
|
||||||
|
|
||||||
|
with open(filepath, 'r', encoding='utf-8') as f:
|
||||||
|
content = f.read()
|
||||||
|
|
||||||
|
# Find all mermaid blocks
|
||||||
|
pattern = r'```mermaid\n(.*?)```'
|
||||||
|
matches = re.findall(pattern, content, re.DOTALL)
|
||||||
|
|
||||||
|
for i, mermaid_code in enumerate(matches):
|
||||||
|
# Try to extract title from YAML front matter or inline title
|
||||||
|
title_match = re.search(r'title:\s*["\']([^"\']+)["\']', mermaid_code)
|
||||||
|
if not title_match:
|
||||||
|
title_match = re.search(r'title\s+["\']?([^"\'"\n]+)["\']?', mermaid_code)
|
||||||
|
title = title_match.group(1).strip() if title_match else f"Diagrama {len(diagrams) + 1}"
|
||||||
|
|
||||||
|
diagrams.append({
|
||||||
|
'source': md_file,
|
||||||
|
'code': mermaid_code.strip(),
|
||||||
|
'title': title,
|
||||||
|
'index': len(diagrams) + 1
|
||||||
|
})
|
||||||
|
|
||||||
|
return diagrams
|
||||||
|
|
||||||
|
def convert_to_png(diagrams):
|
||||||
|
"""Convert mermaid diagrams to PNG using mmdc."""
|
||||||
|
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
||||||
|
|
||||||
|
generated = []
|
||||||
|
|
||||||
|
for diagram in diagrams:
|
||||||
|
# Write mermaid code to temp file
|
||||||
|
temp_file = os.path.join(OUTPUT_DIR, f'temp_{diagram["index"]}.mmd')
|
||||||
|
output_file = os.path.join(OUTPUT_DIR, f'figura_{diagram["index"]}.png')
|
||||||
|
|
||||||
|
with open(temp_file, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(diagram['code'])
|
||||||
|
|
||||||
|
# Convert using mmdc with moderate size for page fit
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
[MMDC, '-i', temp_file, '-o', output_file, '-b', 'white', '-w', '800', '-s', '1.5'],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
if os.path.exists(output_file):
|
||||||
|
print(f"✓ Generated: figura_{diagram['index']}.png - {diagram['title']}")
|
||||||
|
generated.append({
|
||||||
|
'file': f'figura_{diagram["index"]}.png',
|
||||||
|
'title': diagram['title'],
|
||||||
|
'index': diagram['index']
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
print(f"✗ Failed: figura_{diagram['index']}.png - {result.stderr}")
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
print(f"✗ Timeout: figura_{diagram['index']}.png")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"✗ Error: figura_{diagram['index']}.png - {e}")
|
||||||
|
|
||||||
|
# Clean up temp file
|
||||||
|
if os.path.exists(temp_file):
|
||||||
|
os.remove(temp_file)
|
||||||
|
|
||||||
|
return generated
|
||||||
|
|
||||||
|
def main():
|
||||||
|
print("Extracting Mermaid diagrams from markdown files...")
|
||||||
|
diagrams = extract_mermaid_diagrams()
|
||||||
|
print(f"Found {len(diagrams)} diagrams\n")
|
||||||
|
|
||||||
|
print("Converting to PNG images...")
|
||||||
|
generated = convert_to_png(diagrams)
|
||||||
|
|
||||||
|
print(f"\n✓ Generated {len(generated)} figures in {OUTPUT_DIR}")
|
||||||
|
|
||||||
|
# Save manifest for apply_content.py to use
|
||||||
|
manifest_file = os.path.join(OUTPUT_DIR, 'figures_manifest.json')
|
||||||
|
with open(manifest_file, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(generated, f, indent=2, ensure_ascii=False)
|
||||||
|
print(f"✓ Saved manifest to {manifest_file}")
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
6075
instructions/plantilla_individual.htm
Normal file
BIN
instructions/plantilla_individual.pdf
Normal file
@@ -0,0 +1,2 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||||
|
<a:clrMap xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" bg1="lt1" tx1="dk1" bg2="lt2" tx2="dk2" accent1="accent1" accent2="accent2" accent3="accent3" accent4="accent4" accent5="accent5" accent6="accent6" hlink="hlink" folHlink="folHlink"/>
|
||||||
21
instructions/plantilla_individual_files/filelist.xml
Normal file
@@ -0,0 +1,21 @@
|
|||||||
|
<xml xmlns:o="urn:schemas-microsoft-com:office:office">
|
||||||
|
<o:MainFile HRef="../plantilla_individual.htm"/>
|
||||||
|
<o:File HRef="item0001.xml"/>
|
||||||
|
<o:File HRef="props002.xml"/>
|
||||||
|
<o:File HRef="item0003.xml"/>
|
||||||
|
<o:File HRef="props004.xml"/>
|
||||||
|
<o:File HRef="item0005.xml"/>
|
||||||
|
<o:File HRef="props006.xml"/>
|
||||||
|
<o:File HRef="item0007.xml"/>
|
||||||
|
<o:File HRef="props008.xml"/>
|
||||||
|
<o:File HRef="themedata.thmx"/>
|
||||||
|
<o:File HRef="colorschememapping.xml"/>
|
||||||
|
<o:File HRef="image001.png"/>
|
||||||
|
<o:File HRef="image002.gif"/>
|
||||||
|
<o:File HRef="image003.png"/>
|
||||||
|
<o:File HRef="image004.jpg"/>
|
||||||
|
<o:File HRef="image005.png"/>
|
||||||
|
<o:File HRef="image006.gif"/>
|
||||||
|
<o:File HRef="header.htm"/>
|
||||||
|
<o:File HRef="filelist.xml"/>
|
||||||
|
</xml>
|
||||||
113
instructions/plantilla_individual_files/header.htm
Normal file
@@ -0,0 +1,113 @@
|
|||||||
|
<html xmlns:v="urn:schemas-microsoft-com:vml"
|
||||||
|
xmlns:o="urn:schemas-microsoft-com:office:office"
|
||||||
|
xmlns:w="urn:schemas-microsoft-com:office:word"
|
||||||
|
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
|
||||||
|
xmlns="http://www.w3.org/TR/REC-html40">
|
||||||
|
|
||||||
|
<head>
|
||||||
|
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
|
||||||
|
<meta name=ProgId content=Word.Document>
|
||||||
|
<meta name=Generator content="Microsoft Word 15">
|
||||||
|
<meta name=Originator content="Microsoft Word 15">
|
||||||
|
<link id=Main-File rel=Main-File href="../plantilla_individual.htm">
|
||||||
|
<!--[if gte mso 9]><xml>
|
||||||
|
<o:shapedefaults v:ext="edit" spidmax="2050"/>
|
||||||
|
</xml><![endif]-->
|
||||||
|
</head>
|
||||||
|
|
||||||
|
<body link="#0563C1" vlink="#954F72">
|
||||||
|
|
||||||
|
<div style='mso-element:footnote-separator' id=fs>
|
||||||
|
|
||||||
|
<p class=MsoNormal><span lang=ES><span style='mso-special-character:footnote-separator'><![if !supportFootnotes]>
|
||||||
|
|
||||||
|
<hr align=left size=1 width="33%">
|
||||||
|
|
||||||
|
<![endif]></span></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:footnote-continuation-separator' id=fcs>
|
||||||
|
|
||||||
|
<p class=MsoNormal><span lang=ES><span style='mso-special-character:footnote-continuation-separator'><![if !supportFootnotes]>
|
||||||
|
|
||||||
|
<hr align=left size=1>
|
||||||
|
|
||||||
|
<![endif]></span></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:endnote-separator' id=es>
|
||||||
|
|
||||||
|
<p class=MsoNormal><span lang=ES><span style='mso-special-character:footnote-separator'><![if !supportFootnotes]>
|
||||||
|
|
||||||
|
<hr align=left size=1 width="33%">
|
||||||
|
|
||||||
|
<![endif]></span></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:endnote-continuation-separator' id=ecs>
|
||||||
|
|
||||||
|
<p class=MsoNormal><span lang=ES><span style='mso-special-character:footnote-continuation-separator'><![if !supportFootnotes]>
|
||||||
|
|
||||||
|
<hr align=left size=1>
|
||||||
|
|
||||||
|
<![endif]></span></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:header' id=eh1>
|
||||||
|
|
||||||
|
<p class=MsoHeader><span lang=ES><o:p> </o:p></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:header' id=h1>
|
||||||
|
|
||||||
|
<p class=MsoHeader align=right style='margin:0cm;text-align:right;line-height:
|
||||||
|
normal'><span lang=ES style='font-size:10.0pt;mso-bidi-font-size:12.0pt;
|
||||||
|
font-family:"Calibri Light",sans-serif;mso-ascii-theme-font:major-latin;
|
||||||
|
mso-hansi-theme-font:major-latin;mso-bidi-font-family:"Times New Roman"'>Sergio
|
||||||
|
Jiménez <span class=SpellE>Jiménez</span><o:p></o:p></span></p>
|
||||||
|
|
||||||
|
<p class=MsoHeader align=right style='margin:0cm;text-align:right;line-height:
|
||||||
|
normal'><span lang=ES style='font-size:10.0pt;mso-bidi-font-size:12.0pt;
|
||||||
|
font-family:"Calibri Light",sans-serif;mso-ascii-theme-font:major-latin;
|
||||||
|
mso-hansi-theme-font:major-latin;mso-bidi-font-family:"Times New Roman"'>Optimización
|
||||||
|
de Hiperparámetros OCR con Ray Tune para Documentos Académicos en <span
|
||||||
|
class=GramE>Español</span><o:p></o:p></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:footer' id=ef1>
|
||||||
|
|
||||||
|
<p class=MsoFooter><span lang=ES><o:p> </o:p></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:footer' id=f1>
|
||||||
|
|
||||||
|
<p class=Pgina><!--[if supportFields]><span lang=ES><span style='mso-element:
|
||||||
|
field-begin'></span>PAGE<span style='mso-spacerun:yes'> </span>\* MERGEFORMAT<span
|
||||||
|
style='mso-element:field-separator'></span></span><![endif]--><span lang=ES><span
|
||||||
|
style='mso-no-proof:yes'>13</span></span><!--[if supportFields]><span lang=ES><span
|
||||||
|
style='mso-element:field-end'></span></span><![endif]--></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:header' id=fh1>
|
||||||
|
|
||||||
|
<p class=MsoHeader><span lang=ES><o:p> </o:p></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div style='mso-element:footer' id=ff1>
|
||||||
|
|
||||||
|
<p class=MsoFooter><span lang=ES><o:p> </o:p></span></p>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</body>
|
||||||
|
|
||||||
|
</html>
|
||||||
BIN
instructions/plantilla_individual_files/image001.png
Normal file
|
After Width: | Height: | Size: 10 KiB |
BIN
instructions/plantilla_individual_files/image002.gif
Normal file
|
After Width: | Height: | Size: 3.9 KiB |
BIN
instructions/plantilla_individual_files/image003.png
Normal file
|
After Width: | Height: | Size: 23 KiB |
BIN
instructions/plantilla_individual_files/image004.jpg
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
instructions/plantilla_individual_files/image005.png
Normal file
|
After Width: | Height: | Size: 13 KiB |
BIN
instructions/plantilla_individual_files/image006.gif
Normal file
|
After Width: | Height: | Size: 25 KiB |
258
instructions/plantilla_individual_files/item0001.xml
Normal file
@@ -0,0 +1,258 @@
|
|||||||
|
<?xml version="1.0" encoding="utf-8"?><ct:contentTypeSchema ct:_="" ma:_="" ma:contentTypeName="Documento" ma:contentTypeID="0x010100DF3D7C797EA12745A270EF30E38719B9" ma:contentTypeVersion="19" ma:contentTypeDescription="Crear nuevo documento." ma:contentTypeScope="" ma:versionID="227b02526234ef39b0b78895a9d90cf5" xmlns:ct="http://schemas.microsoft.com/office/2006/metadata/contentType" xmlns:ma="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes">
|
||||||
|
<xsd:schema targetNamespace="http://schemas.microsoft.com/office/2006/metadata/properties" ma:root="true" ma:fieldsID="3c939c8607e2f594db8bbb23634dd059" ns2:_="" ns3:_="" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:ns2="0a70e875-3d35-4be2-921f-7117c31bab9b" xmlns:ns3="27c1adeb-3674-457c-b08c-8a73f31b6e23">
|
||||||
|
<xsd:import namespace="0a70e875-3d35-4be2-921f-7117c31bab9b"/>
|
||||||
|
<xsd:import namespace="27c1adeb-3674-457c-b08c-8a73f31b6e23"/>
|
||||||
|
<xsd:element name="properties">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="documentManagement">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:all>
|
||||||
|
<xsd:element ref="ns2:SharedWithUsers" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns2:SharedWithDetails" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceMetadata" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceFastMetadata" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceAutoKeyPoints" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceKeyPoints" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceAutoTags" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceOCR" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceGenerationTime" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceEventHashCode" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceDateTaken" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaLengthInSeconds" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceLocation" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:lcf76f155ced4ddcb4097134ff3c332f" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns2:TaxCatchAll" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceSearchProperties" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:_Flow_SignoffStatus" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceObjectDetectorVersions" minOccurs="0"/>
|
||||||
|
</xsd:all>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="0a70e875-3d35-4be2-921f-7117c31bab9b" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
|
||||||
|
<xsd:element name="SharedWithUsers" ma:index="8" nillable="true" ma:displayName="Compartido con" ma:internalName="SharedWithUsers" ma:readOnly="true">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:complexContent>
|
||||||
|
<xsd:extension base="dms:UserMulti">
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="UserInfo" minOccurs="0" maxOccurs="unbounded">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="DisplayName" type="xsd:string" minOccurs="0"/>
|
||||||
|
<xsd:element name="AccountId" type="dms:UserId" minOccurs="0" nillable="true"/>
|
||||||
|
<xsd:element name="AccountType" type="xsd:string" minOccurs="0"/>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:extension>
|
||||||
|
</xsd:complexContent>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="SharedWithDetails" ma:index="9" nillable="true" ma:displayName="Detalles de uso compartido" ma:internalName="SharedWithDetails" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="TaxCatchAll" ma:index="23" nillable="true" ma:displayName="Taxonomy Catch All Column" ma:hidden="true" ma:list="{c7f67346-78c9-4c4d-b954-8d350fdf60db}" ma:internalName="TaxCatchAll" ma:showField="CatchAllData" ma:web="0a70e875-3d35-4be2-921f-7117c31bab9b">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:complexContent>
|
||||||
|
<xsd:extension base="dms:MultiChoiceLookup">
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="Value" type="dms:Lookup" maxOccurs="unbounded" minOccurs="0" nillable="true"/>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:extension>
|
||||||
|
</xsd:complexContent>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="27c1adeb-3674-457c-b08c-8a73f31b6e23" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
|
||||||
|
<xsd:element name="MediaServiceMetadata" ma:index="10" nillable="true" ma:displayName="MediaServiceMetadata" ma:hidden="true" ma:internalName="MediaServiceMetadata" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceFastMetadata" ma:index="11" nillable="true" ma:displayName="MediaServiceFastMetadata" ma:hidden="true" ma:internalName="MediaServiceFastMetadata" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceAutoKeyPoints" ma:index="12" nillable="true" ma:displayName="MediaServiceAutoKeyPoints" ma:hidden="true" ma:internalName="MediaServiceAutoKeyPoints" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceKeyPoints" ma:index="13" nillable="true" ma:displayName="KeyPoints" ma:internalName="MediaServiceKeyPoints" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceAutoTags" ma:index="14" nillable="true" ma:displayName="Tags" ma:internalName="MediaServiceAutoTags" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceOCR" ma:index="15" nillable="true" ma:displayName="Extracted Text" ma:internalName="MediaServiceOCR" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceGenerationTime" ma:index="16" nillable="true" ma:displayName="MediaServiceGenerationTime" ma:hidden="true" ma:internalName="MediaServiceGenerationTime" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceEventHashCode" ma:index="17" nillable="true" ma:displayName="MediaServiceEventHashCode" ma:hidden="true" ma:internalName="MediaServiceEventHashCode" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceDateTaken" ma:index="18" nillable="true" ma:displayName="MediaServiceDateTaken" ma:hidden="true" ma:internalName="MediaServiceDateTaken" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaLengthInSeconds" ma:index="19" nillable="true" ma:displayName="Length (seconds)" ma:internalName="MediaLengthInSeconds" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Unknown"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceLocation" ma:index="20" nillable="true" ma:displayName="Location" ma:internalName="MediaServiceLocation" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="lcf76f155ced4ddcb4097134ff3c332f" ma:index="22" nillable="true" ma:taxonomy="true" ma:internalName="lcf76f155ced4ddcb4097134ff3c332f" ma:taxonomyFieldName="MediaServiceImageTags" ma:displayName="Etiquetas de imagen" ma:readOnly="false" ma:fieldId="{5cf76f15-5ced-4ddc-b409-7134ff3c332f}" ma:taxonomyMulti="true" ma:sspId="17631b59-e624-4eb7-963c-219f14f887a3" ma:termSetId="09814cd3-568e-fe90-9814-8d621ff8fb84" ma:anchorId="fba54fb3-c3e1-fe81-a776-ca4b69148c4d" ma:open="true" ma:isKeyword="false">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element ref="pc:Terms" minOccurs="0" maxOccurs="1"></xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceSearchProperties" ma:index="24" nillable="true" ma:displayName="MediaServiceSearchProperties" ma:hidden="true" ma:internalName="MediaServiceSearchProperties" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="_Flow_SignoffStatus" ma:index="25" nillable="true" ma:displayName="Estado de aprobación" ma:internalName="Estado_x0020_de_x0020_aprobaci_x00f3_n">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceObjectDetectorVersions" ma:index="26" nillable="true" ma:displayName="MediaServiceObjectDetectorVersions" ma:description="" ma:hidden="true" ma:indexed="true" ma:internalName="MediaServiceObjectDetectorVersions" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" elementFormDefault="qualified" attributeFormDefault="unqualified" blockDefault="#all" xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:odoc="http://schemas.microsoft.com/internal/obd">
|
||||||
|
<xsd:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd"/>
|
||||||
|
<xsd:import namespace="http://purl.org/dc/terms/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
|
||||||
|
<xsd:element name="coreProperties" type="CT_coreProperties"/>
|
||||||
|
<xsd:complexType name="CT_coreProperties">
|
||||||
|
<xsd:all>
|
||||||
|
<xsd:element ref="dc:creator" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dc:identifier" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="contentType" minOccurs="0" maxOccurs="1" type="xsd:string" ma:index="0" ma:displayName="Tipo de contenido"/>
|
||||||
|
<xsd:element ref="dc:title" minOccurs="0" maxOccurs="1" ma:index="4" ma:displayName="Título"/>
|
||||||
|
<xsd:element ref="dc:subject" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dc:description" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="keywords" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element ref="dc:language" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="category" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element name="version" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element name="revision" minOccurs="0" maxOccurs="1" type="xsd:string">
|
||||||
|
<xsd:annotation>
|
||||||
|
<xsd:documentation>
|
||||||
|
This value indicates the number of saves or revisions. The application is responsible for updating this value after each revision.
|
||||||
|
</xsd:documentation>
|
||||||
|
</xsd:annotation>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="lastModifiedBy" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element ref="dcterms:modified" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="contentStatus" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
</xsd:all>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:schema>
|
||||||
|
<xs:schema targetNamespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" xmlns:xs="http://www.w3.org/2001/XMLSchema">
|
||||||
|
<xs:element name="Person">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:DisplayName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:AccountId" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:AccountType" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="DisplayName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="AccountId" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="AccountType" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="BDCAssociatedEntity">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:BDCEntity" minOccurs="0" maxOccurs="unbounded"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
<xs:attribute ref="pc:EntityNamespace"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:EntityName"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:SystemInstanceName"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:AssociationName"></xs:attribute>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:attribute name="EntityNamespace" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="EntityName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="SystemInstanceName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="AssociationName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:element name="BDCEntity">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:EntityDisplayName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityInstanceReference" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId1" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId2" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId3" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId4" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId5" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="EntityDisplayName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityInstanceReference" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId1" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId2" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId3" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId4" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId5" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="Terms">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:TermInfo" minOccurs="0" maxOccurs="unbounded"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="TermInfo">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:TermName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:TermId" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="TermName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="TermId" type="xs:string"></xs:element>
|
||||||
|
</xs:schema>
|
||||||
|
</ct:contentTypeSchema>
1
instructions/plantilla_individual_files/item0003.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?><b:Sources SelectedStyle="\APASixthEditionOfficeOnline.xsl" StyleName="APA" Version="6" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"><b:Source><b:Tag>Dor81</b:Tag><b:SourceType>JournalArticle</b:SourceType><b:Guid>{D7C468B5-5E32-4254-9330-6DB2DDB01037}</b:Guid><b:Title>There's a S.M.A.R.T. way to write management's goals and objectives</b:Title><b:Year>1981</b:Year><b:Author><b:Author><b:NameList><b:Person><b:Last>Doran</b:Last><b:First>G.</b:First><b:Middle>T.</b:Middle></b:Person></b:NameList></b:Author></b:Author><b:JournalName>Management Review (AMA FORUM)</b:JournalName><b:Pages>35-36</b:Pages><b:Volume>70</b:Volume><b:RefOrder>1</b:RefOrder></b:Source></b:Sources>
1
instructions/plantilla_individual_files/item0005.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="utf-8"?><p:properties xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"><documentManagement><lcf76f155ced4ddcb4097134ff3c332f xmlns="27c1adeb-3674-457c-b08c-8a73f31b6e23"><Terms xmlns="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"></Terms></lcf76f155ced4ddcb4097134ff3c332f><TaxCatchAll xmlns="0a70e875-3d35-4be2-921f-7117c31bab9b" xsi:nil="true"/><_Flow_SignoffStatus xmlns="27c1adeb-3674-457c-b08c-8a73f31b6e23" xsi:nil="true"/></documentManagement></p:properties>
1
instructions/plantilla_individual_files/item0007.xml
Normal file
@@ -0,0 +1 @@
<?mso-contentType?><FormTemplates xmlns="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"><Display>DocumentLibraryForm</Display><Edit>DocumentLibraryForm</Edit><New>DocumentLibraryForm</New></FormTemplates>
2
instructions/plantilla_individual_files/props002.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{B3A822E2-E694-47D5-9E22-DA4B12671ABB}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/contentType"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes"/><ds:schemaRef ds:uri="http://www.w3.org/2001/XMLSchema"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/documentManagement/types"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"/><ds:schemaRef ds:uri="http://purl.org/dc/elements/1.1/"/><ds:schemaRef ds:uri="http://purl.org/dc/terms/"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/internal/obd"/></ds:schemaRefs></ds:datastoreItem>
2
instructions/plantilla_individual_files/props004.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{3CBD5336-2C2D-4DA8-8EBD-C205328B54AF}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/></ds:schemaRefs></ds:datastoreItem>
2
instructions/plantilla_individual_files/props006.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{DB456AF2-52F5-44D8-AEC6-B5F9D96C377E}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/></ds:schemaRefs></ds:datastoreItem>
2
instructions/plantilla_individual_files/props008.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{BE74C307-52FE-48C3-92C2-E1552852BAAA}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"/></ds:schemaRefs></ds:datastoreItem>
BIN
instructions/plantilla_individual_files/themedata.thmx
Normal file
4127
package-lock.json
generated
Normal file
5
package.json
Normal file
@@ -0,0 +1,5 @@
{
  "dependencies": {
    "@mermaid-js/mermaid-cli": "^11.12.0"
  }
}
45
src/dataset_manager.py
Normal file
@@ -0,0 +1,45 @@
# Imports
import os
from PIL import Image


class ImageTextDataset:
    def __init__(self, root):
        self.samples = []

        for folder in sorted(os.listdir(root)):
            sub = os.path.join(root, folder)
            img_dir = os.path.join(sub, "img")
            txt_dir = os.path.join(sub, "txt")

            if not (os.path.isdir(img_dir) and os.path.isdir(txt_dir)):
                continue

            for fname in sorted(os.listdir(img_dir)):
                if not fname.lower().endswith((".png", ".jpg", ".jpeg")):
                    continue

                img_path = os.path.join(img_dir, fname)

                # text file must have same name but .txt
                txt_name = os.path.splitext(fname)[0] + ".txt"
                txt_path = os.path.join(txt_dir, txt_name)

                if not os.path.exists(txt_path):
                    continue

                self.samples.append((img_path, txt_path))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, txt_path = self.samples[idx]

        # Load image
        image = Image.open(img_path).convert("RGB")

        # Load text
        with open(txt_path, "r", encoding="utf-8") as f:
            text = f.read()

        return image, text
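The loader above pairs each page image with its ground-truth transcription by filename. A minimal usage sketch follows; the `dataset/` root path and page filenames are illustrative assumptions, and the `numpy` conversion mirrors what the evaluation script does before calling PaddleOCR:

```python
# Illustrative only: assumes a dataset/ root laid out as <root>/<doc>/img/page_001.png
# with a matching <root>/<doc>/txt/page_001.txt ground-truth file.
import numpy as np
from dataset_manager import ImageTextDataset

dataset = ImageTextDataset("dataset")
print(len(dataset), "image/ground-truth pairs found")

image, reference_text = dataset[0]   # PIL.Image.Image and its reference transcription
arr = np.array(image)                # PaddleOCR's predict() accepts numpy arrays
print(arr.shape, len(reference_text))
```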
2772
src/paddle_ocr_fine_tune_unir_raytune.ipynb
Normal file
@@ -1,95 +1,16 @@
 # Imports
-import argparse, json, os, sys, time
+import argparse, json, time, re
-from typing import List
 import numpy as np
-from PIL import Image
-import fitz # PyMuPDF
 from paddleocr import PaddleOCR
-import re
 from jiwer import wer, cer
+from dataset_manager import ImageTextDataset
+from itertools import islice

 def export_config(paddleocr_model):
     yaml_path = "paddleocr_pipeline_dump.yaml"
     paddleocr_model.export_paddlex_config_to_yaml(yaml_path)
     print("Exported:", yaml_path)

-def pdf_to_images(pdf_path: str, dpi: int = 300, pages: List[int] = None) -> List[Image.Image]:
-    """
-    Render a PDF into a list of PIL Images using PyMuPDF or pdf2image.
-    'pages' is 1-based (e.g., range(1, 10) -> pages 1–9).
-    """
-    images = []
-
-    if fitz is not None:
-        doc = fitz.open(pdf_path)
-        total_pages = len(doc)
-
-        # Adjust page indices (PyMuPDF uses 0-based indexing)
-        if pages is None:
-            page_indices = list(range(total_pages))
-        else:
-            # Filter out invalid pages and convert to 0-based
-            page_indices = [p - 1 for p in pages if 1 <= p <= total_pages]
-
-        for i in page_indices:
-            page = doc.load_page(i)
-            mat = fitz.Matrix(dpi / 72.0, dpi / 72.0)
-            pix = page.get_pixmap(matrix=mat, alpha=False)
-            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
-
-            images.append(img)
-        doc.close()
-    else:
-        raise RuntimeError("Install PyMuPDF or pdf2image to convert PDFs.")
-
-    return images
-
-
-def pdf_extract_text(pdf_path, page_num, line_tolerance=15) -> str:
-    """
-    Extracts text from a specific PDF page in proper reading order.
-    Adds '\n' when blocks are vertically separated more than line_tolerance.
-    Removes bullet-like characters (, •, ▪, etc.).
-    """
-    doc = fitz.open(pdf_path)
-
-    if page_num < 1 or page_num > len(doc):
-        return ""
-
-    page = doc[page_num - 1]
-    blocks = page.get_text("blocks")  # (x0, y0, x1, y1, text, block_no, block_type)
-
-    # Sort blocks: top-to-bottom, left-to-right
-    blocks_sorted = sorted(blocks, key=lambda b: (b[1], b[0]))
-
-    text_lines = []
-    last_y = None
-
-    for b in blocks_sorted:
-        y0 = b[1]
-        text_block = b[4].strip()
-
-        # Remove bullet-like characters
-        text_block = re.sub(r"[•▪◦●❖▶■]", "", text_block)
-
-        # If new line (based on vertical gap)
-        if last_y is not None and abs(y0 - last_y) > line_tolerance:
-            text_lines.append("")  # blank line for spacing
-
-        text_lines.append(text_block.strip())
-        last_y = y0
-
-    # Join all lines with real newlines
-    text = "\n".join(text_lines)
-
-    # Normalize spaces
-    text = re.sub(r"\s*\n\s*", "\n", text).strip()  # remove spaces around newlines
-    text = re.sub(r" +", " ", text).strip()  # collapse multiple spaces to one
-    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # avoid triple blank lines
-
-    doc.close()
-    return text

 def evaluate_text(reference, prediction):
     return {'WER': wer(reference, prediction), 'CER': cer(reference, prediction)}

@@ -189,18 +110,25 @@ def assemble_from_paddle_result(paddleocr_predict, min_score=0.0, line_tol_facto

 def main():
     parser = argparse.ArgumentParser()
+    # dataset root folder
     parser.add_argument("--pdf-folder", required=True)
-    parser.add_argument("--dpi", type=int, default=300)
+    #Whether to use document image orientation classification.
+    parser.add_argument("--use-doc-orientation-classify", type=lambda s: s.lower()=="true", default=False)
+    # Whether to use text image unwarping.
+    parser.add_argument("--use-doc-unwarping", type=lambda s: s.lower()=="true", default=False)
+    # Whether to use text line orientation classification.
     parser.add_argument("--textline-orientation", type=lambda s: s.lower()=="true", default=True)
-    parser.add_argument("--text-det-box-thresh", type=float, default=0.6)
+    # Detection pixel threshold for the text detection model. Pixels with scores greater than this threshold in the output probability map are considered text pixels.
+    parser.add_argument("--text-det-thresh", type=float, default=0.0)
+    # Detection box threshold for the text detection model. A detection result is considered a text region if the average score of all pixels within the border of the result is greater than this threshold.
+    parser.add_argument("--text-det-box-thresh", type=float, default=0.0)
+    # Text detection expansion coefficient, which expands the text region using this method. The larger the value, the larger the expansion area.
     parser.add_argument("--text-det-unclip-ratio", type=float, default=1.5)
+    # Text recognition threshold. Text results with scores greater than this threshold are retained.
     parser.add_argument("--text-rec-score-thresh", type=float, default=0.0)
-    parser.add_argument("--line-tolerance", type=float, default=0.6)
+    # text location
-    parser.add_argument("--min-box-score", type=float, default=0.0)
-    parser.add_argument("--pages-per-pdf", type=int, default=2)
     parser.add_argument("--lang", default="es")
     args = parser.parse_args()

@@ -212,31 +140,29 @@ def main():
         lang=args.lang,
     )

+    dataset = ImageTextDataset(args.pdf_folder)
     cer_list, wer_list = [], []
     time_per_page_list = []
     t0 = time.time()

-    for fname in os.listdir(args.pdf_folder):
-        if not fname.lower().endswith(".pdf"):
-            continue
-        pdf_path = os.path.join(args.pdf_folder, fname)
-        images = pdf_to_images(pdf_path, dpi=args.dpi, pages=range(1, args.pages_per_pdf+1))
-        for i, img in enumerate(images):
-            ref = pdf_extract_text(pdf_path, i+1)
-            arr = np.array(img)
-            tp0 = time.time()
-            out = ocr.predict(
-                arr,
-                text_det_box_thresh=args.text_det_box_thresh,
-                text_det_unclip_ratio=args.text_det_unclip_ratio,
-                text_rec_score_thresh=args.text_rec_score_thresh,
-                use_textline_orientation=args.textline_orientation
-            )
-            pred = assemble_from_paddle_result(out, args.min_box_score, args.line_tolerance)
-            time_per_page_list.append(float(time.time() - tp0))
-            m = evaluate_text(ref, pred)
-            cer_list.append(m["CER"])
-            wer_list.append(m["WER"])
+    for img, ref in islice(dataset, 5, 10):
+        arr = np.array(img)
+        tp0 = time.time()
+        out = ocr.predict(
+            arr,
+            use_doc_orientation_classify=args.use_doc_orientation_classify,
+            use_doc_unwarping=args.use_doc_unwarping,
+            use_textline_orientation=args.textline_orientation, #str2bool Whether to use text line orientation classification.
+            text_det_thresh=args.text_det_thresh,
+            text_det_box_thresh=args.text_det_box_thresh,
+            text_det_unclip_ratio=args.text_det_unclip_ratio,
+            text_rec_score_thresh=args.text_rec_score_thresh
+        )
+        pred = assemble_from_paddle_result(out)
+        time_per_page_list.append(float(time.time() - tp0))
+        m = evaluate_text(ref, pred)
+        cer_list.append(m["CER"])
+        wer_list.append(m["WER"])

     metrics = {
         "CER": float(np.mean(cer_list) if cer_list else 1.0),
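The refactored loop above evaluates one fixed PaddleOCR configuration over a slice of the dataset and averages CER/WER with jiwer. The sketch below shows how that same loop could be wrapped as a Ray Tune trainable driven by an Optuna search algorithm, which is the optimization setup this project targets. It is a minimal sketch, not the project's actual tuning script: it assumes `assemble_from_paddle_result` and `evaluate_text` are available in the same notebook scope as above, the `dataset` path and the search ranges are illustrative, and the final metric is returned as the trial result (Ray's report API could be used instead).

```python
import numpy as np
from itertools import islice
from paddleocr import PaddleOCR
from ray import tune
from ray.tune.search.optuna import OptunaSearch

from dataset_manager import ImageTextDataset


def evaluate_trial(config):
    # One trial = one PaddleOCR hyperparameter configuration scored on a small slice.
    ocr = PaddleOCR(lang="es")
    dataset = ImageTextDataset("dataset")          # illustrative dataset root
    cers = []
    for img, ref in islice(dataset, 5, 10):        # same validation slice as the loop above
        out = ocr.predict(
            np.array(img),
            text_det_thresh=config["text_det_thresh"],
            text_det_box_thresh=config["text_det_box_thresh"],
            text_det_unclip_ratio=config["text_det_unclip_ratio"],
            text_rec_score_thresh=config["text_rec_score_thresh"],
        )
        pred = assemble_from_paddle_result(out)    # helper defined earlier in this notebook
        cers.append(evaluate_text(ref, pred)["CER"])
    return {"CER": float(np.mean(cers))}           # final result reported to Ray Tune


# Illustrative search ranges for the detection/recognition thresholds exposed above.
search_space = {
    "text_det_thresh": tune.uniform(0.0, 0.6),
    "text_det_box_thresh": tune.uniform(0.3, 0.8),
    "text_det_unclip_ratio": tune.uniform(1.0, 3.0),
    "text_rec_score_thresh": tune.uniform(0.0, 0.9),
}

tuner = tune.Tuner(
    evaluate_trial,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="CER", mode="min", search_alg=OptunaSearch(), num_samples=30
    ),
)
best = tuner.fit().get_best_result()
print(best.config, best.metrics["CER"])
```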
506
src/prepare_dataset.ipynb
Normal file
@@ -0,0 +1,506 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "93809ffc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
]
    }
   ],
   "source": [
    "%pip install --upgrade pip\n",
    "%pip install --upgrade jupyter\n",
    "%pip install --upgrade ipywidgets\n",
    "%pip install --upgrade ipykernel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "48724594",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
|
||||||
|
"Requirement already satisfied: pdf2image in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (1.17.0)\n",
|
||||||
|
"Requirement already satisfied: pillow in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (12.0.0)\n",
|
||||||
|
"Note: you may need to restart the kernel to use updated packages.\n",
|
||||||
|
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
|
||||||
|
"Requirement already satisfied: PyMuPDF in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (1.26.6)\n",
|
||||||
|
"Note: you may need to restart the kernel to use updated packages.\n",
|
||||||
|
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
|
||||||
|
"Requirement already satisfied: pandas in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (2.3.3)\n",
|
||||||
|
"Requirement already satisfied: numpy>=1.23.2 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas) (2.3.4)\n",
|
||||||
|
"Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas) (2.9.0.post0)\n",
|
||||||
|
"Requirement already satisfied: pytz>=2020.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas) (2025.2)\n",
|
||||||
|
"Requirement already satisfied: tzdata>=2022.7 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas) (2025.2)\n",
|
||||||
|
"Requirement already satisfied: six>=1.5 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
|
||||||
|
"Note: you may need to restart the kernel to use updated packages.\n",
|
||||||
|
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
|
||||||
|
"Requirement already satisfied: matplotlib in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (3.10.7)\n",
|
||||||
|
"Requirement already satisfied: contourpy>=1.0.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (1.3.3)\n",
|
||||||
|
"Requirement already satisfied: cycler>=0.10 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (0.12.1)\n",
|
||||||
|
"Requirement already satisfied: fonttools>=4.22.0 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (4.60.1)\n",
|
||||||
|
"Requirement already satisfied: kiwisolver>=1.3.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (1.4.9)\n",
|
||||||
|
"Requirement already satisfied: numpy>=1.23 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (2.3.4)\n",
|
||||||
|
"Requirement already satisfied: packaging>=20.0 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (25.0)\n",
|
||||||
|
"Requirement already satisfied: pillow>=8 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (12.0.0)\n",
|
||||||
|
"Requirement already satisfied: pyparsing>=3 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (3.2.5)\n",
|
||||||
|
"Requirement already satisfied: python-dateutil>=2.7 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib) (2.9.0.post0)\n",
|
||||||
|
"Requirement already satisfied: six>=1.5 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)\n",
|
||||||
|
"Note: you may need to restart the kernel to use updated packages.\n",
|
||||||
|
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
|
||||||
|
"Requirement already satisfied: seaborn in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (0.13.2)\n",
|
||||||
|
"Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from seaborn) (2.3.4)\n",
|
||||||
|
"Requirement already satisfied: pandas>=1.2 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from seaborn) (2.3.3)\n",
|
||||||
|
"Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from seaborn) (3.10.7)\n",
|
||||||
|
"Requirement already satisfied: contourpy>=1.0.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.3)\n",
|
||||||
|
"Requirement already satisfied: cycler>=0.10 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n",
|
||||||
|
"Requirement already satisfied: fonttools>=4.22.0 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.60.1)\n",
|
||||||
|
"Requirement already satisfied: kiwisolver>=1.3.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.9)\n",
|
||||||
|
"Requirement already satisfied: packaging>=20.0 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (25.0)\n",
|
||||||
|
"Requirement already satisfied: pillow>=8 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (12.0.0)\n",
|
||||||
|
"Requirement already satisfied: pyparsing>=3 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.5)\n",
|
||||||
|
"Requirement already satisfied: python-dateutil>=2.7 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)\n",
|
||||||
|
"Requirement already satisfied: pytz>=2020.1 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas>=1.2->seaborn) (2025.2)\n",
|
||||||
|
"Requirement already satisfied: tzdata>=2022.7 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from pandas>=1.2->seaborn) (2025.2)\n",
|
||||||
|
"Requirement already satisfied: six>=1.5 in c:\\users\\sji\\desktop\\mastersthesis\\.venv\\lib\\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)\n",
|
||||||
|
"Note: you may need to restart the kernel to use updated packages.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Install necessary packages\n",
|
||||||
|
"%pip install pdf2image pillow \n",
|
||||||
|
"# pdf reading\n",
|
||||||
|
"%pip install PyMuPDF\n",
|
||||||
|
"\n",
|
||||||
|
"# Data analysis and visualization\n",
|
||||||
|
"%pip install pandas\n",
|
||||||
|
"%pip install matplotlib\n",
|
||||||
|
"%pip install seaborn"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 3,
|
||||||
|
"id": "e1f793b6",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import os, json\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"from pdf2image import convert_from_path\n",
|
||||||
|
"from PIL import Image, ImageOps\n",
|
||||||
|
"import fitz # PyMuPDF\n",
|
||||||
|
"import re\n",
|
||||||
|
"from datetime import datetime\n",
|
||||||
|
"from typing import List\n",
|
||||||
|
"import shutil"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 4,
|
||||||
|
"id": "1652a78e",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def pdf_to_images(pdf_path: str, output_dir: str, dpi: int = 300):\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" Render a PDF into a list of PIL Images using PyMuPDF or pdf2image.\n",
|
||||||
|
" 'pages' is 1-based (e.g., range(1, 10) -> pages 1–9).\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" if fitz is not None:\n",
|
||||||
|
" doc = fitz.open(pdf_path)\n",
|
||||||
|
" total_pages = len(doc)\n",
|
||||||
|
"\n",
|
||||||
|
" # Adjust page indices (PyMuPDF uses 0-based indexing)\n",
|
||||||
|
" page_indices = list(range(total_pages))\n",
|
||||||
|
"\n",
|
||||||
|
" for i in page_indices:\n",
|
||||||
|
" page = doc.load_page(i)\n",
|
||||||
|
" mat = fitz.Matrix(dpi / 72.0, dpi / 72.0)\n",
|
||||||
|
" pix = page.get_pixmap(matrix=mat, alpha=False)\n",
|
||||||
|
" img = Image.frombytes(\"RGB\", [pix.width, pix.height], pix.samples)\n",
|
||||||
|
" # Build filename\n",
|
||||||
|
" out_path = os.path.join(\n",
|
||||||
|
" output_dir,\n",
|
||||||
|
" f\"page_{i + 1:04d}.png\"\n",
|
||||||
|
" )\n",
|
||||||
|
"\n",
|
||||||
|
" img.save(out_path, \"PNG\")\n",
|
||||||
|
" doc.close()\n",
|
||||||
|
" else:\n",
|
||||||
|
" raise RuntimeError(\"Install PyMuPDF or pdf2image to convert PDFs.\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 5,
|
||||||
|
"id": "f523dd58",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import fitz\n",
|
||||||
|
"import re\n",
|
||||||
|
"import os\n",
|
||||||
|
"\n",
|
||||||
|
"def _pdf_extract_text_structured(page, margin_threshold=50):\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" Extract text using PyMuPDF's dict mode which preserves\n",
|
||||||
|
" the actual line structure from the PDF.\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" data = page.get_text(\"dict\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Collect all lines with their Y position\n",
|
||||||
|
" all_lines = []\n",
|
||||||
|
" margin_text_parts = [] # Collect vertical/margin text\n",
|
||||||
|
" margin_y_positions = []\n",
|
||||||
|
" \n",
|
||||||
|
" for block in data.get(\"blocks\", []):\n",
|
||||||
|
" if block.get(\"type\") != 0: # Skip non-text blocks\n",
|
||||||
|
" continue\n",
|
||||||
|
" \n",
|
||||||
|
" block_bbox = block.get(\"bbox\", (0, 0, 0, 0))\n",
|
||||||
|
" block_width = block_bbox[2] - block_bbox[0]\n",
|
||||||
|
" block_height = block_bbox[3] - block_bbox[1]\n",
|
||||||
|
" \n",
|
||||||
|
" # Detect vertical/margin text\n",
|
||||||
|
" is_margin_text = (block_bbox[0] < margin_threshold or \n",
|
||||||
|
" block_height > block_width * 2)\n",
|
||||||
|
" \n",
|
||||||
|
" for line in block.get(\"lines\", []):\n",
|
||||||
|
" direction = line.get(\"dir\", (1, 0))\n",
|
||||||
|
" bbox = line.get(\"bbox\", (0, 0, 0, 0))\n",
|
||||||
|
" y_center = (bbox[1] + bbox[3]) / 2\n",
|
||||||
|
" x_start = bbox[0]\n",
|
||||||
|
" \n",
|
||||||
|
" # Collect text from all spans\n",
|
||||||
|
" line_text = \"\"\n",
|
||||||
|
" for span in line.get(\"spans\", []):\n",
|
||||||
|
" text = span.get(\"text\", \"\")\n",
|
||||||
|
" line_text += text\n",
|
||||||
|
" \n",
|
||||||
|
" line_text = line_text.strip()\n",
|
||||||
|
" line_text = re.sub(r\"[•▪◦●❖▶■\\uf000-\\uf0ff]\", \"\", line_text)\n",
|
||||||
|
" \n",
|
||||||
|
" if not line_text:\n",
|
||||||
|
" continue\n",
|
||||||
|
" \n",
|
||||||
|
" # Check if this is margin/vertical text\n",
|
||||||
|
" if is_margin_text or abs(direction[0]) < 0.9:\n",
|
||||||
|
" margin_text_parts.append((y_center, line_text))\n",
|
||||||
|
" margin_y_positions.append(y_center)\n",
|
||||||
|
" else:\n",
|
||||||
|
" all_lines.append((y_center, x_start, line_text))\n",
|
||||||
|
" \n",
|
||||||
|
" # Reconstruct margin text as single line at its vertical center\n",
|
||||||
|
" if margin_text_parts:\n",
|
||||||
|
" # Sort by Y position (top to bottom) and join\n",
|
||||||
|
" margin_text_parts.sort(key=lambda x: x[0])\n",
|
||||||
|
" full_margin_text = \" \".join(part[1] for part in margin_text_parts)\n",
|
||||||
|
" # Calculate vertical center of the watermark\n",
|
||||||
|
" avg_y = sum(margin_y_positions) / len(margin_y_positions)\n",
|
||||||
|
" # Add as a single line\n",
|
||||||
|
" all_lines.append((avg_y, -1, full_margin_text)) # x=-1 to sort first\n",
|
||||||
|
" \n",
|
||||||
|
" if not all_lines:\n",
|
||||||
|
" return \"\"\n",
|
||||||
|
" \n",
|
||||||
|
" # Sort by Y first, then by X\n",
|
||||||
|
" all_lines.sort(key=lambda x: (x[0], x[1]))\n",
|
||||||
|
" \n",
|
||||||
|
" # Group lines at same vertical position\n",
|
||||||
|
" merged_rows = []\n",
|
||||||
|
" current_row = [all_lines[0]]\n",
|
||||||
|
" current_y = all_lines[0][0]\n",
|
||||||
|
" \n",
|
||||||
|
" for y_center, x_start, text in all_lines[1:]:\n",
|
||||||
|
" if abs(y_center - current_y) <= 2:\n",
|
||||||
|
" current_row.append((y_center, x_start, text))\n",
|
||||||
|
" else:\n",
|
||||||
|
" current_row.sort(key=lambda x: x[1])\n",
|
||||||
|
" row_text = \" \".join(item[2] for item in current_row)\n",
|
||||||
|
" merged_rows.append((current_y, row_text))\n",
|
||||||
|
" current_row = [(y_center, x_start, text)]\n",
|
||||||
|
" current_y = y_center\n",
|
||||||
|
" \n",
|
||||||
|
" if current_row:\n",
|
||||||
|
" current_row.sort(key=lambda x: x[1])\n",
|
||||||
|
" row_text = \" \".join(item[2] for item in current_row)\n",
|
||||||
|
" merged_rows.append((current_y, row_text))\n",
|
||||||
|
" \n",
|
||||||
|
" # Sort rows by Y and extract text\n",
|
||||||
|
" merged_rows.sort(key=lambda x: x[0])\n",
|
||||||
|
" lines = [row[1] for row in merged_rows]\n",
|
||||||
|
" \n",
|
||||||
|
" # Join and clean up\n",
|
||||||
|
" text = \"\\n\".join(lines)\n",
|
||||||
|
" text = re.sub(r\" +\", \" \", text).strip()\n",
|
||||||
|
" text = re.sub(r\"\\n{3,}\", \"\\n\\n\", text).strip()\n",
|
||||||
|
" \n",
|
||||||
|
" return text\n",
|
||||||
|
"\n",
|
||||||
|
"def pdf_extract_text(pdf_path, output_dir, margin_threshold=50):\n",
|
||||||
|
" os.makedirs(output_dir, exist_ok=True)\n",
|
||||||
|
" doc = fitz.open(pdf_path)\n",
|
||||||
|
" \n",
|
||||||
|
" for i, page in enumerate(doc):\n",
|
||||||
|
" text = _pdf_extract_text_structured(page, margin_threshold)\n",
|
||||||
|
" if not text.strip():\n",
|
||||||
|
" continue\n",
|
||||||
|
" out_path = os.path.join(output_dir, f\"page_{i + 1:04d}.txt\")\n",
|
||||||
|
" with open(out_path, \"w\", encoding=\"utf-8\") as f:\n",
|
||||||
|
" f.write(text)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 6,
|
||||||
|
"id": "9f64a8c0",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"PDF_FOLDER = './../instructions' # Folder containing PDF files\n",
|
||||||
|
"OUTPUT_FOLDER = './dataset'\n",
|
||||||
|
"\n",
|
||||||
|
"os.makedirs(OUTPUT_FOLDER, exist_ok=True)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 7,
|
||||||
|
"id": "41e4651d",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"i = 0\n",
|
||||||
|
"\n",
|
||||||
|
"pdf_files = sorted([\n",
|
||||||
|
" fname for fname in os.listdir(PDF_FOLDER)\n",
|
||||||
|
" if fname.lower().endswith(\".pdf\")\n",
|
||||||
|
"])\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"for fname in pdf_files:\n",
|
||||||
|
" # build output directories\n",
|
||||||
|
" out_img_path = os.path.join(OUTPUT_FOLDER, str(i), \"img\")\n",
|
||||||
|
" out_txt_path = os.path.join(OUTPUT_FOLDER, str(i), \"txt\")\n",
|
||||||
|
"\n",
|
||||||
|
" os.makedirs(out_img_path, exist_ok=True)\n",
|
||||||
|
" os.makedirs(out_txt_path, exist_ok=True)\n",
|
||||||
|
"\n",
|
||||||
|
" # source and destination PDF paths\n",
|
||||||
|
" src_pdf = os.path.join(PDF_FOLDER, fname)\n",
|
||||||
|
" pdf_path = os.path.join(OUTPUT_FOLDER, str(i), fname)\n",
|
||||||
|
"\n",
|
||||||
|
" # copy PDF into numbered folder\n",
|
||||||
|
" shutil.copy(src_pdf, pdf_path)\n",
|
||||||
|
"\n",
|
||||||
|
" # convert PDF → images\n",
|
||||||
|
" pdf_to_images(\n",
|
||||||
|
" pdf_path=pdf_path,\n",
|
||||||
|
" output_dir=out_img_path,\n",
|
||||||
|
" dpi=300\n",
|
||||||
|
" )\n",
|
||||||
|
" pdf_extract_text(\n",
|
||||||
|
" pdf_path=pdf_path,\n",
|
||||||
|
" output_dir=out_txt_path,\n",
|
||||||
|
" margin_threshold=40\n",
|
||||||
|
" )\n",
|
||||||
|
"\n",
|
||||||
|
" i += 1"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": ".venv (3.11.9)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.11.9"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
65
src/raytune_paddle_subproc_results_20251207_192320.csv
Normal file
@@ -0,0 +1,65 @@
|
|||||||
|
CER,WER,TIME,PAGES,TIME_PER_PAGE,timestamp,checkpoint_dir_name,done,training_iteration,trial_id,date,time_this_iter_s,time_total_s,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/use_doc_orientation_classify,config/use_doc_unwarping,config/textline_orientation,config/text_det_thresh,config/text_det_box_thresh,config/text_det_unclip_ratio,config/text_rec_score_thresh,logdir
|
||||||
|
0.013515850203159258,0.1050034776034098,353.85077571868896,5,70.66230463981628,1765120215,,False,1,d5238c33,2025-12-07_16-10-15,374.27777338027954,374.27777338027954,19452,LAPTOP-2OQK6GT5,127.0.0.1,374.27777338027954,1,True,False,True,0.08878208965533294,0.623029468177504,0.0,0.22994386685874743,d5238c33
|
||||||
|
0.03905195479212187,0.13208645252197226,354.61478638648987,5,70.82208666801452,1765120220,,False,1,ea8a2f7a,2025-12-07_16-10-20,374.2999520301819,374.2999520301819,7472,LAPTOP-2OQK6GT5,127.0.0.1,374.2999520301819,1,False,False,False,0.39320080607112917,0.6712014538998344,0.0,0.16880221913810864,ea8a2f7a
|
||||||
|
0.06606238373546518,0.16619192810354325,359.09717535972595,5,71.72569246292115,1765120601,,False,1,ebb12e5b,2025-12-07_16-16-41,379.5437698364258,379.5437698364258,21480,LAPTOP-2OQK6GT5,127.0.0.1,379.5437698364258,1,True,True,True,0.4328784710891528,0.23572507118228522,0.0,0.18443532434104057,ebb12e5b
|
||||||
|
0.41810946199338,0.5037103242611287,336.6613118648529,5,67.22685413360595,1765120583,,False,1,b3775034,2025-12-07_16-16-23,356.52618169784546,356.52618169784546,23084,LAPTOP-2OQK6GT5,127.0.0.1,356.52618169784546,1,True,True,False,0.06412882230680782,0.3377439247010605,0.0,0.5764053439963283,b3775034
|
||||||
|
0.1972515944870667,0.2953531713611584,350.1465151309967,5,69.93639450073242,1765120959,,False,1,bf10d370,2025-12-07_16-22-39,370.90337228775024,370.90337228775024,26140,LAPTOP-2OQK6GT5,127.0.0.1,370.90337228775024,1,True,True,True,0.6719551054359146,0.6902317374774642,0.0,0.3964896632708511,bf10d370
|
||||||
|
0.3864103728596727,0.45583610828383464,320.96620512008667,5,64.09520988464355,1765120947,,False,1,111e5a9e,2025-12-07_16-22-27,341.0712642669678,341.0712642669678,20664,LAPTOP-2OQK6GT5,127.0.0.1,341.0712642669678,1,True,False,False,0.04481600265034593,0.4832664381621284,0.0,0.5464155154391461,111e5a9e
|
||||||
|
0.5160689446919982,0.5945298276300801,326.65670347213745,5,65.2350733757019,1765121300,,False,1,415d7ba1,2025-12-07_16-28-20,347.29887080192566,347.29887080192566,23848,LAPTOP-2OQK6GT5,127.0.0.1,347.29887080192566,1,True,True,True,0.01699705273201909,0.5233849789194689,0.0,0.20833106578160068,415d7ba1
|
||||||
|
0.5025130639131208,0.5677161936883898,326.9156484603882,5,65.28343558311462,1765121310,,False,1,a58d8109,2025-12-07_16-28-30,346.09022212028503,346.09022212028503,25248,LAPTOP-2OQK6GT5,127.0.0.1,346.09022212028503,1,False,True,True,0.04024319071476844,0.6705892008057031,0.0,0.1885847677314521,a58d8109
|
||||||
|
0.07092029393242118,0.17390976502682037,368.5711796283722,5,73.62503981590271,1765121692,,False,1,33bdf2a9,2025-12-07_16-34-52,388.150607585907,388.150607585907,24024,LAPTOP-2OQK6GT5,127.0.0.1,388.150607585907,1,False,True,False,0.4347371576992484,0.490009080993297,0.0,0.1519055407457635,33bdf2a9
|
||||||
|
0.1168252568583151,0.22212978798067146,364.6228621006012,5,72.82479510307311,1765121699,,False,1,d9df79f3,2025-12-07_16-34-59,384.67676973342896,384.67676973342896,5368,LAPTOP-2OQK6GT5,127.0.0.1,384.67676973342896,1,True,True,False,0.17806350429159667,0.6261942434824851,0.0,0.38547742746319813,d9df79f3
|
||||||
|
0.06459478599489028,0.16493742503085831,366.6067085266113,5,73.22199411392212,1765122086,,False,1,80ea65f2,2025-12-07_16-41-26,387.6792531013489,387.6792531013489,14064,LAPTOP-2OQK6GT5,127.0.0.1,387.6792531013489,1,True,True,False,0.6011116675422127,0.25138233186284487,0.0,0.31312371671514233,80ea65f2
|
||||||
|
0.01340057642794312,0.10741926673961485,359.5969452857971,5,71.80434017181396,1765122084,,False,1,2e978bfa,2025-12-07_16-41-24,380.28105759620667,380.28105759620667,11060,LAPTOP-2OQK6GT5,127.0.0.1,380.28105759620667,1,False,False,True,0.23485911670668447,0.07773192307960775,0.0,0.023694797982285992,2e978bfa
|
||||||
|
0.01340057642794312,0.10741926673961485,347.92934703826904,5,69.49003491401672,1765122459,,False,1,8518cc40,2025-12-07_16-47-39,368.54625153541565,368.54625153541565,21016,LAPTOP-2OQK6GT5,127.0.0.1,368.54625153541565,1,False,False,True,0.2225556801158737,0.00024186765038358704,0.0,0.0028910785387807336,8518cc40
|
||||||
|
0.01340057642794312,0.10741926673961485,347.14498376846313,5,69.324178647995,1765122461,,False,1,2c691aaa,2025-12-07_16-47-41,366.3459825515747,366.3459825515747,21540,LAPTOP-2OQK6GT5,127.0.0.1,366.3459825515747,1,False,False,True,0.22472742766369874,0.030333356491349384,0.0,0.05099688981312009,2c691aaa
|
||||||
|
0.013040374955575204,0.10485434443992256,347.22006940841675,5,69.34554209709168,1765122832,,False,1,31e60691,2025-12-07_16-53-52,368.0382122993469,368.0382122993469,17532,LAPTOP-2OQK6GT5,127.0.0.1,368.0382122993469,1,False,False,True,0.25914070057597594,0.0019604082489898533,0.0,0.0035094431353713818,31e60691
|
||||||
|
0.012582941415352794,0.10327954129031627,349.2319846153259,5,69.74626359939575,1765122837,,False,1,d4d288c6,2025-12-07_16-53-57,368.903502702713,368.903502702713,22216,LAPTOP-2OQK6GT5,127.0.0.1,368.903502702713,1,False,False,True,0.2734075225731028,0.0033989235904911125,0.0,0.015420451500634869,d4d288c6
|
||||||
|
0.012582941415352794,0.10327954129031627,346.6979134082794,5,69.24065437316895,1765123205,,False,1,7645b77c,2025-12-07_17-00-05,367.4564206600189,367.4564206600189,2272,LAPTOP-2OQK6GT5,127.0.0.1,367.4564206600189,1,False,False,True,0.279241869770728,0.1138413707810162,0.0,0.07531508117874008,7645b77c
|
||||||
|
0.012407575745987933,0.10201566081383735,346.5196530818939,5,69.19977960586547,1765123208,,False,1,3256ae36,2025-12-07_17-00-08,366.00227642059326,366.00227642059326,6604,LAPTOP-2OQK6GT5,127.0.0.1,366.00227642059326,1,False,False,True,0.30993017979826853,0.1292131176570399,0.0,0.11201957956206357,3256ae36
|
||||||
|
0.012407575745987933,0.10201566081383735,344.0291979312897,5,68.71350336074829,1765123575,,False,1,b0dda58b,2025-12-07_17-06-15,364.82790350914,364.82790350914,9732,LAPTOP-2OQK6GT5,127.0.0.1,364.82790350914,1,False,False,True,0.3149521989502957,0.11783753596277924,0.0,0.6825729339913746,b0dda58b
|
||||||
|
0.012429753445092291,0.10205118268939237,346.11818265914917,5,69.12530856132507,1765123581,,False,1,e9d40333,2025-12-07_17-06-21,365.62638425827026,365.62638425827026,23416,LAPTOP-2OQK6GT5,127.0.0.1,365.62638425827026,1,False,False,True,0.5302520310849914,0.1569390945373281,0.0,0.10019443545563994,e9d40333
|
||||||
|
0.011990675508758594,0.10047637953978608,346.5398359298706,5,69.2183114528656,1765123948,,False,1,aa89fe7a,2025-12-07_17-12-28,366.7530257701874,366.7530257701874,16200,LAPTOP-2OQK6GT5,127.0.0.1,366.7530257701874,1,False,False,True,0.5039700850900125,0.16208277029791282,0.0,0.6765386284546205,aa89fe7a
|
||||||
|
0.011968497809654236,0.10044085766423105,345.97880601882935,5,69.09321279525757,1765123951,,False,1,92c48d07,2025-12-07_17-12-31,365.0942301750183,365.0942301750183,15432,LAPTOP-2OQK6GT5,127.0.0.1,365.0942301750183,1,False,False,True,0.33321916406589397,0.1864428656555301,0.0,0.6775297319325386,92c48d07
|
||||||
|
0.011968497809654236,0.10044085766423105,344.1725525856018,5,68.74226913452148,1765124318,,False,1,187790d7,2025-12-07_17-18-38,364.47401189804077,364.47401189804077,24676,LAPTOP-2OQK6GT5,127.0.0.1,364.47401189804077,1,False,False,True,0.3372505528404193,0.2352515935896671,0.0,0.6987321324340134,187790d7
|
||||||
|
0.011760127958326316,0.09964993325879434,345.9427492618561,5,69.08389501571655,1765124322,,False,1,442a2439,2025-12-07_17-18-42,364.755074262619,364.755074262619,7892,LAPTOP-2OQK6GT5,127.0.0.1,364.755074262619,1,False,False,True,0.5098036701758629,0.2122757290966333,0.0,0.6992468303721803,442a2439
|
||||||
|
0.011968497809654236,0.10044085766423105,345.40264558792114,5,68.98561010360717,1765124689,,False,1,70862adc,2025-12-07_17-24-49,365.9752175807953,365.9752175807953,15412,LAPTOP-2OQK6GT5,127.0.0.1,365.9752175807953,1,False,False,True,0.3963969237347287,0.2163058925653838,0.0,0.6859176720785957,70862adc
|
||||||
|
0.012407575745987933,0.10201566081383735,345.8808228969574,5,69.07736506462098,1765124693,,False,1,e6821f34,2025-12-07_17-24-53,365.25493717193604,365.25493717193604,26088,LAPTOP-2OQK6GT5,127.0.0.1,365.25493717193604,1,False,False,True,0.3668982772069688,0.2407751620351906,0.0,0.5737620270733486,e6821f34
|
||||||
|
0.012199205894660016,0.10122473640840064,347.05629682540894,5,69.31870231628417,1765125062,,False,1,8b680875,2025-12-07_17-31-02,367.2029130458832,367.2029130458832,1720,LAPTOP-2OQK6GT5,127.0.0.1,367.2029130458832,1,False,False,True,0.5312495877753942,0.3193426688929859,0.0,0.591252589724218,8b680875
|
||||||
|
0.012429753445092291,0.10205118268939237,349.60691928863525,5,69.8253363609314,1765125068,,False,1,fc54867b,2025-12-07_17-31-08,368.73608803749084,368.73608803749084,4888,LAPTOP-2OQK6GT5,127.0.0.1,368.73608803749084,1,False,False,True,0.5034080657304706,0.3042864908472832,0.0,0.5024906014323391,fc54867b
|
||||||
|
0.013385453418768206,0.10927323740570172,343.8553657531738,5,68.67559289932251,1765125432,,False,1,c32d0d5e,2025-12-07_17-37-12,364.42339730262756,364.42339730262756,25808,LAPTOP-2OQK6GT5,127.0.0.1,364.42339730262756,1,False,False,True,0.15300672154002157,0.39848899797721926,0.0,0.5167681121564286,c32d0d5e
|
||||||
|
0.013537204772521452,0.10852488053708713,344.60119009017944,5,68.81447420120239,1765125436,,False,1,4762fbbb,2025-12-07_17-37-16,363.3258783817291,363.3258783817291,20760,LAPTOP-2OQK6GT5,127.0.0.1,363.3258783817291,1,False,False,True,0.13342603167575784,0.4010104919178914,0.0,0.618812411626611,4762fbbb
|
||||||
|
0.011763789518968464,0.09968897796498292,344.03784108161926,5,68.71829047203065,1765125803,,False,1,522ac97c,2025-12-07_17-43-23,364.7200028896332,364.7200028896332,2372,LAPTOP-2OQK6GT5,127.0.0.1,364.7200028896332,1,False,False,True,0.4489762005319642,0.402754966715804,0.0,0.6426372526242771,522ac97c
|
||||||
|
0.011650346524073398,0.09890157639017978,343.51321721076965,5,68.60030875205993,1765125805,,False,1,5784f433,2025-12-07_17-43-25,362.93026328086853,362.93026328086853,22900,LAPTOP-2OQK6GT5,127.0.0.1,362.93026328086853,1,False,False,True,0.46204975067512033,0.192768833446102,0.0,0.6328281433384326,5784f433
|
||||||
|
0.011650346524073398,0.09890157639017978,343.80972242355347,5,68.66908102035522,1765126172,,False,1,83af0528,2025-12-07_17-49-32,364.5850279331207,364.5850279331207,9832,LAPTOP-2OQK6GT5,127.0.0.1,364.5850279331207,1,False,False,True,0.4663139585990712,0.1845869678485352,0.0,0.6299207399141384,83af0528
|
||||||
|
0.011650346524073398,0.09890157639017978,344.11421155929565,5,68.72400512695313,1765126177,,False,1,12cbaa22,2025-12-07_17-49-37,364.24684858322144,364.24684858322144,5968,LAPTOP-2OQK6GT5,127.0.0.1,364.24684858322144,1,False,False,True,0.47277853181431145,0.40562176755388546,0.0,0.6314990057451438,12cbaa22
|
||||||
|
0.011763789518968464,0.09968897796498292,348.5801889896393,5,69.61860737800598,1765126547,,False,1,a3a87765,2025-12-07_17-55-47,369.27432322502136,369.27432322502136,24372,LAPTOP-2OQK6GT5,127.0.0.1,369.27432322502136,1,False,False,True,0.45010042945259804,0.2855696990924951,0.0,0.6351522397620386,a3a87765
|
||||||
|
0.0441989903761154,0.13204740781578367,347.0340585708618,5,69.31097078323364,1765126548,,False,1,cf2bad0c,2025-12-07_17-55-48,366.1882207393646,366.1882207393646,3272,LAPTOP-2OQK6GT5,127.0.0.1,366.1882207393646,1,False,False,False,0.5890116605741096,0.283660909026841,0.0,0.4602911956047037,cf2bad0c
|
||||||
|
0.0441989903761154,0.13204740781578367,343.53946828842163,5,68.61563892364502,1765126916,,False,1,9a9b91e7,2025-12-07_18-01-56,364.0171241760254,364.0171241760254,2272,LAPTOP-2OQK6GT5,127.0.0.1,364.0171241760254,1,False,False,False,0.6089594786916612,0.3646091181984181,0.0,0.46522499154449626,9a9b91e7
|
||||||
|
0.012199205894660016,0.10122473640840064,345.76200914382935,5,69.05782113075256,1765126922,,False,1,e326d901,2025-12-07_18-02-02,365.42848086357117,365.42848086357117,24932,LAPTOP-2OQK6GT5,127.0.0.1,365.42848086357117,1,False,False,True,0.5932289185132622,0.37353729921136775,0.0,0.46368845919414936,e326d901
|
||||||
|
0.011990281344944778,0.09910429396546264,344.40758872032166,5,68.7896653175354,1765127287,,False,1,ccb3f19a,2025-12-07_18-08-07,365.1469933986664,365.1469933986664,1104,LAPTOP-2OQK6GT5,127.0.0.1,365.1469933986664,1,True,False,True,0.6866411603181266,0.4537774266698106,0.0,0.3059281770286948,ccb3f19a
|
||||||
|
0.012186205997500013,0.1012282592390342,343.9386422634125,5,68.69270787239074,1765127290,,False,1,8c12c55f,2025-12-07_18-08-10,363.29733777046204,363.29733777046204,19700,LAPTOP-2OQK6GT5,127.0.0.1,363.29733777046204,1,True,False,True,0.6710404650258701,0.44441637238072235,0.0,0.2641320116724262,8c12c55f
|
||||||
|
0.0662709141213666,0.16851508812176408,359.4665718078613,5,71.7971097946167,1765127672,,False,1,5a62d5b6,2025-12-07_18-14-32,380.3328058719635,380.3328058719635,26528,LAPTOP-2OQK6GT5,127.0.0.1,380.3328058719635,1,True,True,True,0.40414134317929745,0.2010474655405967,0.0,0.59925716647257,5a62d5b6
|
||||||
|
0.07070075496425433,0.17390976502682037,356.3221182823181,5,71.16437225341797,1765127673,,False,1,bb4495b7,2025-12-07_18-14-33,375.9771683216095,375.9771683216095,21772,LAPTOP-2OQK6GT5,127.0.0.1,375.9771683216095,1,False,True,False,0.39073713326110354,0.5764393142467112,0.0,0.5413963334094041,bb4495b7
|
||||||
|
0.01153507274885726,0.09890157639017978,344.71807885169983,5,68.8583309173584,1765128044,,False,1,9d90711d,2025-12-07_18-20-44,365.7700536251068,365.7700536251068,17592,LAPTOP-2OQK6GT5,127.0.0.1,365.7700536251068,1,False,False,True,0.46895437796002276,0.5411583003121286,0.0,0.6350154738477746,9d90711d
|
||||||
|
0.01153507274885726,0.09890157639017978,343.69704604148865,5,68.64236354827881,1765128046,,False,1,daaec3f8,2025-12-07_18-20-46,363.0186264514923,363.0186264514923,21292,LAPTOP-2OQK6GT5,127.0.0.1,363.0186264514923,1,False,False,True,0.4743507729816579,0.5213407674549528,0.0,0.6445669851749475,daaec3f8
|
||||||
|
0.01153507274885726,0.09890157639017978,343.6039113998413,5,68.62933912277222,1765128413,,False,1,51fb5915,2025-12-07_18-26-53,364.0196588039398,364.0196588039398,21772,LAPTOP-2OQK6GT5,127.0.0.1,364.0196588039398,1,False,False,True,0.48541186574386475,0.5810500215434935,0.0,0.6463595394763801,51fb5915
|
||||||
|
0.01164485418311018,0.09964993325879434,344.2613036632538,5,68.75940155982971,1765128417,,False,1,18966a33,2025-12-07_18-26-57,363.3374502658844,363.3374502658844,16900,LAPTOP-2OQK6GT5,127.0.0.1,363.3374502658844,1,False,False,True,0.5501591363807381,0.5132901504443755,0.0,0.6489815927562321,18966a33
|
||||||
|
0.012314479669876154,0.10205118268939237,345.49542331695557,5,69.01211080551147,1765128785,,False,1,b67080f9,2025-12-07_18-33-05,366.01860308647156,366.01860308647156,20948,LAPTOP-2OQK6GT5,127.0.0.1,366.01860308647156,1,False,False,True,0.5534122098827526,0.5760738874546728,0.0,0.5609719434431071,b67080f9
|
||||||
|
0.07209115365923097,0.17918874278969218,351.96662616729736,5,70.29538555145264,1765128795,,False,1,2533f368,2025-12-07_18-33-15,371.205295085907,371.205295085907,11208,LAPTOP-2OQK6GT5,127.0.0.1,371.205295085907,1,False,True,True,0.5572268058153711,0.5246075332847907,0.0,0.558307419246103,2533f368
|
||||||
|
0.06479949428557605,0.16493742503085831,357.1695992946625,5,71.33717932701111,1765129169,,False,1,451d018d,2025-12-07_18-39-29,378.8273491859436,378.8273491859436,3616,LAPTOP-2OQK6GT5,127.0.0.1,378.8273491859436,1,False,True,False,0.6340187369543626,0.5494644274379972,0.0,0.6521052525663952,451d018d
|
||||||
|
0.04429208645222718,0.13283833222122038,349.41683983802795,5,69.77591800689697,1765129169,,False,1,2256e752,2025-12-07_18-39-29,369.8801362514496,369.8801362514496,25468,LAPTOP-2OQK6GT5,127.0.0.1,369.8801362514496,1,True,False,False,0.6478037819045206,0.6228629446714814,0.0,0.6546094515631737,2256e752
|
||||||
|
0.012292301970771797,0.10201566081383735,346.071848154068,5,69.12432713508606,1765129542,,False,1,0a892729,2025-12-07_18-45-42,367.237042427063,367.237042427063,26212,LAPTOP-2OQK6GT5,127.0.0.1,367.237042427063,1,False,False,True,0.42173310551322135,0.542928875009614,0.0,0.601586841052583,0a892729
|
||||||
|
0.012292301970771797,0.10201566081383735,346.42522287368774,5,69.19188222885131,1765129545,,False,1,495075f5,2025-12-07_18-45-45,365.53574872016907,365.53574872016907,23604,LAPTOP-2OQK6GT5,127.0.0.1,365.53574872016907,1,False,False,True,0.4186754897467695,0.6318747444402091,0.0,0.5956181518703515,495075f5
|
||||||
|
0.011974150685190959,0.10047637953978608,346.9409854412079,5,69.29810705184937,1765129915,,False,1,54c45552,2025-12-07_18-51-55,367.9469211101532,367.9469211101532,25352,LAPTOP-2OQK6GT5,127.0.0.1,367.9469211101532,1,False,False,True,0.46382270850905233,0.6196868829200468,0.0,0.6126115785559785,54c45552
|
||||||
|
0.011974150685190959,0.10047637953978608,346.4141414165497,5,69.18586716651916,1765129917,,False,1,6b2e9b93,2025-12-07_18-51-57,365.9887709617615,365.9887709617615,25400,LAPTOP-2OQK6GT5,127.0.0.1,365.9887709617615,1,False,False,True,0.4751854264500806,0.48925010555288895,0.0,0.515482483148412,6b2e9b93
|
||||||
|
0.01153507274885726,0.09890157639017978,346.25940680503845,5,69.15517511367798,1765130288,,False,1,e9a6b81f,2025-12-07_18-58-08,367.33222007751465,367.33222007751465,4036,LAPTOP-2OQK6GT5,127.0.0.1,367.33222007751465,1,False,False,True,0.4879296810791008,0.4925520261481197,0.0,0.6483489622744677,e9a6b81f
|
||||||
|
0.01153507274885726,0.09890157639017978,345.8425042629242,5,69.06782102584839,1765130290,,False,1,076c5450,2025-12-07_18-58-10,365.1877450942993,365.1877450942993,4832,LAPTOP-2OQK6GT5,127.0.0.1,365.1877450942993,1,False,False,True,0.48842171509426413,0.5881329256041945,0.0,0.6569193185887352,076c5450
|
||||||
|
0.011875401733542455,0.10047637953978608,350.2443346977234,5,69.94839100837707,1765130664,,False,1,4a42a3ea,2025-12-07_19-04-24,370.9968421459198,370.9968421459198,14912,LAPTOP-2OQK6GT5,127.0.0.1,370.9968421459198,1,False,False,True,0.5590357657789103,0.5940413385819063,0.0,0.6573225721220606,4a42a3ea
|
||||||
|
0.012080110024228227,0.10047637953978608,351.5000901222229,5,70.19009194374084,1765130669,,False,1,041795f1,2025-12-07_19-04-29,370.946097612381,370.946097612381,22372,LAPTOP-2OQK6GT5,127.0.0.1,370.946097612381,1,False,False,True,0.5650092236486315,0.6617440972899422,0.0,0.6629504776006702,041795f1
|
||||||
|
0.012314479669876154,0.10205118268939237,343.53907656669617,5,68.6134319782257,1765131035,,False,1,8abb3f37,2025-12-07_19-10-35,364.67463064193726,364.67463064193726,22012,LAPTOP-2OQK6GT5,127.0.0.1,364.67463064193726,1,False,False,True,0.48982107744168,0.4636820835063238,0.0,0.39458266779240964,8abb3f37
|
||||||
|
0.012314479669876154,0.10205118268939237,345.5919795036316,5,69.02381987571717,1765131040,,False,1,f2cb682e,2025-12-07_19-10-40,364.90754437446594,364.90754437446594,5752,LAPTOP-2OQK6GT5,127.0.0.1,364.90754437446594,1,True,False,True,0.4917954659583112,0.45224829356708557,0.0,0.42597097228928366,f2cb682e
|
||||||
|
0.012314479669876154,0.10205118268939237,349.50936698913574,5,69.80772981643676,1765131411,,False,1,463fe5e7,2025-12-07_19-16-51,370.56375885009766,370.56375885009766,16524,LAPTOP-2OQK6GT5,127.0.0.1,370.56375885009766,1,True,False,True,0.5373435635563055,0.5202382560972127,0.0,0.5340573143597149,463fe5e7
|
||||||
|
0.012083932119443879,0.10122473640840064,350.1439118385315,5,69.92809920310974,1765131415,,False,1,88bbe87d,2025-12-07_19-16-55,369.54999685287476,369.54999685287476,15084,LAPTOP-2OQK6GT5,127.0.0.1,369.54999685287476,1,False,False,True,0.5274586910866753,0.5110782288617315,0.0,0.5368958272648865,88bbe87d
|
||||||
|
0.011875401733542455,0.10047637953978608,355.52406072616577,5,71.00808920860291,1765131794,,False,1,33ea1cc6,2025-12-07_19-23-14,376.746440410614,376.746440410614,17380,LAPTOP-2OQK6GT5,127.0.0.1,376.746440410614,1,False,False,True,0.5229924883346121,0.5158065672775711,0.0,0.6679657240993034,33ea1cc6
|
||||||
|
0.011853224034438097,0.10044085766423105,355.67893862724304,5,71.0243070602417,1765131797,,False,1,1243723e,2025-12-07_19-23-17,375.44413685798645,375.44413685798645,11232,LAPTOP-2OQK6GT5,127.0.0.1,375.44413685798645,1,False,False,True,0.3726772055073363,0.5573152713604742,0.0,0.6766134238094554,1243723e
|
||||||
|
BIN
thesis_output/figures/figura_1.png
Normal file
|
After Width: | Height: | Size: 19 KiB |
BIN
thesis_output/figures/figura_2.png
Normal file
|
After Width: | Height: | Size: 19 KiB |
BIN
thesis_output/figures/figura_3.png
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
thesis_output/figures/figura_4.png
Normal file
|
After Width: | Height: | Size: 37 KiB |
BIN
thesis_output/figures/figura_5.png
Normal file
|
After Width: | Height: | Size: 29 KiB |
BIN
thesis_output/figures/figura_6.png
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
thesis_output/figures/figura_7.png
Normal file
|
After Width: | Height: | Size: 18 KiB |
BIN
thesis_output/figures/figura_8.png
Normal file
|
After Width: | Height: | Size: 44 KiB |
42
thesis_output/figures/figures_manifest.json
Normal file
@@ -0,0 +1,42 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"file": "figura_1.png",
|
||||||
|
"title": "Pipeline de un sistema OCR moderno",
|
||||||
|
"index": 1
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_2.png",
|
||||||
|
"title": "Ciclo de optimización con Ray Tune y Optuna",
|
||||||
|
"index": 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_3.png",
|
||||||
|
"title": "Fases de la metodología experimental",
|
||||||
|
"index": 3
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_4.png",
|
||||||
|
"title": "Estructura del dataset de evaluación",
|
||||||
|
"index": 4
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_5.png",
|
||||||
|
"title": "Arquitectura de ejecución con subprocesos",
|
||||||
|
"index": 5
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_6.png",
|
||||||
|
"title": "Impacto de textline_orientation en CER",
|
||||||
|
"index": 6
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_7.png",
|
||||||
|
"title": "Comparación Baseline vs Optimizado (24 páginas)",
|
||||||
|
"index": 7
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"file": "figura_8.png",
|
||||||
|
"title": "Estructura del repositorio del proyecto",
|
||||||
|
"index": 8
|
||||||
|
}
|
||||||
|
]
|
||||||
BIN
thesis_output/plantilla_individual.htm
Normal file
@@ -0,0 +1,2 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
||||||
|
<a:clrMap xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" bg1="lt1" tx1="dk1" bg2="lt2" tx2="dk2" accent1="accent1" accent2="accent2" accent3="accent3" accent4="accent4" accent5="accent5" accent6="accent6" hlink="hlink" folHlink="folHlink"/>
|
||||||
15
thesis_output/plantilla_individual_files/filelist.xml
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
<xml xmlns:o="urn:schemas-microsoft-com:office:office">
|
||||||
|
<o:MainFile HRef="../plantilla_individual.htm"/>
|
||||||
|
<o:File HRef="item0013.xml"/>
|
||||||
|
<o:File HRef="props014.xml"/>
|
||||||
|
<o:File HRef="item0015.xml"/>
|
||||||
|
<o:File HRef="props016.xml"/>
|
||||||
|
<o:File HRef="item0017.xml"/>
|
||||||
|
<o:File HRef="props018.xml"/>
|
||||||
|
<o:File HRef="themedata.thmx"/>
|
||||||
|
<o:File HRef="colorschememapping.xml"/>
|
||||||
|
<o:File HRef="image001.png"/>
|
||||||
|
<o:File HRef="image002.gif"/>
|
||||||
|
<o:File HRef="header.htm"/>
|
||||||
|
<o:File HRef="filelist.xml"/>
|
||||||
|
</xml>
|
||||||
BIN
thesis_output/plantilla_individual_files/header.htm
Normal file
BIN
thesis_output/plantilla_individual_files/image001.png
Normal file
|
After Width: | Height: | Size: 10 KiB |
BIN
thesis_output/plantilla_individual_files/image002.gif
Normal file
|
After Width: | Height: | Size: 3.9 KiB |
BIN
thesis_output/plantilla_individual_files/image003.gif
Normal file
|
After Width: | Height: | Size: 3.9 KiB |
BIN
thesis_output/plantilla_individual_files/image003.png
Normal file
|
After Width: | Height: | Size: 23 KiB |
BIN
thesis_output/plantilla_individual_files/image004.jpg
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
thesis_output/plantilla_individual_files/image005.png
Normal file
|
After Width: | Height: | Size: 13 KiB |
BIN
thesis_output/plantilla_individual_files/image006.gif
Normal file
|
After Width: | Height: | Size: 25 KiB |
258
thesis_output/plantilla_individual_files/item0001.xml
Normal file
@@ -0,0 +1,258 @@
|
|||||||
|
<?xml version="1.0" encoding="utf-8"?><ct:contentTypeSchema ct:_="" ma:_="" ma:contentTypeName="Documento" ma:contentTypeID="0x010100DF3D7C797EA12745A270EF30E38719B9" ma:contentTypeVersion="19" ma:contentTypeDescription="Crear nuevo documento." ma:contentTypeScope="" ma:versionID="227b02526234ef39b0b78895a9d90cf5" xmlns:ct="http://schemas.microsoft.com/office/2006/metadata/contentType" xmlns:ma="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes">
|
||||||
|
<xsd:schema targetNamespace="http://schemas.microsoft.com/office/2006/metadata/properties" ma:root="true" ma:fieldsID="3c939c8607e2f594db8bbb23634dd059" ns2:_="" ns3:_="" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:ns2="0a70e875-3d35-4be2-921f-7117c31bab9b" xmlns:ns3="27c1adeb-3674-457c-b08c-8a73f31b6e23">
|
||||||
|
<xsd:import namespace="0a70e875-3d35-4be2-921f-7117c31bab9b"/>
|
||||||
|
<xsd:import namespace="27c1adeb-3674-457c-b08c-8a73f31b6e23"/>
|
||||||
|
<xsd:element name="properties">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="documentManagement">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:all>
|
||||||
|
<xsd:element ref="ns2:SharedWithUsers" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns2:SharedWithDetails" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceMetadata" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceFastMetadata" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceAutoKeyPoints" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceKeyPoints" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceAutoTags" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceOCR" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceGenerationTime" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceEventHashCode" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceDateTaken" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaLengthInSeconds" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceLocation" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:lcf76f155ced4ddcb4097134ff3c332f" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns2:TaxCatchAll" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceSearchProperties" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:_Flow_SignoffStatus" minOccurs="0"/>
|
||||||
|
<xsd:element ref="ns3:MediaServiceObjectDetectorVersions" minOccurs="0"/>
|
||||||
|
</xsd:all>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="0a70e875-3d35-4be2-921f-7117c31bab9b" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
|
||||||
|
<xsd:element name="SharedWithUsers" ma:index="8" nillable="true" ma:displayName="Compartido con" ma:internalName="SharedWithUsers" ma:readOnly="true">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:complexContent>
|
||||||
|
<xsd:extension base="dms:UserMulti">
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="UserInfo" minOccurs="0" maxOccurs="unbounded">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="DisplayName" type="xsd:string" minOccurs="0"/>
|
||||||
|
<xsd:element name="AccountId" type="dms:UserId" minOccurs="0" nillable="true"/>
|
||||||
|
<xsd:element name="AccountType" type="xsd:string" minOccurs="0"/>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:extension>
|
||||||
|
</xsd:complexContent>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="SharedWithDetails" ma:index="9" nillable="true" ma:displayName="Detalles de uso compartido" ma:internalName="SharedWithDetails" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="TaxCatchAll" ma:index="23" nillable="true" ma:displayName="Taxonomy Catch All Column" ma:hidden="true" ma:list="{c7f67346-78c9-4c4d-b954-8d350fdf60db}" ma:internalName="TaxCatchAll" ma:showField="CatchAllData" ma:web="0a70e875-3d35-4be2-921f-7117c31bab9b">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:complexContent>
|
||||||
|
<xsd:extension base="dms:MultiChoiceLookup">
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element name="Value" type="dms:Lookup" maxOccurs="unbounded" minOccurs="0" nillable="true"/>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:extension>
|
||||||
|
</xsd:complexContent>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="27c1adeb-3674-457c-b08c-8a73f31b6e23" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
|
||||||
|
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
|
||||||
|
<xsd:element name="MediaServiceMetadata" ma:index="10" nillable="true" ma:displayName="MediaServiceMetadata" ma:hidden="true" ma:internalName="MediaServiceMetadata" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceFastMetadata" ma:index="11" nillable="true" ma:displayName="MediaServiceFastMetadata" ma:hidden="true" ma:internalName="MediaServiceFastMetadata" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceAutoKeyPoints" ma:index="12" nillable="true" ma:displayName="MediaServiceAutoKeyPoints" ma:hidden="true" ma:internalName="MediaServiceAutoKeyPoints" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceKeyPoints" ma:index="13" nillable="true" ma:displayName="KeyPoints" ma:internalName="MediaServiceKeyPoints" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceAutoTags" ma:index="14" nillable="true" ma:displayName="Tags" ma:internalName="MediaServiceAutoTags" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceOCR" ma:index="15" nillable="true" ma:displayName="Extracted Text" ma:internalName="MediaServiceOCR" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note">
|
||||||
|
<xsd:maxLength value="255"/>
|
||||||
|
</xsd:restriction>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceGenerationTime" ma:index="16" nillable="true" ma:displayName="MediaServiceGenerationTime" ma:hidden="true" ma:internalName="MediaServiceGenerationTime" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceEventHashCode" ma:index="17" nillable="true" ma:displayName="MediaServiceEventHashCode" ma:hidden="true" ma:internalName="MediaServiceEventHashCode" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceDateTaken" ma:index="18" nillable="true" ma:displayName="MediaServiceDateTaken" ma:hidden="true" ma:internalName="MediaServiceDateTaken" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaLengthInSeconds" ma:index="19" nillable="true" ma:displayName="Length (seconds)" ma:internalName="MediaLengthInSeconds" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Unknown"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceLocation" ma:index="20" nillable="true" ma:displayName="Location" ma:internalName="MediaServiceLocation" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="lcf76f155ced4ddcb4097134ff3c332f" ma:index="22" nillable="true" ma:taxonomy="true" ma:internalName="lcf76f155ced4ddcb4097134ff3c332f" ma:taxonomyFieldName="MediaServiceImageTags" ma:displayName="Etiquetas de imagen" ma:readOnly="false" ma:fieldId="{5cf76f15-5ced-4ddc-b409-7134ff3c332f}" ma:taxonomyMulti="true" ma:sspId="17631b59-e624-4eb7-963c-219f14f887a3" ma:termSetId="09814cd3-568e-fe90-9814-8d621ff8fb84" ma:anchorId="fba54fb3-c3e1-fe81-a776-ca4b69148c4d" ma:open="true" ma:isKeyword="false">
|
||||||
|
<xsd:complexType>
|
||||||
|
<xsd:sequence>
|
||||||
|
<xsd:element ref="pc:Terms" minOccurs="0" maxOccurs="1"></xsd:element>
|
||||||
|
</xsd:sequence>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceSearchProperties" ma:index="24" nillable="true" ma:displayName="MediaServiceSearchProperties" ma:hidden="true" ma:internalName="MediaServiceSearchProperties" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Note"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="_Flow_SignoffStatus" ma:index="25" nillable="true" ma:displayName="Estado de aprobación" ma:internalName="Estado_x0020_de_x0020_aprobaci_x00f3_n">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="MediaServiceObjectDetectorVersions" ma:index="26" nillable="true" ma:displayName="MediaServiceObjectDetectorVersions" ma:description="" ma:hidden="true" ma:indexed="true" ma:internalName="MediaServiceObjectDetectorVersions" ma:readOnly="true">
|
||||||
|
<xsd:simpleType>
|
||||||
|
<xsd:restriction base="dms:Text"/>
|
||||||
|
</xsd:simpleType>
|
||||||
|
</xsd:element>
|
||||||
|
</xsd:schema>
|
||||||
|
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" elementFormDefault="qualified" attributeFormDefault="unqualified" blockDefault="#all" xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:odoc="http://schemas.microsoft.com/internal/obd">
|
||||||
|
<xsd:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd"/>
|
||||||
|
<xsd:import namespace="http://purl.org/dc/terms/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
|
||||||
|
<xsd:element name="coreProperties" type="CT_coreProperties"/>
|
||||||
|
<xsd:complexType name="CT_coreProperties">
|
||||||
|
<xsd:all>
|
||||||
|
<xsd:element ref="dc:creator" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dc:identifier" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="contentType" minOccurs="0" maxOccurs="1" type="xsd:string" ma:index="0" ma:displayName="Tipo de contenido"/>
|
||||||
|
<xsd:element ref="dc:title" minOccurs="0" maxOccurs="1" ma:index="4" ma:displayName="Título"/>
|
||||||
|
<xsd:element ref="dc:subject" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element ref="dc:description" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="keywords" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element ref="dc:language" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="category" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element name="version" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element name="revision" minOccurs="0" maxOccurs="1" type="xsd:string">
|
||||||
|
<xsd:annotation>
|
||||||
|
<xsd:documentation>
|
||||||
|
This value indicates the number of saves or revisions. The application is responsible for updating this value after each revision.
|
||||||
|
</xsd:documentation>
|
||||||
|
</xsd:annotation>
|
||||||
|
</xsd:element>
|
||||||
|
<xsd:element name="lastModifiedBy" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
<xsd:element ref="dcterms:modified" minOccurs="0" maxOccurs="1"/>
|
||||||
|
<xsd:element name="contentStatus" minOccurs="0" maxOccurs="1" type="xsd:string"/>
|
||||||
|
</xsd:all>
|
||||||
|
</xsd:complexType>
|
||||||
|
</xsd:schema>
|
||||||
|
<xs:schema targetNamespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" xmlns:xs="http://www.w3.org/2001/XMLSchema">
|
||||||
|
<xs:element name="Person">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:DisplayName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:AccountId" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:AccountType" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="DisplayName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="AccountId" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="AccountType" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="BDCAssociatedEntity">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:BDCEntity" minOccurs="0" maxOccurs="unbounded"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
<xs:attribute ref="pc:EntityNamespace"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:EntityName"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:SystemInstanceName"></xs:attribute>
|
||||||
|
<xs:attribute ref="pc:AssociationName"></xs:attribute>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:attribute name="EntityNamespace" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="EntityName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="SystemInstanceName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:attribute name="AssociationName" type="xs:string"></xs:attribute>
|
||||||
|
<xs:element name="BDCEntity">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:EntityDisplayName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityInstanceReference" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId1" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId2" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId3" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId4" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:EntityId5" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="EntityDisplayName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityInstanceReference" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId1" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId2" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId3" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId4" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="EntityId5" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="Terms">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:TermInfo" minOccurs="0" maxOccurs="unbounded"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="TermInfo">
|
||||||
|
<xs:complexType>
|
||||||
|
<xs:sequence>
|
||||||
|
<xs:element ref="pc:TermName" minOccurs="0"></xs:element>
|
||||||
|
<xs:element ref="pc:TermId" minOccurs="0"></xs:element>
|
||||||
|
</xs:sequence>
|
||||||
|
</xs:complexType>
|
||||||
|
</xs:element>
|
||||||
|
<xs:element name="TermName" type="xs:string"></xs:element>
|
||||||
|
<xs:element name="TermId" type="xs:string"></xs:element>
|
||||||
|
</xs:schema>
|
||||||
|
</ct:contentTypeSchema>
|
||||||
1
thesis_output/plantilla_individual_files/item0003.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?><b:Sources SelectedStyle="\APASixthEditionOfficeOnline.xsl" StyleName="APA" Version="6" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"><b:Source><b:Tag>Dor81</b:Tag><b:SourceType>JournalArticle</b:SourceType><b:Guid>{D7C468B5-5E32-4254-9330-6DB2DDB01037}</b:Guid><b:Title>There's a S.M.A.R.T. way to write management's goals and objectives</b:Title><b:Year>1981</b:Year><b:Author><b:Author><b:NameList><b:Person><b:Last>Doran</b:Last><b:First>G.</b:First><b:Middle>T.</b:Middle></b:Person></b:NameList></b:Author></b:Author><b:JournalName>Management Review (AMA FORUM)</b:JournalName><b:Pages>35-36</b:Pages><b:Volume>70</b:Volume><b:RefOrder>1</b:RefOrder></b:Source></b:Sources>
1
thesis_output/plantilla_individual_files/item0005.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="utf-8"?><p:properties xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"><documentManagement><lcf76f155ced4ddcb4097134ff3c332f xmlns="27c1adeb-3674-457c-b08c-8a73f31b6e23"><Terms xmlns="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"></Terms></lcf76f155ced4ddcb4097134ff3c332f><TaxCatchAll xmlns="0a70e875-3d35-4be2-921f-7117c31bab9b" xsi:nil="true"/><_Flow_SignoffStatus xmlns="27c1adeb-3674-457c-b08c-8a73f31b6e23" xsi:nil="true"/></documentManagement></p:properties>
1
thesis_output/plantilla_individual_files/item0007.xml
Normal file
@@ -0,0 +1 @@
<?mso-contentType?><FormTemplates xmlns="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"><Display>DocumentLibraryForm</Display><Edit>DocumentLibraryForm</Edit><New>DocumentLibraryForm</New></FormTemplates>
258
thesis_output/plantilla_individual_files/item0013.xml
Normal file
@@ -0,0 +1,258 @@
<?xml version="1.0" encoding="utf-8"?><ct:contentTypeSchema ct:_="" ma:_="" ma:contentTypeName="Documento" ma:contentTypeID="0x010100DF3D7C797EA12745A270EF30E38719B9" ma:contentTypeVersion="19" ma:contentTypeDescription="Crear nuevo documento." ma:contentTypeScope="" ma:versionID="227b02526234ef39b0b78895a9d90cf5" xmlns:ct="http://schemas.microsoft.com/office/2006/metadata/contentType" xmlns:ma="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes">
<xsd:schema targetNamespace="http://schemas.microsoft.com/office/2006/metadata/properties" ma:root="true" ma:fieldsID="3c939c8607e2f594db8bbb23634dd059" ns2:_="" ns3:_="" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:ns2="0a70e875-3d35-4be2-921f-7117c31bab9b" xmlns:ns3="27c1adeb-3674-457c-b08c-8a73f31b6e23">
<xsd:import namespace="0a70e875-3d35-4be2-921f-7117c31bab9b"/>
<xsd:import namespace="27c1adeb-3674-457c-b08c-8a73f31b6e23"/>
<xsd:element name="properties">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="documentManagement">
<xsd:complexType>
<xsd:all>
<xsd:element ref="ns2:SharedWithUsers" minOccurs="0"/>
<xsd:element ref="ns2:SharedWithDetails" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceMetadata" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceFastMetadata" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceAutoKeyPoints" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceKeyPoints" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceAutoTags" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceOCR" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceGenerationTime" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceEventHashCode" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceDateTaken" minOccurs="0"/>
<xsd:element ref="ns3:MediaLengthInSeconds" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceLocation" minOccurs="0"/>
<xsd:element ref="ns3:lcf76f155ced4ddcb4097134ff3c332f" minOccurs="0"/>
<xsd:element ref="ns2:TaxCatchAll" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceSearchProperties" minOccurs="0"/>
<xsd:element ref="ns3:_Flow_SignoffStatus" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceObjectDetectorVersions" minOccurs="0"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="0a70e875-3d35-4be2-921f-7117c31bab9b" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
<xsd:element name="SharedWithUsers" ma:index="8" nillable="true" ma:displayName="Compartido con" ma:internalName="SharedWithUsers" ma:readOnly="true">
<xsd:complexType>
<xsd:complexContent>
<xsd:extension base="dms:UserMulti">
<xsd:sequence>
<xsd:element name="UserInfo" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="DisplayName" type="xsd:string" minOccurs="0"/>
<xsd:element name="AccountId" type="dms:UserId" minOccurs="0" nillable="true"/>
<xsd:element name="AccountType" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
<xsd:element name="SharedWithDetails" ma:index="9" nillable="true" ma:displayName="Detalles de uso compartido" ma:internalName="SharedWithDetails" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="TaxCatchAll" ma:index="23" nillable="true" ma:displayName="Taxonomy Catch All Column" ma:hidden="true" ma:list="{c7f67346-78c9-4c4d-b954-8d350fdf60db}" ma:internalName="TaxCatchAll" ma:showField="CatchAllData" ma:web="0a70e875-3d35-4be2-921f-7117c31bab9b">
<xsd:complexType>
<xsd:complexContent>
<xsd:extension base="dms:MultiChoiceLookup">
<xsd:sequence>
<xsd:element name="Value" type="dms:Lookup" maxOccurs="unbounded" minOccurs="0" nillable="true"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="27c1adeb-3674-457c-b08c-8a73f31b6e23" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
<xsd:element name="MediaServiceMetadata" ma:index="10" nillable="true" ma:displayName="MediaServiceMetadata" ma:hidden="true" ma:internalName="MediaServiceMetadata" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceFastMetadata" ma:index="11" nillable="true" ma:displayName="MediaServiceFastMetadata" ma:hidden="true" ma:internalName="MediaServiceFastMetadata" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceAutoKeyPoints" ma:index="12" nillable="true" ma:displayName="MediaServiceAutoKeyPoints" ma:hidden="true" ma:internalName="MediaServiceAutoKeyPoints" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceKeyPoints" ma:index="13" nillable="true" ma:displayName="KeyPoints" ma:internalName="MediaServiceKeyPoints" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceAutoTags" ma:index="14" nillable="true" ma:displayName="Tags" ma:internalName="MediaServiceAutoTags" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceOCR" ma:index="15" nillable="true" ma:displayName="Extracted Text" ma:internalName="MediaServiceOCR" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceGenerationTime" ma:index="16" nillable="true" ma:displayName="MediaServiceGenerationTime" ma:hidden="true" ma:internalName="MediaServiceGenerationTime" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceEventHashCode" ma:index="17" nillable="true" ma:displayName="MediaServiceEventHashCode" ma:hidden="true" ma:internalName="MediaServiceEventHashCode" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceDateTaken" ma:index="18" nillable="true" ma:displayName="MediaServiceDateTaken" ma:hidden="true" ma:internalName="MediaServiceDateTaken" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaLengthInSeconds" ma:index="19" nillable="true" ma:displayName="Length (seconds)" ma:internalName="MediaLengthInSeconds" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Unknown"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceLocation" ma:index="20" nillable="true" ma:displayName="Location" ma:internalName="MediaServiceLocation" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="lcf76f155ced4ddcb4097134ff3c332f" ma:index="22" nillable="true" ma:taxonomy="true" ma:internalName="lcf76f155ced4ddcb4097134ff3c332f" ma:taxonomyFieldName="MediaServiceImageTags" ma:displayName="Etiquetas de imagen" ma:readOnly="false" ma:fieldId="{5cf76f15-5ced-4ddc-b409-7134ff3c332f}" ma:taxonomyMulti="true" ma:sspId="17631b59-e624-4eb7-963c-219f14f887a3" ma:termSetId="09814cd3-568e-fe90-9814-8d621ff8fb84" ma:anchorId="fba54fb3-c3e1-fe81-a776-ca4b69148c4d" ma:open="true" ma:isKeyword="false">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="pc:Terms" minOccurs="0" maxOccurs="1"></xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="MediaServiceSearchProperties" ma:index="24" nillable="true" ma:displayName="MediaServiceSearchProperties" ma:hidden="true" ma:internalName="MediaServiceSearchProperties" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="_Flow_SignoffStatus" ma:index="25" nillable="true" ma:displayName="Estado de aprobación" ma:internalName="Estado_x0020_de_x0020_aprobaci_x00f3_n">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceObjectDetectorVersions" ma:index="26" nillable="true" ma:displayName="MediaServiceObjectDetectorVersions" ma:description="" ma:hidden="true" ma:indexed="true" ma:internalName="MediaServiceObjectDetectorVersions" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" elementFormDefault="qualified" attributeFormDefault="unqualified" blockDefault="#all" xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:odoc="http://schemas.microsoft.com/internal/obd">
<xsd:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd"/>
<xsd:import namespace="http://purl.org/dc/terms/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
<xsd:element name="coreProperties" type="CT_coreProperties"/>
<xsd:complexType name="CT_coreProperties">
<xsd:all>
<xsd:element ref="dc:creator" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dc:identifier" minOccurs="0" maxOccurs="1"/>
<xsd:element name="contentType" minOccurs="0" maxOccurs="1" type="xsd:string" ma:index="0" ma:displayName="Tipo de contenido"/>
<xsd:element ref="dc:title" minOccurs="0" maxOccurs="1" ma:index="4" ma:displayName="Título"/>
<xsd:element ref="dc:subject" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dc:description" minOccurs="0" maxOccurs="1"/>
<xsd:element name="keywords" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element ref="dc:language" minOccurs="0" maxOccurs="1"/>
<xsd:element name="category" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element name="version" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element name="revision" minOccurs="0" maxOccurs="1" type="xsd:string">
<xsd:annotation>
<xsd:documentation>
This value indicates the number of saves or revisions. The application is responsible for updating this value after each revision.
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="lastModifiedBy" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element ref="dcterms:modified" minOccurs="0" maxOccurs="1"/>
<xsd:element name="contentStatus" minOccurs="0" maxOccurs="1" type="xsd:string"/>
</xsd:all>
</xsd:complexType>
</xsd:schema>
<xs:schema targetNamespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Person">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:DisplayName" minOccurs="0"></xs:element>
<xs:element ref="pc:AccountId" minOccurs="0"></xs:element>
<xs:element ref="pc:AccountType" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="DisplayName" type="xs:string"></xs:element>
<xs:element name="AccountId" type="xs:string"></xs:element>
<xs:element name="AccountType" type="xs:string"></xs:element>
<xs:element name="BDCAssociatedEntity">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:BDCEntity" minOccurs="0" maxOccurs="unbounded"></xs:element>
</xs:sequence>
<xs:attribute ref="pc:EntityNamespace"></xs:attribute>
<xs:attribute ref="pc:EntityName"></xs:attribute>
<xs:attribute ref="pc:SystemInstanceName"></xs:attribute>
<xs:attribute ref="pc:AssociationName"></xs:attribute>
</xs:complexType>
</xs:element>
<xs:attribute name="EntityNamespace" type="xs:string"></xs:attribute>
<xs:attribute name="EntityName" type="xs:string"></xs:attribute>
<xs:attribute name="SystemInstanceName" type="xs:string"></xs:attribute>
<xs:attribute name="AssociationName" type="xs:string"></xs:attribute>
<xs:element name="BDCEntity">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:EntityDisplayName" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityInstanceReference" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId1" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId2" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId3" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId4" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId5" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="EntityDisplayName" type="xs:string"></xs:element>
<xs:element name="EntityInstanceReference" type="xs:string"></xs:element>
<xs:element name="EntityId1" type="xs:string"></xs:element>
<xs:element name="EntityId2" type="xs:string"></xs:element>
<xs:element name="EntityId3" type="xs:string"></xs:element>
<xs:element name="EntityId4" type="xs:string"></xs:element>
<xs:element name="EntityId5" type="xs:string"></xs:element>
<xs:element name="Terms">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:TermInfo" minOccurs="0" maxOccurs="unbounded"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="TermInfo">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:TermName" minOccurs="0"></xs:element>
<xs:element ref="pc:TermId" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="TermName" type="xs:string"></xs:element>
<xs:element name="TermId" type="xs:string"></xs:element>
</xs:schema>
</ct:contentTypeSchema>
1
thesis_output/plantilla_individual_files/item0015.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?><b:Sources SelectedStyle="\APASixthEditionOfficeOnline.xsl" StyleName="APA" Version="6" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"><b:Source><b:Tag>Dor81</b:Tag><b:SourceType>JournalArticle</b:SourceType><b:Guid>{D7C468B5-5E32-4254-9330-6DB2DDB01037}</b:Guid><b:Title>There's a S.M.A.R.T. way to write management's goals and objectives</b:Title><b:Year>1981</b:Year><b:Author><b:Author><b:NameList><b:Person><b:Last>Doran</b:Last><b:First>G.</b:First><b:Middle>T.</b:Middle></b:Person></b:NameList></b:Author></b:Author><b:JournalName>Management Review (AMA FORUM)</b:JournalName><b:Pages>35-36</b:Pages><b:Volume>70</b:Volume><b:RefOrder>1</b:RefOrder></b:Source></b:Sources>
1
thesis_output/plantilla_individual_files/item0017.xml
Normal file
@@ -0,0 +1 @@
<?mso-contentType?><FormTemplates xmlns="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"><Display>DocumentLibraryForm</Display><Edit>DocumentLibraryForm</Edit><New>DocumentLibraryForm</New></FormTemplates>
258
thesis_output/plantilla_individual_files/item0019.xml
Normal file
@@ -0,0 +1,258 @@
<?xml version="1.0" encoding="utf-8"?><ct:contentTypeSchema ct:_="" ma:_="" ma:contentTypeName="Documento" ma:contentTypeID="0x010100DF3D7C797EA12745A270EF30E38719B9" ma:contentTypeVersion="19" ma:contentTypeDescription="Crear nuevo documento." ma:contentTypeScope="" ma:versionID="227b02526234ef39b0b78895a9d90cf5" xmlns:ct="http://schemas.microsoft.com/office/2006/metadata/contentType" xmlns:ma="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes">
<xsd:schema targetNamespace="http://schemas.microsoft.com/office/2006/metadata/properties" ma:root="true" ma:fieldsID="3c939c8607e2f594db8bbb23634dd059" ns2:_="" ns3:_="" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:ns2="0a70e875-3d35-4be2-921f-7117c31bab9b" xmlns:ns3="27c1adeb-3674-457c-b08c-8a73f31b6e23">
<xsd:import namespace="0a70e875-3d35-4be2-921f-7117c31bab9b"/>
<xsd:import namespace="27c1adeb-3674-457c-b08c-8a73f31b6e23"/>
<xsd:element name="properties">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="documentManagement">
<xsd:complexType>
<xsd:all>
<xsd:element ref="ns2:SharedWithUsers" minOccurs="0"/>
<xsd:element ref="ns2:SharedWithDetails" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceMetadata" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceFastMetadata" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceAutoKeyPoints" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceKeyPoints" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceAutoTags" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceOCR" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceGenerationTime" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceEventHashCode" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceDateTaken" minOccurs="0"/>
<xsd:element ref="ns3:MediaLengthInSeconds" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceLocation" minOccurs="0"/>
<xsd:element ref="ns3:lcf76f155ced4ddcb4097134ff3c332f" minOccurs="0"/>
<xsd:element ref="ns2:TaxCatchAll" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceSearchProperties" minOccurs="0"/>
<xsd:element ref="ns3:_Flow_SignoffStatus" minOccurs="0"/>
<xsd:element ref="ns3:MediaServiceObjectDetectorVersions" minOccurs="0"/>
</xsd:all>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="0a70e875-3d35-4be2-921f-7117c31bab9b" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
<xsd:element name="SharedWithUsers" ma:index="8" nillable="true" ma:displayName="Compartido con" ma:internalName="SharedWithUsers" ma:readOnly="true">
<xsd:complexType>
<xsd:complexContent>
<xsd:extension base="dms:UserMulti">
<xsd:sequence>
<xsd:element name="UserInfo" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="DisplayName" type="xsd:string" minOccurs="0"/>
<xsd:element name="AccountId" type="dms:UserId" minOccurs="0" nillable="true"/>
<xsd:element name="AccountType" type="xsd:string" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
<xsd:element name="SharedWithDetails" ma:index="9" nillable="true" ma:displayName="Detalles de uso compartido" ma:internalName="SharedWithDetails" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="TaxCatchAll" ma:index="23" nillable="true" ma:displayName="Taxonomy Catch All Column" ma:hidden="true" ma:list="{c7f67346-78c9-4c4d-b954-8d350fdf60db}" ma:internalName="TaxCatchAll" ma:showField="CatchAllData" ma:web="0a70e875-3d35-4be2-921f-7117c31bab9b">
<xsd:complexType>
<xsd:complexContent>
<xsd:extension base="dms:MultiChoiceLookup">
<xsd:sequence>
<xsd:element name="Value" type="dms:Lookup" maxOccurs="unbounded" minOccurs="0" nillable="true"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="27c1adeb-3674-457c-b08c-8a73f31b6e23" elementFormDefault="qualified" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dms="http://schemas.microsoft.com/office/2006/documentManagement/types" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
<xsd:import namespace="http://schemas.microsoft.com/office/2006/documentManagement/types"/>
<xsd:import namespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/>
<xsd:element name="MediaServiceMetadata" ma:index="10" nillable="true" ma:displayName="MediaServiceMetadata" ma:hidden="true" ma:internalName="MediaServiceMetadata" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceFastMetadata" ma:index="11" nillable="true" ma:displayName="MediaServiceFastMetadata" ma:hidden="true" ma:internalName="MediaServiceFastMetadata" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceAutoKeyPoints" ma:index="12" nillable="true" ma:displayName="MediaServiceAutoKeyPoints" ma:hidden="true" ma:internalName="MediaServiceAutoKeyPoints" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceKeyPoints" ma:index="13" nillable="true" ma:displayName="KeyPoints" ma:internalName="MediaServiceKeyPoints" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceAutoTags" ma:index="14" nillable="true" ma:displayName="Tags" ma:internalName="MediaServiceAutoTags" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceOCR" ma:index="15" nillable="true" ma:displayName="Extracted Text" ma:internalName="MediaServiceOCR" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note">
<xsd:maxLength value="255"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceGenerationTime" ma:index="16" nillable="true" ma:displayName="MediaServiceGenerationTime" ma:hidden="true" ma:internalName="MediaServiceGenerationTime" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceEventHashCode" ma:index="17" nillable="true" ma:displayName="MediaServiceEventHashCode" ma:hidden="true" ma:internalName="MediaServiceEventHashCode" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceDateTaken" ma:index="18" nillable="true" ma:displayName="MediaServiceDateTaken" ma:hidden="true" ma:internalName="MediaServiceDateTaken" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaLengthInSeconds" ma:index="19" nillable="true" ma:displayName="Length (seconds)" ma:internalName="MediaLengthInSeconds" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Unknown"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceLocation" ma:index="20" nillable="true" ma:displayName="Location" ma:internalName="MediaServiceLocation" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="lcf76f155ced4ddcb4097134ff3c332f" ma:index="22" nillable="true" ma:taxonomy="true" ma:internalName="lcf76f155ced4ddcb4097134ff3c332f" ma:taxonomyFieldName="MediaServiceImageTags" ma:displayName="Etiquetas de imagen" ma:readOnly="false" ma:fieldId="{5cf76f15-5ced-4ddc-b409-7134ff3c332f}" ma:taxonomyMulti="true" ma:sspId="17631b59-e624-4eb7-963c-219f14f887a3" ma:termSetId="09814cd3-568e-fe90-9814-8d621ff8fb84" ma:anchorId="fba54fb3-c3e1-fe81-a776-ca4b69148c4d" ma:open="true" ma:isKeyword="false">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="pc:Terms" minOccurs="0" maxOccurs="1"></xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="MediaServiceSearchProperties" ma:index="24" nillable="true" ma:displayName="MediaServiceSearchProperties" ma:hidden="true" ma:internalName="MediaServiceSearchProperties" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Note"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="_Flow_SignoffStatus" ma:index="25" nillable="true" ma:displayName="Estado de aprobación" ma:internalName="Estado_x0020_de_x0020_aprobaci_x00f3_n">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
<xsd:element name="MediaServiceObjectDetectorVersions" ma:index="26" nillable="true" ma:displayName="MediaServiceObjectDetectorVersions" ma:description="" ma:hidden="true" ma:indexed="true" ma:internalName="MediaServiceObjectDetectorVersions" ma:readOnly="true">
<xsd:simpleType>
<xsd:restriction base="dms:Text"/>
</xsd:simpleType>
</xsd:element>
</xsd:schema>
<xsd:schema targetNamespace="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" elementFormDefault="qualified" attributeFormDefault="unqualified" blockDefault="#all" xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:odoc="http://schemas.microsoft.com/internal/obd">
<xsd:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd"/>
<xsd:import namespace="http://purl.org/dc/terms/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dcterms.xsd"/>
<xsd:element name="coreProperties" type="CT_coreProperties"/>
<xsd:complexType name="CT_coreProperties">
<xsd:all>
<xsd:element ref="dc:creator" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dcterms:created" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dc:identifier" minOccurs="0" maxOccurs="1"/>
<xsd:element name="contentType" minOccurs="0" maxOccurs="1" type="xsd:string" ma:index="0" ma:displayName="Tipo de contenido"/>
<xsd:element ref="dc:title" minOccurs="0" maxOccurs="1" ma:index="4" ma:displayName="Título"/>
<xsd:element ref="dc:subject" minOccurs="0" maxOccurs="1"/>
<xsd:element ref="dc:description" minOccurs="0" maxOccurs="1"/>
<xsd:element name="keywords" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element ref="dc:language" minOccurs="0" maxOccurs="1"/>
<xsd:element name="category" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element name="version" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element name="revision" minOccurs="0" maxOccurs="1" type="xsd:string">
<xsd:annotation>
<xsd:documentation>
This value indicates the number of saves or revisions. The application is responsible for updating this value after each revision.
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="lastModifiedBy" minOccurs="0" maxOccurs="1" type="xsd:string"/>
<xsd:element ref="dcterms:modified" minOccurs="0" maxOccurs="1"/>
<xsd:element name="contentStatus" minOccurs="0" maxOccurs="1" type="xsd:string"/>
</xsd:all>
</xsd:complexType>
</xsd:schema>
<xs:schema targetNamespace="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Person">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:DisplayName" minOccurs="0"></xs:element>
<xs:element ref="pc:AccountId" minOccurs="0"></xs:element>
<xs:element ref="pc:AccountType" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="DisplayName" type="xs:string"></xs:element>
<xs:element name="AccountId" type="xs:string"></xs:element>
<xs:element name="AccountType" type="xs:string"></xs:element>
<xs:element name="BDCAssociatedEntity">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:BDCEntity" minOccurs="0" maxOccurs="unbounded"></xs:element>
</xs:sequence>
<xs:attribute ref="pc:EntityNamespace"></xs:attribute>
<xs:attribute ref="pc:EntityName"></xs:attribute>
<xs:attribute ref="pc:SystemInstanceName"></xs:attribute>
<xs:attribute ref="pc:AssociationName"></xs:attribute>
</xs:complexType>
</xs:element>
<xs:attribute name="EntityNamespace" type="xs:string"></xs:attribute>
<xs:attribute name="EntityName" type="xs:string"></xs:attribute>
<xs:attribute name="SystemInstanceName" type="xs:string"></xs:attribute>
<xs:attribute name="AssociationName" type="xs:string"></xs:attribute>
<xs:element name="BDCEntity">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:EntityDisplayName" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityInstanceReference" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId1" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId2" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId3" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId4" minOccurs="0"></xs:element>
<xs:element ref="pc:EntityId5" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="EntityDisplayName" type="xs:string"></xs:element>
<xs:element name="EntityInstanceReference" type="xs:string"></xs:element>
<xs:element name="EntityId1" type="xs:string"></xs:element>
<xs:element name="EntityId2" type="xs:string"></xs:element>
<xs:element name="EntityId3" type="xs:string"></xs:element>
<xs:element name="EntityId4" type="xs:string"></xs:element>
<xs:element name="EntityId5" type="xs:string"></xs:element>
<xs:element name="Terms">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:TermInfo" minOccurs="0" maxOccurs="unbounded"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="TermInfo">
<xs:complexType>
<xs:sequence>
<xs:element ref="pc:TermName" minOccurs="0"></xs:element>
<xs:element ref="pc:TermId" minOccurs="0"></xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="TermName" type="xs:string"></xs:element>
<xs:element name="TermId" type="xs:string"></xs:element>
</xs:schema>
</ct:contentTypeSchema>
1
thesis_output/plantilla_individual_files/item0021.xml
Normal file
@@ -0,0 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?><b:Sources SelectedStyle="\APASixthEditionOfficeOnline.xsl" StyleName="APA" Version="6" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"><b:Source><b:Tag>Dor81</b:Tag><b:SourceType>JournalArticle</b:SourceType><b:Guid>{D7C468B5-5E32-4254-9330-6DB2DDB01037}</b:Guid><b:Title>There's a S.M.A.R.T. way to write management's goals and objectives</b:Title><b:Year>1981</b:Year><b:Author><b:Author><b:NameList><b:Person><b:Last>Doran</b:Last><b:First>G.</b:First><b:Middle>T.</b:Middle></b:Person></b:NameList></b:Author></b:Author><b:JournalName>Management Review (AMA FORUM)</b:JournalName><b:Pages>35-36</b:Pages><b:Volume>70</b:Volume><b:RefOrder>1</b:RefOrder></b:Source></b:Sources>
1
thesis_output/plantilla_individual_files/item0023.xml
Normal file
@@ -0,0 +1 @@
<?mso-contentType?><FormTemplates xmlns="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"><Display>DocumentLibraryForm</Display><Edit>DocumentLibraryForm</Edit><New>DocumentLibraryForm</New></FormTemplates>
2
thesis_output/plantilla_individual_files/props002.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{B3A822E2-E694-47D5-9E22-DA4B12671ABB}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/contentType"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes"/><ds:schemaRef ds:uri="http://www.w3.org/2001/XMLSchema"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/documentManagement/types"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"/><ds:schemaRef ds:uri="http://purl.org/dc/elements/1.1/"/><ds:schemaRef ds:uri="http://purl.org/dc/terms/"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/internal/obd"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props004.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{3CBD5336-2C2D-4DA8-8EBD-C205328B54AF}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props006.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{DB456AF2-52F5-44D8-AEC6-B5F9D96C377E}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props008.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{BE74C307-52FE-48C3-92C2-E1552852BAAA}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props014.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{B3A822E2-E694-47D5-9E22-DA4B12671ABB}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/contentType"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes"/><ds:schemaRef ds:uri="http://www.w3.org/2001/XMLSchema"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/documentManagement/types"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"/><ds:schemaRef ds:uri="http://purl.org/dc/elements/1.1/"/><ds:schemaRef ds:uri="http://purl.org/dc/terms/"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/internal/obd"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props016.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{3CBD5336-2C2D-4DA8-8EBD-C205328B54AF}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props018.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{BE74C307-52FE-48C3-92C2-E1552852BAAA}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props020.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{B3A822E2-E694-47D5-9E22-DA4B12671ABB}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/contentType"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties/metaAttributes"/><ds:schemaRef ds:uri="http://www.w3.org/2001/XMLSchema"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/metadata/properties"/><ds:schemaRef ds:uri="0a70e875-3d35-4be2-921f-7117c31bab9b"/><ds:schemaRef ds:uri="27c1adeb-3674-457c-b08c-8a73f31b6e23"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/2006/documentManagement/types"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/office/infopath/2007/PartnerControls"/><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"/><ds:schemaRef ds:uri="http://purl.org/dc/elements/1.1/"/><ds:schemaRef ds:uri="http://purl.org/dc/terms/"/><ds:schemaRef ds:uri="http://schemas.microsoft.com/internal/obd"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props022.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{3CBD5336-2C2D-4DA8-8EBD-C205328B54AF}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/></ds:schemaRefs></ds:datastoreItem>
2
thesis_output/plantilla_individual_files/props024.xml
Normal file
@@ -0,0 +1,2 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{BE74C307-52FE-48C3-92C2-E1552852BAAA}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.microsoft.com/sharepoint/v3/contenttype/forms"/></ds:schemaRefs></ds:datastoreItem>