docs

2026-01-18 18:54:34 +01:00
parent e2cca72cf2
commit 458ff5d831
4 changed files with 161 additions and 11 deletions
--- a/src/README.md
+++ b/src/README.md
@@ -2,25 +2,31 @@

 ## Quick: Check Ray Tune Progress

-**Current run:** PaddleOCR hyperparameter optimization via Ray Tune + Optuna.
- 64 trials searching for optimal detection/recognition thresholds
- 2 CPU workers running in parallel (Docker containers on ports 8001-8002)
- Notebook: `paddle_ocr_raytune_rest.ipynb` → `output_raytune.ipynb`
- Results saved to: `~/ray_results/trainable_paddle_ocr_2026-01-18_17-25-43/`
-
 ```bash
-# Is it still running?
+# Is papermill still running?
 ps aux | grep papermill | grep -v grep

 # View live log
 tail -f papermill.log

-# Count completed trials (64 total)
-find ~/ray_results/trainable_paddle_ocr_2026-01-18_17-25-43/ -name "result.json" ! -empty | wc -l
+# Find the latest Ray Tune run; count trials with a non-empty result.json against the trial directories that exist so far
+LATEST=$(ls -td ~/ray_results/trainable_* 2>/dev/null | head -1)
+echo "Run: $LATEST"
+COMPLETED=$(find "$LATEST" -name "result.json" -size +0 2>/dev/null | wc -l)
+TOTAL=$(ls -d "$LATEST"/trainable_*/ 2>/dev/null | wc -l)
+echo "Completed: $COMPLETED / $TOTAL"

 # Check workers are healthy
-curl -s localhost:8001/health | jq -r '.status'
-curl -s localhost:8002/health | jq -r '.status'
+for port in 8001 8002 8003; do
+  status=$(curl -s "localhost:$port/health" 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','down'))" 2>/dev/null || echo "down")
+  echo "Worker $port: $status"
+done
+
+# Show best result so far
+if [ "$COMPLETED" -gt 0 ]; then
+  find "$LATEST" -name "result.json" -size +0 -exec cat {} \; 2>/dev/null | \
+    python3 -c "import sys,json; results=[json.loads(l) for l in sys.stdin if l.strip()]; best=min(results,key=lambda x:x.get('CER',999)); print(f'Best CER: {best[\"CER\"]:.4f}, WER: {best[\"WER\"]:.4f}')" 2>/dev/null
+fi
 ```

 ---
--- a/src/doctr_service/README.md
+++ b/src/doctr_service/README.md
@@ -101,6 +101,55 @@ Run OCR evaluation with given hyperparameters.

 **Note:** `model_reinitialized` indicates if the model was reloaded due to changed processing flags (adds ~2-5s overhead).

+## Debug Output (debugset)
+
+The `debugset` folder stores OCR predictions for debugging and analysis. When a request to `/evaluate` sets `"save_output": true`, the service writes its predictions to `/app/debugset`.
+
+### Enable Debug Output
+
+```json
+{
+  "pdf_folder": "/app/dataset",
+  "save_output": true,
+  "start_page": 5,
+  "end_page": 10
+}
+```
+
+### Output Structure
+
+```
+debugset/
+├── doc1/
+│   └── doctr/
+│       ├── page_0005.txt
+│       ├── page_0006.txt
+│       └── ...
+├── doc2/
+│   └── doctr/
+│       └── ...
+```
+
+Each `.txt` file contains the OCR-extracted text for that page.
+
+### Docker Mount
+
+Add the `debugset` volume to your `docker run` command:
+
+```bash
+docker run -d -p 8003:8000 \
+  -v "$(pwd)/../dataset:/app/dataset:ro" \
+  -v "$(pwd)/../debugset:/app/debugset:rw" \
+  -v doctr-cache:/root/.cache/doctr \
+  doctr-api:cpu
+```
+
+### Use Cases
+
+- **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
+- **Debug hyperparameters**: See how different settings affect text extraction
+- **Ground truth comparison**: Compare predictions against expected output
+
 ## Hyperparameters

 ### Processing Flags (Require Model Reinitialization)
--- a/src/easyocr_service/README.md
+++ b/src/easyocr_service/README.md
@@ -96,6 +96,55 @@ Run OCR evaluation with given hyperparameters.
 {"CER": 0.0234, "WER": 0.1156, "TIME": 45.2, "PAGES": 5, "TIME_PER_PAGE": 9.04}
 ```

+## Debug Output (debugset)
+
+The `debugset` folder stores OCR predictions for debugging and analysis. When a request to `/evaluate` sets `"save_output": true`, the service writes its predictions to `/app/debugset`.
+
+### Enable Debug Output
+
+```json
+{
+  "pdf_folder": "/app/dataset",
+  "save_output": true,
+  "start_page": 5,
+  "end_page": 10
+}
+```
+
+### Output Structure
+
+```
+debugset/
+├── doc1/
+│   └── easyocr/
+│       ├── page_0005.txt
+│       ├── page_0006.txt
+│       └── ...
+├── doc2/
+│   └── easyocr/
+│       └── ...
+```
+
+Each `.txt` file contains the OCR-extracted text for that page.
+
+### Docker Mount
+
+Add the `debugset` volume to your `docker run` command:
+
+```bash
+docker run -d -p 8002:8000 \
+  -v "$(pwd)/../dataset:/app/dataset:ro" \
+  -v "$(pwd)/../debugset:/app/debugset:rw" \
+  -v easyocr-cache:/root/.EasyOCR \
+  easyocr-api:cpu
+```
+
+### Use Cases
+
+- **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
+- **Debug hyperparameters**: See how different settings affect text extraction
+- **Ground truth comparison**: Compare predictions against expected output
+
 ## Hyperparameters

 ### Detection (CRAFT Algorithm)
--- a/src/paddle_ocr/README.md
+++ b/src/paddle_ocr/README.md
@@ -110,6 +110,52 @@ Run OCR evaluation with given hyperparameters.
 ### `POST /evaluate_full`
 Same as `/evaluate` but runs on ALL pages (ignores start_page/end_page).

+## Debug Output (debugset)
+
+The `debugset` folder stores OCR predictions for debugging and analysis. When a request to `/evaluate` sets `"save_output": true`, the service writes its predictions to `/app/debugset`.
+
+### Enable Debug Output
+
+```json
+{
+  "pdf_folder": "/app/dataset",
+  "save_output": true,
+  "start_page": 5,
+  "end_page": 10
+}
+```
+
+### Output Structure
+
+```
+debugset/
+├── doc1/
+│   └── paddle_ocr/
+│       ├── page_0005.txt
+│       ├── page_0006.txt
+│       └── ...
+├── doc2/
+│   └── paddle_ocr/
+│       └── ...
+```
+
+Each `.txt` file contains the OCR-extracted text for that page.
+
+### Docker Mount
+
+The `debugset` folder is mounted read-write in docker-compose:
+
+```yaml
+volumes:
+  - ../debugset:/app/debugset:rw
+```
+
+### Use Cases
+
+- **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
+- **Debug hyperparameters**: See how different settings affect text extraction
+- **Ground truth comparison**: Compare predictions against expected output
+
 ## Building Images

 ### CPU Image (Multi-Architecture)
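
The "Compare OCR engines" use case in the three READMEs above ends at "diff the results" without showing how. The following is a minimal sketch of that step, not part of this commit. It assumes only the `debugset/<doc>/<engine>/page_NNNN.txt` layout documented above; the helper names `page_path` and `diff_engines` are hypothetical, and the files are assumed to be UTF-8 text.

```python
import difflib
from pathlib import Path


def page_path(root: Path, doc: str, engine: str, page: int) -> Path:
    """Build the path of one saved page (hypothetical helper).

    Layout taken from the READMEs: <root>/<doc>/<engine>/page_NNNN.txt,
    with the page number zero-padded to four digits.
    """
    return root / doc / engine / f"page_{page:04d}.txt"


def diff_engines(root, doc, page, engine_a, engine_b):
    """Return unified-diff lines between two engines' text for one page."""
    path_a = page_path(Path(root), doc, engine_a, page)
    path_b = page_path(Path(root), doc, engine_b, page)
    # Assumption: the services write UTF-8 text files.
    lines_a = path_a.read_text(encoding="utf-8").splitlines()
    lines_b = path_b.read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            lines_a,
            lines_b,
            fromfile=str(path_a),
            tofile=str(path_b),
            lineterm="",
        )
    )
```

For example, `print("\n".join(diff_engines("../debugset", "doc1", 5, "doctr", "easyocr")))` would show where the two engines disagree on page 5 of `doc1`, provided both services were run with `save_output` enabled on that page.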