docs

2026-01-18 18:54:34 +01:00
parent e2cca72cf2
commit 458ff5d831
4 changed files with 161 additions and 11 deletions
--- a/src/README.md
+++ b/src/README.md
@@ -2,25 +2,31 @@
 ## Quick: Check Ray Tune Progress
 **Current run:** PaddleOCR hyperparameter optimization via Ray Tune + Optuna.
 - 64 trials searching for optimal detection/recognition thresholds
 - 2 CPU workers running in parallel (Docker containers on ports 8001-8002)
 - Notebook: `paddle_ocr_raytune_rest.ipynb` → `output_raytune.ipynb`
 - Results saved to: `~/ray_results/trainable_paddle_ocr_2026-01-18_17-25-43/`
 ```bash
-# Is it still running?
+# Is papermill still running?
 ps aux | grep papermill | grep -v grep
 # View live log
 tail -f papermill.log
-# Count completed trials (64 total)
+# Find latest Ray Tune run and count completed trials
-find ~/ray_results/trainable_paddle_ocr_2026-01-18_17-25-43/ -name "result.json" ! -empty | wc -l
+LATEST=$(ls -td ~/ray_results/trainable_* 2>/dev/null | head -1)
 echo "Run: $LATEST"
 COMPLETED=$(find "$LATEST" -name "result.json" -size +0 2>/dev/null | wc -l)
 TOTAL=$(ls -d "$LATEST"/trainable_*/ 2>/dev/null | wc -l)
 echo "Completed: $COMPLETED / $TOTAL"
 # Check workers are healthy
-curl -s localhost:8001/health | jq -r '.status'
+for port in 8001 8002 8003; do
-curl -s localhost:8002/health | jq -r '.status'
+  status=$(curl -s "localhost:$port/health" 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','down'))" 2>/dev/null || echo "down")
  echo "Worker $port: $status"
 done
 # Show best result so far
 if [ "$COMPLETED" -gt 0 ]; then
  find "$LATEST" -name "result.json" -size +0 -exec cat {} \; 2>/dev/null | \
    python3 -c "import sys,json; results=[json.loads(l) for l in sys.stdin if l.strip()]; best=min(results,key=lambda x:x.get('CER',999)); print(f'Best CER: {best[\"CER\"]:.4f}, WER: {best[\"WER\"]:.4f}')" 2>/dev/null
 fi
 ```
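The same summary can be produced from Python, for example from inside the notebook. This is a minimal sketch, not the project's own tooling. It assumes, as the shell snippet above does, that each trial directory holds a `result.json` with one JSON object per line and that reported records carry a `CER` key. The function name `summarize_run` is hypothetical.

```python
import json
from pathlib import Path


def summarize_run(run_dir):
    """Count trials with results and return the lowest-CER record in a run directory."""
    completed = 0
    best = None
    for result_file in Path(run_dir).rglob("result.json"):
        if result_file.stat().st_size == 0:
            continue  # trial directory exists but nothing has been reported yet
        completed += 1
        for line in result_file.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            # Assumption: records without a CER key are skipped, not treated as errors.
            if "CER" not in record:
                continue
            if best is None or record["CER"] < best["CER"]:
                best = record
    return completed, best
```

Calling `summarize_run` on the newest directory under `~/ray_results/` returns the completed-trial count and the best record, or `None` as the record when no trial has reported metrics yet.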
 ---
--- a/src/doctr_service/README.md
+++ b/src/doctr_service/README.md
@@ -101,6 +101,55 @@ Run OCR evaluation with given hyperparameters.
 **Note:** `model_reinitialized` indicates if the model was reloaded due to changed processing flags (adds ~2-5s overhead).
 ## Debug Output (debugset)
 The `debugset` folder stores OCR predictions for debugging and analysis. When the `/evaluate` request body sets `save_output` to `true`, predictions are written to `/app/debugset` inside the container.
 ### Enable Debug Output
 ```json
 {
  "pdf_folder": "/app/dataset",
  "save_output": true,
  "start_page": 5,
  "end_page": 10
 }
 ```
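As a rough illustration, the request body above could be sent from Python with only the standard library. This sketch assumes the service accepts a JSON body on `POST /evaluate` and is reachable on `localhost:8003`, the host port used in the `docker run` example in this README. The helper name `build_evaluate_request` is hypothetical.

```python
import json
import urllib.request


def build_evaluate_request(base_url, pdf_folder, start_page, end_page, save_output=True):
    """Build (but do not send) a POST request for the /evaluate endpoint."""
    payload = {
        "pdf_folder": pdf_folder,
        "save_output": save_output,  # Python True is serialized as JSON true
        "start_page": start_page,
        "end_page": end_page,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: the container is published on host port 8003.
request = build_evaluate_request("http://localhost:8003", "/app/dataset", 5, 10)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` sends the request. Building it separately makes the payload easy to inspect before a long evaluation run starts.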
 ### Output Structure
 ```
 debugset/
 ├── doc1/
 │   └── doctr/
 │       ├── page_0005.txt
 │       ├── page_0006.txt
 │       └── ...
 ├── doc2/
 │   └── doctr/
 │       └── ...
 ```
 Each `.txt` file contains the OCR-extracted text for that page.
 ### Docker Mount
 Add the `debugset` volume to your `docker run` command:
 ```bash
 docker run -d -p 8003:8000 \
  -v "$(pwd)/../dataset:/app/dataset:ro" \
  -v "$(pwd)/../debugset:/app/debugset:rw" \
  -v doctr-cache:/root/.cache/doctr \
  doctr-api:cpu
 ```
 ### Use Cases
 - **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
 - **Debug hyperparameters**: See how different settings affect text extraction
 - **Ground truth comparison**: Compare predictions against expected output
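The engine comparison can be sketched with Python's `difflib`. This is an illustration under stated assumptions: the folder layout and `page_NNNN.txt` naming follow the output structure shown above, the files are assumed to be UTF-8 encoded, and the helper name `diff_engines` is hypothetical.

```python
import difflib
from pathlib import Path


def diff_engines(debugset, doc, page, engine_a, engine_b):
    """Return a unified diff of one page's text as saved by two engines."""
    name = f"page_{page:04d}.txt"
    path_a = Path(debugset) / doc / engine_a / name
    path_b = Path(debugset) / doc / engine_b / name
    # Assumption: the services write UTF-8 text files.
    lines_a = path_a.read_text(encoding="utf-8").splitlines()
    lines_b = path_b.read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=str(path_a), tofile=str(path_b), lineterm=""
        )
    )
```

For example, `diff_engines("debugset", "doc1", 5, "doctr", "easyocr")` returns an empty list when both engines produced identical text for that page.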
 ## Hyperparameters
 ### Processing Flags (Require Model Reinitialization)
--- a/src/easyocr_service/README.md
+++ b/src/easyocr_service/README.md
@@ -96,6 +96,55 @@ Run OCR evaluation with given hyperparameters.
 {"CER": 0.0234, "WER": 0.1156, "TIME": 45.2, "PAGES": 5, "TIME_PER_PAGE": 9.04}
 ```
 ## Debug Output (debugset)
 The `debugset` folder stores OCR predictions for debugging and analysis. When the `/evaluate` request body sets `save_output` to `true`, predictions are written to `/app/debugset` inside the container.
 ### Enable Debug Output
 ```json
 {
  "pdf_folder": "/app/dataset",
  "save_output": true,
  "start_page": 5,
  "end_page": 10
 }
 ```
 ### Output Structure
 ```
 debugset/
 ├── doc1/
 │   └── easyocr/
 │       ├── page_0005.txt
 │       ├── page_0006.txt
 │       └── ...
 ├── doc2/
 │   └── easyocr/
 │       └── ...
 ```
 Each `.txt` file contains the OCR-extracted text for that page.
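To check which pages were actually written, the layout above can be walked from Python. This is a small sketch that assumes the `debugset/<document>/<engine>/page_NNNN.txt` structure shown in this section; the helper name `list_saved_pages` is hypothetical.

```python
import re
from pathlib import Path

PAGE_PATTERN = re.compile(r"page_(\d+)\.txt$")


def list_saved_pages(debugset, engine):
    """Map each document folder to the sorted page numbers saved for one engine."""
    pages = {}
    for doc_dir in sorted(Path(debugset).iterdir()):
        engine_dir = doc_dir / engine
        if not engine_dir.is_dir():
            continue  # this document has no output from the requested engine
        numbers = []
        for txt in engine_dir.glob("page_*.txt"):
            match = PAGE_PATTERN.match(txt.name)
            if match:
                numbers.append(int(match.group(1)))
        pages[doc_dir.name] = sorted(numbers)
    return pages
```

A request with `start_page` 5 and `end_page` 10 should then show up as a contiguous range of page numbers for each document, which makes missing pages easy to spot.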
 ### Docker Mount
 Add the `debugset` volume to your `docker run` command:
 ```bash
 docker run -d -p 8002:8000 \
  -v "$(pwd)/../dataset:/app/dataset:ro" \
  -v "$(pwd)/../debugset:/app/debugset:rw" \
  -v easyocr-cache:/root/.EasyOCR \
  easyocr-api:cpu
 ```
 ### Use Cases
 - **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
 - **Debug hyperparameters**: See how different settings affect text extraction
 - **Ground truth comparison**: Compare predictions against expected output
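For the ground truth comparison, a saved page can be scored against its expected text with a character error rate. The sketch below uses the textbook definition (Levenshtein distance divided by the reference length). It is an assumption that this matches how the service computes `CER`; the service may normalize text or rely on a library, so values can differ. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, prediction):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, prediction) / len(reference)
```

Reading a `page_NNNN.txt` file from `debugset` and passing its text as `prediction` gives a per-page score that can be compared across hyperparameter settings.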
 ## Hyperparameters
 ### Detection (CRAFT Algorithm)
--- a/src/paddle_ocr/README.md
+++ b/src/paddle_ocr/README.md
@@ -110,6 +110,52 @@ Run OCR evaluation with given hyperparameters.
 ### `POST /evaluate_full`
 Same as `/evaluate` but runs on ALL pages (ignores start_page/end_page).
 ## Debug Output (debugset)
 The `debugset` folder stores OCR predictions for debugging and analysis. When the `/evaluate` request body sets `save_output` to `true`, predictions are written to `/app/debugset` inside the container.
 ### Enable Debug Output
 ```json
 {
  "pdf_folder": "/app/dataset",
  "save_output": true,
  "start_page": 5,
  "end_page": 10
 }
 ```
 ### Output Structure
 ```
 debugset/
 ├── doc1/
 │   └── paddle_ocr/
 │       ├── page_0005.txt
 │       ├── page_0006.txt
 │       └── ...
 ├── doc2/
 │   └── paddle_ocr/
 │       └── ...
 ```
 Each `.txt` file contains the OCR-extracted text for that page.
 ### Docker Mount
 The `debugset` folder is mounted read-write in docker-compose:
 ```yaml
 volumes:
  - ../debugset:/app/debugset:rw
 ```
 ### Use Cases
 - **Compare OCR engines**: Run the same pages through PaddleOCR, DocTR, and EasyOCR with `save_output` enabled, then diff the results
 - **Debug hyperparameters**: See how different settings affect text extraction
 - **Ground truth comparison**: Compare predictions against expected output
 ## Building Images
 ### CPU Image (Multi-Architecture)