From 4fe661fbe5aa22bf72368e139771fe83a0524fe7 Mon Sep 17 00:00:00 2001 From: Sergio Jimenez Jimenez Date: Sun, 18 Jan 2026 07:05:27 +0100 Subject: [PATCH] docs --- src/paddle_ocr/README.md | 137 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 136 insertions(+), 1 deletion(-) diff --git a/src/paddle_ocr/README.md b/src/paddle_ocr/README.md index 99c3ebf..1cfaa36 100644 --- a/src/paddle_ocr/README.md +++ b/src/paddle_ocr/README.md @@ -182,7 +182,7 @@ This section documents GPU support findings based on testing on an NVIDIA DGX Sp | Windows x64 | ✅ | ✅ CUDA 10.2/11.x/12.x | | macOS x64 | ✅ | ❌ | | macOS ARM64 (M1/M2) | ✅ | ❌ | -| Linux ARM64 (Jetson/DGX) | ✅ | ✅ Custom wheel required | +| Linux ARM64 (Jetson/DGX) | ✅ | ⚠️ Limited - see Blackwell note | **Source:** [PaddlePaddle-GPU PyPI](https://pypi.org/project/paddlepaddle-gpu/) - only `manylinux_x86_64` and `win_amd64` wheels available on PyPI. ARM64 wheels must be built from source or downloaded from Gitea packages. @@ -199,6 +199,141 @@ ARM64 GPU support is available but requires custom-built wheels: - Build the wheel locally using `Dockerfile.build-paddle` (see Option 2 below), or - Download the wheel from Gitea packages: `wheels/paddlepaddle_gpu-3.0.0-cp311-cp311-linux_aarch64.whl` +### ⚠️ Known Limitation: Blackwell GPU (sm_121 / GB10) + +**Status: GPU inference does NOT work on NVIDIA Blackwell GPUs (DGX Spark, GB200, etc.)** + +#### Symptoms + +When running PaddleOCR on Blackwell GPUs: +- CUDA loads successfully ✅ +- Basic tensor operations work ✅ +- **Detection model outputs constant values** ❌ +- 0 text regions detected +- CER/WER = 100% (nothing recognized) + +#### Root Cause + +PaddleOCR uses **pre-compiled inference models** (PP-OCRv4_mobile_det, PP-OCRv5_server_det, etc.) that contain embedded CUDA kernels. These kernels were compiled for older GPU architectures (sm_80 Ampere, sm_90 Hopper) and **do not support Blackwell (sm_121)**. 
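+
+One way to check this on a given build is NVIDIA's `cuobjdump`, which lists the GPU code embedded in a binary. This is a sketch, not part of the repo: it assumes the CUDA toolkit is available inside the container, and the `find` pattern is only a guess at where pip places Paddle's native library:
+
+```bash
+# Locate Paddle's native CUDA library (install path varies by image)
+LIB=$(find / -name 'libpaddle*.so' 2>/dev/null | head -n 1)
+
+# List embedded SASS (cubin) architectures - expect sm_80/sm_90 entries, no sm_121
+cuobjdump --list-elf "$LIB"
+
+# List embedded PTX - empty output means no JIT fallback for newer GPUs
+cuobjdump --list-ptx "$LIB"
+```
+
+If `--list-elf` shows only older `sm_*` targets and `--list-ptx` prints nothing, the binary has no code path that can run correctly on sm_121 hardware.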
+ +**Why building PaddlePaddle from source doesn't fix it:** + +1. ✅ You can build `paddlepaddle-gpu` with `CUDA_ARCH=121` - this creates a Blackwell-compatible framework +2. ❌ But the **PaddleOCR inference models** (`.pdiparams`, `.pdmodel` files) contain pre-compiled CUDA ops +3. ❌ These model files were exported/compiled targeting sm_80/sm_90 architectures +4. ❌ The model kernels execute on GPU but produce garbage output on sm_121 + +**To truly fix this**, the PaddlePaddle team would need to: +1. Add sm_121 to their model export pipeline +2. Re-export all PaddleOCR models (PP-OCRv4, PP-OCRv5, etc.) with Blackwell support +3. Release new model versions + +This is tracked in [GitHub Issue #17327](https://github.com/PaddlePaddle/PaddleOCR/issues/17327). + +#### Debug Script + +Use the included debug script to verify this issue: + +```bash +docker exec paddle-ocr-gpu python /app/scripts/debug_gpu_detection.py /app/dataset/0/img/page_0001.png +``` + +Expected output showing the problem: +``` +OUTPUT ANALYSIS: + Shape: (1, 1, 640, 640) + Min: 0.000010 + Max: 0.000010 # <-- Same as min = constant output + Mean: 0.000010 + +DIAGNOSIS: + PROBLEM: Output is constant - model inference is broken! + This typically indicates GPU compute capability mismatch. +``` + +#### Workarounds + +1. **Use CPU mode** (recommended): + ```bash + docker compose up ocr-cpu + ``` + The ARM Grace CPU is fast (~2-5 sec/page). This is the reliable option. + +2. **Use EasyOCR or DocTR with GPU**: + These use PyTorch which has official ARM64 CUDA wheels (cu128 index): + ```bash + # EasyOCR with GPU on DGX Spark + docker build -f ../easyocr_service/Dockerfile.gpu -t easyocr-gpu ../easyocr_service + docker run --gpus all -p 8002:8000 easyocr-gpu + ``` + +3. **Wait for PaddlePaddle Blackwell support**: + Track [GitHub Issue #17327](https://github.com/PaddlePaddle/PaddleOCR/issues/17327) for updates. 
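+
+As a quick sanity check for workaround 2, you can confirm that PyTorch (unlike the pre-built Paddle models) actually drives the Blackwell GPU. A sketch, assuming a PyTorch ARM64 CUDA wheel from the cu128 index is installed:
+
+```bash
+# is_available() should print True; get_device_capability() should report
+# (12, 1), i.e. sm_121, on a DGX Spark GB10
+python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_capability())"
+```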
+
+#### GPU Support Matrix (Updated)
+
+| GPU Architecture | Compute | CPU | GPU |
+|------------------|---------|-----|-----|
+| Ampere (A100, A10) | sm_80 | ✅ | ✅ |
+| Hopper (H100, H200) | sm_90 | ✅ | ✅ |
+| **Blackwell (GB10, GB200)** | sm_121 | ✅ | ❌ Not supported |
+
+#### FAQ: Why Doesn't CUDA Backward Compatibility Work?
+
+**Q: CUDA normally runs older kernels on newer GPUs. Why doesn't this work for Blackwell?**
+
+Per the [NVIDIA Blackwell Compatibility Guide](https://docs.nvidia.com/cuda/blackwell-compatibility-guide/):
+
+CUDA **can** run older code on newer GPUs via **PTX JIT compilation**:
+1. PTX (Parallel Thread Execution) is NVIDIA's intermediate representation
+2. If an app includes PTX code, the driver JIT-compiles it for the target GPU
+3. This allows sm_80 code to run on sm_121
+
+**The problem**: the PaddleOCR inference models contain only pre-compiled **cubins** (SASS binaries), not PTX. Without PTX, there is nothing to JIT-compile.
+
+You can test whether the models ship PTX:
+```bash
+# Force PTX JIT compilation
+docker run --gpus all -e CUDA_FORCE_PTX_JIT=1 paddle-ocr-gpu \
+  python /app/scripts/debug_gpu_detection.py /app/dataset/0/img/page_0001.png
+```
+- If the output is still constant → no PTX in the models (confirms the diagnosis)
+- If the output varies → PTX worked
+
+**Note on sm_121**: Per NVIDIA docs, "sm_121 is the same as sm_120 since the only difference is physically integrated CPU+GPU memory of Spark." The issue is therefore general Blackwell (sm_12x) support, not Spark-specific.
+
+#### FAQ: Can I Run AMD64 Containers on ARM64 DGX Spark?
+
+**Q: Can I just run the working x86_64 GPU image via emulation?**
+
+**Short answer: yes for CPU, no for GPU.**
+
+You can run amd64 containers via QEMU emulation:
+```bash
+# Install QEMU
+sudo apt-get install qemu binfmt-support qemu-user-static
+docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
+
+# Run amd64 container
+docker run --platform linux/amd64 paddle-ocr-gpu:amd64 ...
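+
+# (Sketch) quick sanity check that the emulation is actually in effect:
+# inside the emulated container, uname should report an x86_64 userland
+docker run --rm --platform linux/amd64 paddle-ocr-gpu:amd64 uname -m   # x86_64
+uname -m   # aarch64 (the real host architecture)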
+```
+
+**But GPU doesn't work:**
+- QEMU translates CPU instructions (x86 → ARM)
+- **QEMU user-mode does NOT support GPU passthrough**
+- GPU calls from emulated x86 code cannot reach the ARM64 GPU
+
+So even though the amd64 image runs fine on real x86_64 hosts, under emulation you get:
+- ❌ No GPU access through QEMU
+- ❌ CPU emulation that is 10-100x slower than native ARM64
+- ❌ A setup that defeats the purpose entirely
+
+| Approach | CPU | GPU | Speed |
+|----------|-----|-----|-------|
+| ARM64 native (CPU) | ✅ | N/A | Fast (~2-5s/page) |
+| ARM64 native (GPU) | ✅ | ❌ Blackwell issue | - |
+| AMD64 via QEMU | ⚠️ Works | ❌ No passthrough | 10-100x slower |
+
 ### Options for ARM64 Systems
 
 #### Option 1: CPU-Only (Recommended)