PaddleOCR GPU support #4

This section documents GPU support findings based on testing on an NVIDIA DGX Spark.

| Platform | CPU | GPU |
|----------|-----|-----|
| Windows x64 | ✅ | ✅ CUDA 10.2/11.x/12.x |
| macOS x64 | ✅ | ❌ |
| macOS ARM64 (M1/M2) | ✅ | ❌ |
| Linux ARM64 (Jetson/DGX) | ✅ | ⚠️ Limited - see Blackwell note |

**Source:** [PaddlePaddle-GPU PyPI](https://pypi.org/project/paddlepaddle-gpu/) - only `manylinux_x86_64` and `win_amd64` wheels are available on PyPI. ARM64 wheels must be built from source or downloaded from Gitea packages.
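
One way to check this yourself from any machine (a sketch; the exact error text varies by pip version):

```bash
# Ask pip for an aarch64 wheel; expect "No matching distribution found"
pip download paddlepaddle-gpu \
    --platform manylinux2014_aarch64 --only-binary=:all: \
    -d /tmp/wheels
```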

ARM64 GPU support is available but requires custom-built wheels:

- Build the wheel locally using `Dockerfile.build-paddle` (see Option 2 below), or
- Download the wheel from Gitea packages: `wheels/paddlepaddle_gpu-3.0.0-cp311-cp311-linux_aarch64.whl`
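
Either way, installation plus a basic sanity check looks like this (a sketch; the wheel path must match whatever you built or downloaded):

```bash
# Install the custom ARM64 wheel into the target environment
pip install ./paddlepaddle_gpu-3.0.0-cp311-cp311-linux_aarch64.whl

# run_check() reports whether Paddle can see and use the GPU
python -c "import paddle; paddle.utils.run_check()"
```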

### ⚠️ Known Limitation: Blackwell GPU (sm_121 / GB10)

**Status: GPU inference does NOT work on NVIDIA Blackwell GPUs (DGX Spark, GB200, etc.)**

#### Symptoms

When running PaddleOCR on Blackwell GPUs:

- CUDA loads successfully ✅
- Basic tensor operations work ✅
- **Detection model outputs constant values** ❌
- 0 text regions detected
- CER/WER = 100% (nothing recognized)
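
The first two symptoms are easy to reproduce in isolation; a minimal sketch, using the same container as the debug script below:

```bash
# Framework-level CUDA works on Blackwell - only the models misbehave
docker exec paddle-ocr-gpu python -c "
import paddle
paddle.device.set_device('gpu')
x = paddle.ones([2, 2])
print(paddle.matmul(x, x))                          # valid 2x2 result
print(paddle.device.cuda.get_device_capability())   # e.g. (12, 1)
"
```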

#### Root Cause

PaddleOCR uses **pre-compiled inference models** (PP-OCRv4_mobile_det, PP-OCRv5_server_det, etc.) that contain embedded CUDA kernels. These kernels were compiled for older GPU architectures (sm_80 Ampere, sm_90 Hopper) and **do not support Blackwell (sm_121)**.
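
You can confirm the mismatch from the driver side (a sketch; the `compute_cap` query field needs a reasonably recent driver):

```bash
# Report the GPU's compute capability, e.g. "NVIDIA GB10, 12.1" on DGX Spark
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```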

**Why building PaddlePaddle from source doesn't fix it:**

1. ✅ You can build `paddlepaddle-gpu` with `CUDA_ARCH=121` - this creates a Blackwell-compatible framework (see the build sketch after this list)
2. ❌ But the **PaddleOCR inference models** (`.pdiparams`, `.pdmodel` files) contain pre-compiled CUDA ops
3. ❌ These model files were exported/compiled targeting sm_80/sm_90 architectures
4. ❌ The model kernels execute on GPU but produce garbage output on sm_121
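
For reference, the framework build in step 1 boils down to Paddle's standard CMake architecture flags; a sketch, assuming a source checkout with a configured build directory (this is roughly what `Dockerfile.build-paddle` automates):

```bash
# Build the Paddle framework itself for Blackwell (sm_121).
# This fixes the framework, not the pre-exported inference models.
cmake .. -DWITH_GPU=ON \
         -DCUDA_ARCH_NAME=Manual \
         -DCUDA_ARCH_BIN="121"
make -j"$(nproc)"
```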

**To truly fix this**, the PaddlePaddle team would need to:

1. Add sm_121 to their model export pipeline
2. Re-export all PaddleOCR models (PP-OCRv4, PP-OCRv5, etc.) with Blackwell support
3. Release new model versions

This is tracked in [GitHub Issue #17327](https://github.com/PaddlePaddle/PaddleOCR/issues/17327).

#### Debug Script

Use the included debug script to verify this issue:

```bash
docker exec paddle-ocr-gpu python /app/scripts/debug_gpu_detection.py /app/dataset/0/img/page_0001.png
```

Expected output showing the problem:

```
OUTPUT ANALYSIS:
Shape: (1, 1, 640, 640)
Min: 0.000010
Max: 0.000010   # <-- Same as min = constant output
Mean: 0.000010

DIAGNOSIS:
PROBLEM: Output is constant - model inference is broken!
This typically indicates GPU compute capability mismatch.
```

#### Workarounds

1. **Use CPU mode** (recommended):

   ```bash
   docker compose up ocr-cpu
   ```

   The ARM Grace CPU is fast (~2-5 sec/page). This is the reliable option.

2. **Use EasyOCR or DocTR with GPU**:

   These use PyTorch, which has official ARM64 CUDA wheels (cu128 index):

   ```bash
   # EasyOCR with GPU on DGX Spark
   docker build -f ../easyocr_service/Dockerfile.gpu -t easyocr-gpu ../easyocr_service
   docker run --gpus all -p 8002:8000 easyocr-gpu
   ```

3. **Wait for PaddlePaddle Blackwell support**:

   Track [GitHub Issue #17327](https://github.com/PaddlePaddle/PaddleOCR/issues/17327) for updates.

#### GPU Support Matrix (Updated)

| GPU Architecture | Compute | CPU | GPU |
|------------------|---------|-----|-----|
| Ampere (A100, A10) | sm_80 | ✅ | ✅ |
| Hopper (H100, H200) | sm_90 | ✅ | ✅ |
| **Blackwell (GB10, GB200)** | sm_121 | ✅ | ❌ Not supported |

#### FAQ: Why Doesn't CUDA Backward Compatibility Work?

**Q: CUDA normally runs older kernels on newer GPUs. Why doesn't this work for Blackwell?**

Per the [NVIDIA Blackwell Compatibility Guide](https://docs.nvidia.com/cuda/blackwell-compatibility-guide/):

CUDA **can** run older code on newer GPUs via **PTX JIT compilation**:

1. PTX (Parallel Thread Execution) is NVIDIA's intermediate representation
2. If an app includes PTX code, the driver JIT-compiles it for the target GPU
3. This allows sm_80 code to run on sm_121

**The problem**: PaddleOCR inference models contain only pre-compiled **cubins** (SASS binary), not PTX. Without PTX, there's nothing to JIT-compile.

You can test if PTX exists:

```bash
# Force PTX JIT compilation
docker run --gpus all -e CUDA_FORCE_PTX_JIT=1 paddle-ocr-gpu \
    python /app/scripts/debug_gpu_detection.py /app/dataset/0/img/page_0001.png
```

- If output is still constant → no PTX in the models (confirmed)
- If output varies → PTX worked
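
You can also look for PTX statically with `cuobjdump` (a sketch; the library path is an assumption - point it at whichever Paddle binary you want to inspect):

```bash
# List embedded PTX entries; empty output means SASS-only, nothing to JIT
cuobjdump --list-ptx /usr/local/lib/python3.11/site-packages/paddle/libs/*.so | head
```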

**Note on sm_121**: Per NVIDIA docs, "sm_121 is the same as sm_120 since the only difference is physically integrated CPU+GPU memory of Spark." The issue is general Blackwell (sm_12x) support, not Spark-specific.

#### FAQ: Can I Run AMD64 Containers on ARM64 DGX Spark?

**Q: Can I just run the working x86_64 GPU image via emulation?**

**Short answer: Yes for CPU, No for GPU.**

You can run amd64 containers via QEMU emulation:

```bash
# Install QEMU
sudo apt-get install qemu binfmt-support qemu-user-static
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Run amd64 container
docker run --platform linux/amd64 paddle-ocr-gpu:amd64 ...
```
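
A quick way to confirm the emulation layer is active, using a stock image:

```bash
docker run --platform linux/amd64 --rm alpine uname -m   # prints x86_64
```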

**But GPU doesn't work:**

- QEMU emulates CPU instructions (x86 → ARM)
- **QEMU user-mode does NOT support GPU passthrough**
- GPU calls from emulated x86 code cannot reach the ARM64 GPU

So even if the amd64 image works on x86_64:

- ❌ No GPU access through QEMU
- ❌ CPU emulation is 10-100x slower than native ARM64
- ❌ Defeats the purpose entirely

| Approach | CPU | GPU | Speed |
|----------|-----|-----|-------|
| ARM64 native (CPU) | ✅ | N/A | Fast (~2-5s/page) |
| ARM64 native (GPU) | ✅ | ❌ Blackwell issue | - |
| AMD64 via QEMU | ⚠️ Works | ❌ No passthrough | 10-100x slower |

### Options for ARM64 Systems

#### Option 1: CPU-Only (Recommended)