{ "cells": [ { "cell_type": "markdown", "id": "header", "metadata": { "papermill": { "duration": 0.00208, "end_time": "2026-01-18T07:22:47.796550", "exception": false, "start_time": "2026-01-18T07:22:47.794470", "status": "completed" }, "tags": [] }, "source": [ "# PaddleOCR Hyperparameter Optimization via REST API\n", "\n", "This notebook runs Ray Tune hyperparameter search calling the PaddleOCR REST API (Docker container).\n", "\n", "**Benefits:**\n", "- No model reload per trial - Model stays loaded in Docker container\n", "- Faster trials - Skip ~10s model load time per trial\n", "- Cleaner code - REST API replaces subprocess + CLI arg parsing" ] }, { "cell_type": "markdown", "id": "prereq", "metadata": { "papermill": { "duration": 0.000961, "end_time": "2026-01-18T07:22:47.807230", "exception": false, "start_time": "2026-01-18T07:22:47.806269", "status": "completed" }, "tags": [] }, "source": [ "## Prerequisites\n", "\n", "Start 2 PaddleOCR workers for parallel hyperparameter tuning:\n", "\n", "```bash\n", "cd src/paddle_ocr\n", "docker compose -f docker-compose.workers.yml up\n", "```\n", "\n", "This starts 2 GPU workers on ports 8001-8002, allowing 2 concurrent trials.\n", "\n", "For CPU-only systems:\n", "```bash\n", "docker compose -f docker-compose.workers.yml --profile cpu up\n", "```" ] }, { "cell_type": "markdown", "id": "3ob9fsoilc4", "metadata": { "papermill": { "duration": 0.000901, "end_time": "2026-01-18T07:22:47.809075", "exception": false, "start_time": "2026-01-18T07:22:47.808174", "status": "completed" }, "tags": [] }, "source": [ "## 0. 
Dependencies" ] }, { "cell_type": "code", "execution_count": 1, "id": "wyr2nsoj7", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:47.812056Z", "iopub.status.busy": "2026-01-18T07:22:47.811910Z", "iopub.status.idle": "2026-01-18T07:22:49.130013Z", "shell.execute_reply": "2026-01-18T07:22:49.129363Z" }, "papermill": { "duration": 1.321151, "end_time": "2026-01-18T07:22:49.131123", "exception": false, "start_time": "2026-01-18T07:22:47.809972", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: ray[tune] in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (2.53.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: click>=7.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (8.3.1)\r\n", "Requirement already satisfied: filelock in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (3.20.3)\r\n", "Requirement already satisfied: jsonschema in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (4.26.0)\r\n", "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (1.1.2)\r\n", "Requirement already satisfied: packaging>=24.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (25.0)\r\n", "Requirement already satisfied: protobuf>=3.20.3 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (6.33.4)\r\n", "Requirement already satisfied: pyyaml in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (6.0.3)\r\n", "Requirement already satisfied: requests in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (2.32.5)\r\n", "Requirement already satisfied: pandas in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from 
ray[tune]) (2.3.3)\r\n", "Requirement already satisfied: pydantic!=2.0.*,!=2.1.*,!=2.10.*,!=2.11.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (2.12.5)\r\n", "Requirement already satisfied: tensorboardX>=1.9 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (2.6.4)\r\n", "Requirement already satisfied: pyarrow>=9.0.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (22.0.0)\r\n", "Requirement already satisfied: fsspec in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from ray[tune]) (2026.1.0)\r\n", "Requirement already satisfied: annotated-types>=0.6.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pydantic!=2.0.*,!=2.1.*,!=2.10.*,!=2.11.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3->ray[tune]) (0.7.0)\r\n", "Requirement already satisfied: pydantic-core==2.41.5 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pydantic!=2.0.*,!=2.1.*,!=2.10.*,!=2.11.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3->ray[tune]) (2.41.5)\r\n", "Requirement already satisfied: typing-extensions>=4.14.1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pydantic!=2.0.*,!=2.1.*,!=2.10.*,!=2.11.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3->ray[tune]) (4.15.0)\r\n", "Requirement already satisfied: typing-inspection>=0.4.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pydantic!=2.0.*,!=2.1.*,!=2.10.*,!=2.11.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3->ray[tune]) (0.4.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: numpy in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from tensorboardX>=1.9->ray[tune]) (2.4.1)\r\n", "Requirement already satisfied: attrs>=22.2.0 
in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from jsonschema->ray[tune]) (25.4.0)\r\n", "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from jsonschema->ray[tune]) (2025.9.1)\r\n", "Requirement already satisfied: referencing>=0.28.4 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from jsonschema->ray[tune]) (0.37.0)\r\n", "Requirement already satisfied: rpds-py>=0.25.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from jsonschema->ray[tune]) (0.30.0)\r\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas->ray[tune]) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2020.1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas->ray[tune]) (2025.2)\r\n", "Requirement already satisfied: tzdata>=2022.7 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas->ray[tune]) (2025.3)\r\n", "Requirement already satisfied: six>=1.5 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->ray[tune]) (1.17.0)\r\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests->ray[tune]) (3.4.4)\r\n", "Requirement already satisfied: idna<4,>=2.5 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests->ray[tune]) (3.11)\r\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests->ray[tune]) (2.6.3)\r\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests->ray[tune]) (2026.1.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated 
packages.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: optuna in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (4.6.0)\r\n", "Requirement already satisfied: alembic>=1.5.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (1.18.1)\r\n", "Requirement already satisfied: colorlog in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (6.10.1)\r\n", "Requirement already satisfied: numpy in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (2.4.1)\r\n", "Requirement already satisfied: packaging>=20.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (25.0)\r\n", "Requirement already satisfied: sqlalchemy>=1.4.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (2.0.45)\r\n", "Requirement already satisfied: tqdm in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (4.67.1)\r\n", "Requirement already satisfied: PyYAML in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from optuna) (6.0.3)\r\n", "Requirement already satisfied: Mako in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from alembic>=1.5.0->optuna) (1.3.10)\r\n", "Requirement already satisfied: typing-extensions>=4.12 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from alembic>=1.5.0->optuna) (4.15.0)\r\n", "Requirement already satisfied: greenlet>=1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from sqlalchemy>=1.4.2->optuna) (3.3.0)\r\n", "Requirement already satisfied: MarkupSafe>=0.9.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from Mako->alembic>=1.5.0->optuna) (3.0.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: 
requests in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (2.32.5)\r\n", "Requirement already satisfied: pandas in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (2.3.3)\r\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests) (3.4.4)\r\n", "Requirement already satisfied: idna<4,>=2.5 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests) (3.11)\r\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests) (2.6.3)\r\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from requests) (2026.1.4)\r\n", "Requirement already satisfied: numpy>=1.26.0 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas) (2.4.1)\r\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2020.1 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas) (2025.2)\r\n", "Requirement already satisfied: tzdata>=2022.7 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from pandas) (2025.3)\r\n", "Requirement already satisfied: six>=1.5 in /home/sergio/MastersThesis/.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# Install dependencies (run once)\n", "%pip install -U \"ray[tune]\"\n", "%pip install optuna\n", "%pip install requests pandas" ] }, { "cell_type": "markdown", "id": "imports-header", "metadata": { "papermill": { "duration": 0.002313, "end_time": "2026-01-18T07:22:49.136199", "exception": false, "start_time": 
"2026-01-18T07:22:49.133886", "status": "completed" }, "tags": [] }, "source": [ "## 1. Imports & Setup" ] }, { "cell_type": "code", "execution_count": 2, "id": "imports", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:49.141850Z", "iopub.status.busy": "2026-01-18T07:22:49.141713Z", "iopub.status.idle": "2026-01-18T07:22:50.248414Z", "shell.execute_reply": "2026-01-18T07:22:50.247699Z" }, "papermill": { "duration": 1.111175, "end_time": "2026-01-18T07:22:50.249605", "exception": false, "start_time": "2026-01-18T07:22:49.138430", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import os\n", "from datetime import datetime\n", "\n", "import requests\n", "import pandas as pd\n", "\n", "import ray\n", "from ray import tune, air\n", "from ray.tune.search.optuna import OptunaSearch" ] }, { "cell_type": "markdown", "id": "config-header", "metadata": { "papermill": { "duration": 0.00953, "end_time": "2026-01-18T07:22:50.261880", "exception": false, "start_time": "2026-01-18T07:22:50.252350", "status": "completed" }, "tags": [] }, "source": [ "## 2. 
API Configuration" ] }, { "cell_type": "code", "execution_count": 3, "id": "config", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:50.267482Z", "iopub.status.busy": "2026-01-18T07:22:50.267340Z", "iopub.status.idle": "2026-01-18T07:22:50.269689Z", "shell.execute_reply": "2026-01-18T07:22:50.269264Z" }, "papermill": { "duration": 0.006027, "end_time": "2026-01-18T07:22:50.270230", "exception": false, "start_time": "2026-01-18T07:22:50.264203", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# PaddleOCR REST API endpoints - 2 workers for parallel trials\n", "# Start workers with: cd src/paddle_ocr && docker compose -f docker-compose.workers.yml up\n", "WORKER_PORTS = [8001, 8002]\n", "WORKER_URLS = [f\"http://localhost:{port}\" for port in WORKER_PORTS]\n", "\n", "# Output folder for results\n", "OUTPUT_FOLDER = \"results\"\n", "os.makedirs(OUTPUT_FOLDER, exist_ok=True)\n", "\n", "# Number of concurrent trials = number of workers\n", "NUM_WORKERS = len(WORKER_URLS)" ] }, { "cell_type": "code", "execution_count": 4, "id": "health-check", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:50.275708Z", "iopub.status.busy": "2026-01-18T07:22:50.275626Z", "iopub.status.idle": "2026-01-18T07:22:50.283441Z", "shell.execute_reply": "2026-01-18T07:22:50.282984Z" }, "papermill": { "duration": 0.011534, "end_time": "2026-01-18T07:22:50.284080", "exception": false, "start_time": "2026-01-18T07:22:50.272546", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✓ http://localhost:8001: ok (GPU: None)\n", "✓ http://localhost:8002: ok (GPU: None)\n", "\n", "2/2 workers ready for parallel tuning\n" ] } ], "source": [ "# Verify all workers are running\n", "healthy_workers = []\n", "for url in WORKER_URLS:\n", "    try:\n", "        health = requests.get(f\"{url}/health\", timeout=10).json()\n", "        if health['status'] == 'ok' and health['model_loaded']:\n", "            
healthy_workers.append(url)\n", "            print(f\"✓ {url}: {health['status']} (GPU: {health.get('gpu_name', 'N/A')})\")\n", "        else:\n", "            print(f\"✗ {url}: not ready yet\")\n", "    except requests.exceptions.RequestException:\n", "        # Covers connection errors, timeouts, and invalid JSON responses\n", "        print(f\"✗ {url}: not reachable\")\n", "\n", "if not healthy_workers:\n", "    raise RuntimeError(\n", "        \"No healthy workers found. Start them with:\\n\"\n", "        \" cd src/paddle_ocr && docker compose -f docker-compose.workers.yml up\"\n", "    )\n", "\n", "print(f\"\\n{len(healthy_workers)}/{len(WORKER_URLS)} workers ready for parallel tuning\")" ] }, { "cell_type": "markdown", "id": "search-space-header", "metadata": { "papermill": { "duration": 0.002325, "end_time": "2026-01-18T07:22:50.288969", "exception": false, "start_time": "2026-01-18T07:22:50.286644", "status": "completed" }, "tags": [] }, "source": [ "## 3. Search Space" ] }, { "cell_type": "code", "execution_count": 5, "id": "search-space", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:50.294569Z", "iopub.status.busy": "2026-01-18T07:22:50.294500Z", "iopub.status.idle": "2026-01-18T07:22:50.296998Z", "shell.execute_reply": "2026-01-18T07:22:50.296295Z" }, "papermill": { "duration": 0.006486, "end_time": "2026-01-18T07:22:50.297804", "exception": false, "start_time": "2026-01-18T07:22:50.291318", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "search_space = {\n", "    # Whether to use document image orientation classification\n", "    \"use_doc_orientation_classify\": tune.choice([True, False]),\n", "    # Whether to use text image unwarping\n", "    \"use_doc_unwarping\": tune.choice([True, False]),\n", "    # Whether to use text line orientation classification\n", "    \"textline_orientation\": tune.choice([True, False]),\n", "    # Detection pixel threshold (pixels > threshold are considered text)\n", "    \"text_det_thresh\": tune.uniform(0.0, 0.7),\n", "    # Detection box threshold (average score within border)\n", "    \"text_det_box_thresh\": tune.uniform(0.0, 0.7),\n", "    
# Text detection expansion coefficient (fixed at 0.0 in this search)\n", "    \"text_det_unclip_ratio\": tune.choice([0.0]),\n", "    # Text recognition threshold (filter low confidence results)\n", "    \"text_rec_score_thresh\": tune.uniform(0.0, 0.7),\n", "}" ] }, { "cell_type": "markdown", "id": "trainable-header", "metadata": { "papermill": { "duration": 0.002321, "end_time": "2026-01-18T07:22:50.302532", "exception": false, "start_time": "2026-01-18T07:22:50.300211", "status": "completed" }, "tags": [] }, "source": [ "## 4. Trainable Function" ] }, { "cell_type": "code", "execution_count": 6, "id": "trainable", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:50.308222Z", "iopub.status.busy": "2026-01-18T07:22:50.308103Z", "iopub.status.idle": "2026-01-18T07:22:50.311240Z", "shell.execute_reply": "2026-01-18T07:22:50.310694Z" }, "papermill": { "duration": 0.007301, "end_time": "2026-01-18T07:22:50.312116", "exception": false, "start_time": "2026-01-18T07:22:50.304815", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def trainable_paddle_ocr(config):\n", "    \"\"\"Call PaddleOCR REST API with the given hyperparameter config.\"\"\"\n", "    import random\n", "    import requests\n", "    from ray import tune\n", "\n", "    # Worker URLs - random selection (spreads trials across workers on average; two concurrent trials can still hit the same worker)\n", "    WORKER_PORTS = [8001, 8002]\n", "    api_url = f\"http://localhost:{random.choice(WORKER_PORTS)}\"\n", "\n", "    payload = {\n", "        \"pdf_folder\": \"/app/dataset\",\n", "        \"use_doc_orientation_classify\": config.get(\"use_doc_orientation_classify\", False),\n", "        \"use_doc_unwarping\": config.get(\"use_doc_unwarping\", False),\n", "        \"textline_orientation\": config.get(\"textline_orientation\", True),\n", "        \"text_det_thresh\": config.get(\"text_det_thresh\", 0.0),\n", "        \"text_det_box_thresh\": config.get(\"text_det_box_thresh\", 0.0),\n", "        \"text_det_unclip_ratio\": config.get(\"text_det_unclip_ratio\", 1.5),\n", "        \"text_rec_score_thresh\": 
config.get(\"text_rec_score_thresh\", 0.0),\n", "        \"start_page\": 5,\n", "        \"end_page\": 10,\n", "    }\n", "\n", "    try:\n", "        response = requests.post(f\"{api_url}/evaluate\", json=payload, timeout=None)\n", "        response.raise_for_status()\n", "        metrics = response.json()\n", "        metrics[\"worker\"] = api_url\n", "    except Exception as e:\n", "        # Report worst-case scores when the request fails\n", "        metrics = {\n", "            \"CER\": 1.0,\n", "            \"WER\": 1.0,\n", "            \"TIME\": 0.0,\n", "            \"PAGES\": 0,\n", "            \"TIME_PER_PAGE\": 0,\n", "            \"worker\": api_url,\n", "            \"ERROR\": str(e)[:500],\n", "        }\n", "\n", "    # tune.report takes a single metrics dict in current Ray releases, not keyword arguments\n", "    tune.report(metrics)" ] }, { "cell_type": "markdown", "id": "tuner-header", "metadata": { "papermill": { "duration": 0.002522, "end_time": "2026-01-18T07:22:50.317277", "exception": false, "start_time": "2026-01-18T07:22:50.314755", "status": "completed" }, "tags": [] }, "source": [ "## 5. Run Tuner" ] }, { "cell_type": "code", "execution_count": 7, "id": "ray-init", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:50.323163Z", "iopub.status.busy": "2026-01-18T07:22:50.323037Z", "iopub.status.idle": "2026-01-18T07:22:54.197904Z", "shell.execute_reply": "2026-01-18T07:22:54.196986Z" }, "papermill": { "duration": 3.878908, "end_time": "2026-01-18T07:22:54.198593", "exception": false, "start_time": "2026-01-18T07:22:50.319685", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-18 08:22:51,904\tINFO worker.py:2007 -- Started a local Ray instance.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ray Tune ready (version: 2.53.0)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/sergio/MastersThesis/.venv/lib/python3.12/site-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). 
To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0\n", " warnings.warn(\n" ] } ], "source": [ "ray.init(ignore_reinit_error=True)\n", "print(f\"Ray Tune ready (version: {ray.__version__})\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "tuner", "metadata": { "execution": { "iopub.execute_input": "2026-01-18T07:22:54.213071Z", "iopub.status.busy": "2026-01-18T07:22:54.212310Z" }, "papermill": { "duration": null, "end_time": null, "exception": false, "start_time": "2026-01-18T07:22:54.201610", "status": "running" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/sergio/MastersThesis/.venv/lib/python3.12/site-packages/ray/tune/impl/tuner_internal.py:144: RayDeprecationWarning: The `RunConfig` class should be imported from `ray.tune` when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0\n", " _log_deprecation_warning(\n", "2026-01-18 08:22:54,222\tINFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[I 2026-01-18 08:22:54,226] A new study created in memory with name: optuna\n" ] }, { "data": { "text/html": [ "
<h3>Tune Status</h3>\n",
"<table>\n",
"<tr><td>Current time:</td><td>2026-01-18 08:23:19</td></tr>\n",
"<tr><td>Running for:</td><td>00:00:25.26</td></tr>\n",
"<tr><td>Memory:</td><td>57.8/119.7 GiB</td></tr>\n",
"</table>\n",
"<h3>System Info</h3>\n",
"Using FIFO scheduling algorithm.<br>Logical resource usage: 2.0/20 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:GB10)\n",
"<h3>Trial Status</h3>\n",
"<table>\n",
"<tr><th>Trial name</th><th>status</th><th>loc</th><th>text_det_box_thresh</th><th>text_det_thresh</th><th>text_det_unclip_ratio</th><th>text_rec_score_thresh</th><th>textline_orientation</th><th>use_doc_orientation_classify</th><th>use_doc_unwarping</th></tr>\n",
"<tr><td>trainable_paddle_ocr_59252191</td><td>RUNNING</td><td>192.168.65.140:1195312</td><td>0.414043</td><td>0.337475</td><td>0</td><td>0.478234</td><td>True</td><td>True</td><td>True</td></tr>\n",
"<tr><td>trainable_paddle_ocr_47499299</td><td>RUNNING</td><td>192.168.65.140:1195374</td><td>0.544738</td><td>0.269735</td><td>0</td><td>0.30771</td><td>True</td><td>False</td><td>True</td></tr>\n",
"</table>
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[36m(pid=gcs_server)\u001b[0m [2026-01-18 08:23:20,495 E 1193965 1193965] (gcs_server) gcs_server.cc:303: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[33m(raylet)\u001b[0m [2026-01-18 08:23:21,833 E 1194136 1194136] (raylet) main.cc:1032: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[36m(bundle_reservation_check_func pid=1194212)\u001b[0m [2026-01-18 08:23:23,446 E 1194212 1194301] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[2026-01-18 08:23:24,197 E 1193837 1194205] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. 
rpc_code: 14\n" ] } ], "source": [ "tuner = tune.Tuner(\n", "    trainable_paddle_ocr,\n", "    tune_config=tune.TuneConfig(\n", "        metric=\"CER\",\n", "        mode=\"min\",\n", "        search_alg=OptunaSearch(),\n", "        num_samples=64,\n", "        max_concurrent_trials=NUM_WORKERS,  # Run trials in parallel across workers\n", "    ),\n", "    # RunConfig is taken from ray.tune, as the Ray deprecation warning asks\n", "    run_config=tune.RunConfig(verbose=2, log_to_file=False),\n", "    param_space=search_space,\n", ")\n", "\n", "results = tuner.fit()" ] }, { "cell_type": "markdown", "id": "analysis-header", "metadata": { "papermill": { "duration": null, "end_time": null, "exception": null, "start_time": null, "status": "pending" }, "tags": [] }, "source": [ "## 6. Results Analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "results-df", "metadata": { "papermill": { "duration": null, "end_time": null, "exception": null, "start_time": null, "status": "pending" }, "tags": [] }, "outputs": [], "source": [ "df = results.get_dataframe()\n", "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "save-results", "metadata": { "papermill": { "duration": null, "end_time": null, "exception": null, "start_time": null, "status": "pending" }, "tags": [] }, "outputs": [], "source": [ "# Save results to CSV\n", "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", "filename = f\"raytune_paddle_rest_results_{timestamp}.csv\"\n", "filepath = os.path.join(OUTPUT_FOLDER, filename)\n", "\n", "df.to_csv(filepath, index=False)\n", "print(f\"Results saved: {filepath}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "best-config", "metadata": { "papermill": { "duration": null, "end_time": null, "exception": null, "start_time": null, "status": "pending" }, "tags": [] }, "outputs": [], "source": [ "# Best configuration (row with the lowest CER)\n", "best = df.loc[df[\"CER\"].idxmin()]\n", "\n", "print(f\"Best CER: {best['CER']:.6f}\")\n", "print(f\"Best WER: {best['WER']:.6f}\")\n", "print(\"\\nOptimal Configuration:\")\n", "print(f\" textline_orientation: 
{best['config/textline_orientation']}\")\n", "print(f\" use_doc_orientation_classify: {best['config/use_doc_orientation_classify']}\")\n", "print(f\" use_doc_unwarping: {best['config/use_doc_unwarping']}\")\n", "print(f\" text_det_thresh: {best['config/text_det_thresh']:.4f}\")\n", "print(f\" text_det_box_thresh: {best['config/text_det_box_thresh']:.4f}\")\n", "print(f\" text_det_unclip_ratio: {best['config/text_det_unclip_ratio']}\")\n", "print(f\" text_rec_score_thresh: {best['config/text_rec_score_thresh']:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "correlation", "metadata": { "papermill": { "duration": null, "end_time": null, "exception": null, "start_time": null, "status": "pending" }, "tags": [] }, "outputs": [], "source": [ "# Correlation analysis (numeric parameters only)\n", "param_cols = [\n", "    \"config/text_det_thresh\",\n", "    \"config/text_det_box_thresh\",\n", "    # Constant (0.0) in this search space, so its correlation comes out as NaN\n", "    \"config/text_det_unclip_ratio\",\n", "    \"config/text_rec_score_thresh\",\n", "]\n", "\n", "corr_cer = df[param_cols + [\"CER\"]].corr()[\"CER\"].sort_values(ascending=False)\n", "corr_wer = df[param_cols + [\"WER\"]].corr()[\"WER\"].sort_values(ascending=False)\n", "\n", "print(\"Correlation with CER:\")\n", "print(corr_cer)\n", "print(\"\\nCorrelation with WER:\")\n", "print(corr_wer)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "papermill": { "default_parameters": {}, "duration": null, "end_time": null, "environment_variables": {}, "exception": null, "input_path": "paddle_ocr_raytune_rest.ipynb", "output_path": "output_raytune.ipynb", "parameters": {}, "start_time": "2026-01-18T07:22:47.169883", "version": "2.6.0" } }, "nbformat": 4, "nbformat_minor": 5 }