Run GLM-OCR on RunPod Serverless: 17-line Dockerfile
Custom Dockerfile with Transformers v5 and pre-baked GLM-OCR weights for fast RunPod serverless cold starts

I wanted to run GLM-OCR as a serverless endpoint. Not a dedicated pod sitting idle burning credits, but a proper serverless setup that scales to zero when nobody is calling it and spins up on demand. RunPod Serverless seemed like the right fit, but getting there was not as straightforward as I expected.
The problem is that GLM-OCR requires a bleeding-edge version of Transformers (v5+ dev branch) that no public Docker image ships with. You cannot just pick vllm/vllm-openai:nightly from the registry, set a start command, and go. The model will not load. You need a custom image, and if you are running serverless, you want that image to be as self-contained as possible so cold starts do not punish you with multi-gigabyte downloads every time a new worker spins up.
This guide walks through the exact Dockerfile I built, the gotchas I hit along the way, and how to deploy it on RunPod Serverless. The full source is available at github.com/matija2209/ocr-docker.
Why not just use a public image?
RunPod Serverless gives you two options: use a public Docker image or build from a public GitHub repo. The public image route sounds appealing — point it at vllm/vllm-openai:nightly, add a start command that installs Transformers from source, and let it rip.
The start command would look something like this:
bash -lc '
set -e
pip uninstall -y transformers || true
pip install -U git+https://github.com/huggingface/transformers.git
exec vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
'
This technically works, but it has two serious downsides for serverless. First, every cold start pays the cost of installing Transformers from source. That is a git clone plus a wheel build on every new worker. Second, the model weights get downloaded from HuggingFace on every cold start too. For a model that is roughly 2GB, that adds meaningful latency before your endpoint can serve its first request.
The better approach is to bake everything into the image at build time: the Transformers upgrade, the model weights, all of it. One build, and every cold start after that just loads from disk.
The Dockerfile
Here is the complete Dockerfile. It is short, but every line is there for a reason.
# File: Dockerfile
FROM vllm/vllm-openai:nightly

# git is needed for pip install from GitHub
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# Install newer Transformers so GLM-OCR is recognized
RUN pip uninstall -y transformers || true \
    && pip install -U git+https://github.com/huggingface/transformers.git

# Pre-download model weights into the image so cold starts don't hit HuggingFace
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"

EXPOSE 8080
CMD ["vllm", "serve", "zai-org/GLM-OCR", "--allowed-local-media-path", "/", "--port", "8080"]
Let me walk through what each piece does and why it is necessary.
Installing git
The vllm/vllm-openai:nightly base image does not include git. That might seem like a minor detail, but pip install git+https://... literally needs git to clone the repository. Without it, the build fails immediately with executable file not found in $PATH. A quick apt-get install solves this.
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
The --no-install-recommends flag keeps the layer small by skipping suggested packages, and cleaning up /var/lib/apt/lists/* removes the package index cache.
Upgrading Transformers
GLM-OCR requires Transformers v5+, which at the time of writing has not been released to PyPI yet. The only way to get it is to install directly from the main branch on GitHub.
RUN pip uninstall -y transformers || true \
    && pip install -U git+https://github.com/huggingface/transformers.git
The uninstall-then-install pattern ensures a clean replacement. You will see a pip warning that vLLM requires transformers<5, but this is safe to ignore. vLLM works fine with the dev branch, and GLM-OCR will not load without it.
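If you want the build to fail fast when the upgrade did not actually take effect, a version check can be added as an extra RUN step. Here is a sketch of the check logic; the is_v5_or_newer helper is my own, not part of any library, and it assumes only the packaging package, which pip itself depends on:

```python
from packaging.version import Version

def is_v5_or_newer(ver: str) -> bool:
    # Dev builds report versions like "5.0.0.dev0". Comparing the major
    # component avoids the pitfall that pre-releases sort *before* the
    # corresponding final release (5.0.0.dev0 < 5.0.0).
    return Version(ver).major >= 5

assert is_v5_or_newer("5.0.0.dev0")
assert not is_v5_or_newer("4.44.2")
```

In the Dockerfile this could become a one-liner like RUN python3 -c "import transformers; from packaging.version import Version; assert Version(transformers.__version__).major >= 5", which turns a silently-wrong image into an immediate build failure.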
Baking in the model weights
This is the most important step for serverless performance. Without it, every new worker downloads roughly 2GB of model weights from HuggingFace before it can serve a single request.
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"
A few things to note here. The command uses python3, not python. The base image is Debian-based and does not alias python to python3, so using python will fail with a "not found" error. I also use the Python API directly (snapshot_download) rather than the huggingface-cli command because the CLI binary can end up outside of $PATH after upgrading huggingface-hub during the Transformers install.
GLM-OCR is a public model under the MIT license, so no authentication token is needed. You will see a warning about unauthenticated requests having lower rate limits, but the download completes fine. If you want faster downloads during builds, you can set HF_TOKEN as an environment variable in RunPod's build settings.
The resulting image is around 10.5GB. That is large by general Docker standards, but completely normal for ML serving images and well within RunPod's limits.
Deploying on RunPod Serverless
Push the Dockerfile to a public GitHub repository. I use github.com/matija2209/ocr-docker for this.
In RunPod, create a new Serverless endpoint and select Build from GitHub repo. Point it to your repository. You do not need to set a container start command because the CMD in the Dockerfile already handles it.
No environment variables are required for the basic setup. The model is public, the port is configured, and the weights are in the image. Just deploy and wait for the build to complete.
Once the endpoint is live, it serves an OpenAI-compatible API. You can call it like this:
curl https://<your-runpod-endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-OCR",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}},
{"type": "text", "text": "Text Recognition:"}
]
}
]
}'
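The same request can be assembled from Python. The helper below is my own sketch, not an official client: it builds the OpenAI-compatible payload, which you would then POST to https://<your-runpod-endpoint>/v1/chat/completions with any HTTP library (requests, httpx, or the openai SDK pointed at your endpoint's base URL):

```python
import json

def build_ocr_request(image_url: str, prompt: str = "Text Recognition:") -> dict:
    """Build an OpenAI-compatible chat payload for the GLM-OCR endpoint."""
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = build_ocr_request("https://example.com/document.png")
print(json.dumps(payload, indent=2))
```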
GLM-OCR prompt reference
GLM-OCR is not a general-purpose vision model. It supports a specific set of prompts, and using anything outside of these will give you unreliable results.
For document parsing, use one of these exact prompt strings:
Text Recognition: for extracting raw text
Formula Recognition: for mathematical formulas
Table Recognition: for table structures
For information extraction, provide a JSON schema that defines exactly what fields you want. The model will fill in the values from the document:
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "path/to/id-card.png"}},
{"type": "text", "text": "Please output the information in the image in the following JSON format:\n{\"name\": \"\", \"date_of_birth\": \"\", \"id_number\": \"\"}"}
]
}
The output must strictly follow the JSON schema you provide. This is by design and is what makes the model useful for structured document processing pipelines.
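Typing that schema prompt by hand is error-prone, so in a pipeline I would generate it from the list of fields. A minimal sketch; the extraction_prompt helper is hypothetical, not part of GLM-OCR's tooling:

```python
import json

def extraction_prompt(fields: list[str]) -> str:
    """Render the information-extraction prompt GLM-OCR expects:
    a JSON skeleton with empty string values for the model to fill in."""
    skeleton = json.dumps({field: "" for field in fields})
    return ("Please output the information in the image "
            "in the following JSON format:\n" + skeleton)

print(extraction_prompt(["name", "date_of_birth", "id_number"]))
```

Because the model's output follows the same skeleton, the response can usually be fed straight into json.loads on the other end.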
Gotchas I hit along the way
For reference, here are the build failures I ran into so you can avoid them:
No git in base image. The vllm/vllm-openai:nightly image does not ship with git. Any pip install git+https://... will fail with exit code 127. Install git first.
python vs python3. The base image only has python3 on PATH. Using python gives you "not found". Always use python3 explicitly.
huggingface-cli not on PATH. After upgrading huggingface-hub as a dependency of Transformers, the CLI binary can land somewhere outside of $PATH. Using the Python API directly (from huggingface_hub import snapshot_download) bypasses this entirely.
Transformers version conflict warning. vLLM pins transformers<5 in its dependencies. Installing the v5 dev branch triggers a pip warning. It is safe to ignore — vLLM runs fine and GLM-OCR requires it.
Wrapping up
Running GLM-OCR on serverless comes down to one key decision: bake everything into the image. The model needs a version of Transformers that no public image ships, and serverless cold starts punish you for anything that needs to be installed or downloaded at runtime. A 17-line Dockerfile solves both problems.
The full Dockerfile is at github.com/matija2209/ocr-docker. Fork it, point RunPod at it, and you have a serverless OCR endpoint that scales to zero and starts fast.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija


