Run GLM-OCR on RunPod Serverless: 17-line Dockerfile
Custom Dockerfile with Transformers v5 and pre-baked GLM-OCR weights for fast RunPod serverless cold starts

I wanted to run GLM-OCR as a serverless endpoint. Not a dedicated pod sitting idle burning credits, but a proper serverless setup that scales to zero when nobody is calling it and spins up on demand. RunPod Serverless seemed like the right fit, but getting there was not as straightforward as I expected.
The problem is that GLM-OCR requires a bleeding-edge version of Transformers (v5+ dev branch) that no public Docker image ships with. You cannot just pick vllm/vllm-openai:nightly from the registry, set a start command, and go. The model will not load. You need a custom image, and if you are running serverless, you want that image to be as self-contained as possible so cold starts do not punish you with multi-gigabyte downloads every time a new worker spins up.
This guide walks through the exact Dockerfile I built, the gotchas I hit along the way, and how to deploy it on RunPod Serverless. The full source is available at github.com/matija2209/ocr-docker.
Why not just use a public image?
RunPod Serverless gives you two options: use a public Docker image or build from a public GitHub repo. The public image route sounds appealing — point it at vllm/vllm-openai:nightly, add a start command that installs Transformers from source, and let it rip.
The start command would look something like this:
bash -lc '
set -e
pip uninstall -y transformers || true
pip install -U git+https://github.com/huggingface/transformers.git
exec vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
'
This technically works, but it has two serious downsides for serverless. First, every cold start pays the cost of installing Transformers from source. That is a git clone plus a wheel build on every new worker. Second, the model weights get downloaded from HuggingFace on every cold start too. For a model that is roughly 2GB, that adds meaningful latency before your endpoint can serve its first request.
The better approach is to bake everything into the image at build time: the Transformers upgrade, the model weights, all of it. One build, and every cold start after that just loads from disk.
The Dockerfile
Here is the complete Dockerfile. It is short, but every line is there for a reason.
# File: Dockerfile
FROM vllm/vllm-openai:nightly

# git is needed for pip install from GitHub
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# Install newer Transformers so GLM-OCR is recognized
RUN pip uninstall -y transformers || true \
    && pip install -U git+https://github.com/huggingface/transformers.git

# Pre-download model weights into the image so cold starts don't hit HuggingFace
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"

EXPOSE 8080
CMD ["vllm", "serve", "zai-org/GLM-OCR", "--allowed-local-media-path", "/", "--port", "8080"]
Let me walk through what each piece does and why it is necessary.
Installing git
The vllm/vllm-openai:nightly base image does not include git. That might seem like a minor detail, but pip install git+https://... literally needs git to clone the repository. Without it, the build fails immediately with executable file not found in $PATH. A quick apt-get install solves this.
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
The --no-install-recommends flag keeps the layer small by skipping suggested packages, and cleaning up /var/lib/apt/lists/* removes the package index cache.
Upgrading Transformers
GLM-OCR requires Transformers v5+, which at the time of writing has not been released to PyPI yet. The only way to get it is to install directly from the main branch on GitHub.
RUN pip uninstall -y transformers || true \
    && pip install -U git+https://github.com/huggingface/transformers.git
The uninstall-then-install pattern ensures a clean replacement. You will see a pip warning that vLLM requires transformers<5, but this is safe to ignore. vLLM works fine with the dev branch, and GLM-OCR will not load without it.
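If you want the build to fail fast when the upgrade did not actually take effect, a version check can be added as an extra RUN step. Here is a sketch of the check logic; the is_v5_or_newer helper is my own, not part of any library, and it assumes only the packaging package, which pip itself depends on:

```python
from packaging.version import Version

def is_v5_or_newer(ver: str) -> bool:
    # Dev builds report versions like "5.0.0.dev0". Comparing the major
    # component avoids the pitfall that pre-releases sort *before* the
    # corresponding final release (5.0.0.dev0 < 5.0.0).
    return Version(ver).major >= 5

assert is_v5_or_newer("5.0.0.dev0")
assert not is_v5_or_newer("4.44.2")
```

In the Dockerfile this could become a one-liner like RUN python3 -c "import transformers; from packaging.version import Version; assert Version(transformers.__version__).major >= 5", which turns a silently-wrong image into an immediate build failure.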
Baking in the model weights
This is the most important step for serverless performance. Without it, every new worker downloads roughly 2GB of model weights from HuggingFace before it can serve a single request.
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"
A few things to note here. The command uses python3, not python. The base image is Debian-based and does not alias python to python3, so using python will fail with a "not found" error. I also use the Python API directly (snapshot_download) rather than the huggingface-cli command because the CLI binary can end up outside of $PATH after upgrading huggingface-hub during the Transformers install.
GLM-OCR is a public model under the MIT license, so no authentication token is needed. You will see a warning about unauthenticated requests having lower rate limits, but the download completes fine. If you want faster downloads during builds, you can set HF_TOKEN as an environment variable in RunPod's build settings.
The resulting image is around 10.5GB. That is large by general Docker standards, but completely normal for ML serving images and well within RunPod's limits.
Deploying on RunPod Serverless
Push the Dockerfile to a public GitHub repository. I use github.com/matija2209/ocr-docker for this.
In RunPod, create a new Serverless endpoint and select Build from GitHub repo. Point it to your repository. You do not need to set a container start command because the CMD in the Dockerfile already handles it.
No environment variables are required for the basic setup. The model is public, the port is configured, and the weights are in the image. Just deploy and wait for the build to complete.
Once the endpoint is live, it serves an OpenAI-compatible API. You can call it like this:
curl https://<your-runpod-endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-OCR",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}},
{"type": "text", "text": "Text Recognition:"}
]
}
]
}'
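The same request can be assembled from Python. The helper below is my own sketch, not an official client: it builds the OpenAI-compatible payload, which you would then POST to https://<your-runpod-endpoint>/v1/chat/completions with any HTTP library (requests, httpx, or the openai SDK pointed at your endpoint's base URL):

```python
import json

def build_ocr_request(image_url: str, prompt: str = "Text Recognition:") -> dict:
    """Build an OpenAI-compatible chat payload for the GLM-OCR endpoint."""
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

payload = build_ocr_request("https://example.com/document.png")
print(json.dumps(payload, indent=2))
```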
GLM-OCR prompt reference
GLM-OCR is not a general-purpose vision model. It supports a specific set of prompts, and using anything outside of these will give you unreliable results.
For document parsing, use one of these exact prompt strings:
Text Recognition: for extracting raw text
Formula Recognition: for mathematical formulas
Table Recognition: for table structures
For information extraction, provide a JSON schema that defines exactly what fields you want. The model will fill in the values from the document:
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "path/to/id-card.png"}},
{"type": "text", "text": "Please output the information in the image in the following JSON format:\n{\"name\": \"\", \"date_of_birth\": \"\", \"id_number\": \"\"}"}
]
}
The output must strictly follow the JSON schema you provide. This is by design and is what makes the model useful for structured document processing pipelines.
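Typing that schema prompt by hand is error-prone, so in a pipeline I would generate it from the list of fields. A minimal sketch; the extraction_prompt helper is hypothetical, not part of GLM-OCR's tooling:

```python
import json

def extraction_prompt(fields: list[str]) -> str:
    """Render the information-extraction prompt GLM-OCR expects:
    a JSON skeleton with empty string values for the model to fill in."""
    skeleton = json.dumps({field: "" for field in fields})
    return ("Please output the information in the image "
            "in the following JSON format:\n" + skeleton)

print(extraction_prompt(["name", "date_of_birth", "id_number"]))
```

Because the model's output follows the same skeleton, the response can usually be fed straight into json.loads on the other end.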
Gotchas I hit along the way
For reference, here are the build failures I ran into so you can avoid them:
No git in base image. The vllm/vllm-openai:nightly image does not ship with git. Any pip install git+https://... will fail with exit code 127. Install git first.
python vs python3. The base image only has python3 on PATH. Using python gives you "not found". Always use python3 explicitly.
huggingface-cli not on PATH. After upgrading huggingface-hub as a dependency of Transformers, the CLI binary can land somewhere outside of $PATH. Using the Python API directly (from huggingface_hub import snapshot_download) bypasses this entirely.
Transformers version conflict warning. vLLM pins transformers<5 in its dependencies. Installing the v5 dev branch triggers a pip warning. It is safe to ignore — vLLM runs fine and GLM-OCR requires it.
Wrapping up
Running GLM-OCR on serverless comes down to one key decision: bake everything into the image. The model needs a version of Transformers that no public image ships, and serverless cold starts punish you for anything that needs to be installed or downloaded at runtime. A 17-line Dockerfile solves both problems.
The full Dockerfile is at github.com/matija2209/ocr-docker. Fork it, point RunPod at it, and you have a serverless OCR endpoint that scales to zero and starts fast.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija


