Ultimate Guide: Run GLM-OCR Locally on MacBook Fast
Step-by-step Ollama setup for GLM-OCR on macOS — pull the model, set num_ctx=16384, and run a local OpenAI‑compatible…

I spent an afternoon setting up GLM-OCR on RunPod Serverless with vLLM, custom Dockerfiles, CUDA version mismatches, and RunPod handler scripts. Then I realized the model is only 0.9B parameters and uses 2.5GB of memory. It runs on a MacBook.
If you just need document OCR for development, testing, or even light production use, you do not need cloud GPUs. This guide shows you how to go from zero to a working OCR API on your Mac in about five minutes. The same steps work for most models in the Ollama library.
Install Ollama
Ollama is a tool for running language models locally. It handles model downloads, quantization, and serves an API that is compatible with the OpenAI format. On macOS, install it with Homebrew.
brew install ollama
Start the Ollama service in the background so it runs automatically on login.
brew services start ollama
Ollama is now listening on http://localhost:11434. You can verify it is running with a quick health check.
curl http://localhost:11434/
You should see "Ollama is running" in the response.
Pull the GLM-OCR model
GLM-OCR is a 0.9B parameter vision model from Team GLM, designed specifically for document OCR. It handles text recognition, table extraction, formula parsing, and structured information extraction. The quantized version that Ollama downloads is about 2.2GB.
ollama pull glm-ocr
Once downloaded, confirm the model is available.
ollama list
You should see glm-ocr:latest in the output with a size of approximately 2.2GB.
The context size gotcha
This is the one thing that will trip you up. Ollama defaults to a context size of 4096 tokens, which is not enough for processing images. When GLM-OCR tries to encode an image with the default context, you get a cryptic crash.
GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed
The fix is to set num_ctx to at least 16384 when making requests. I will show this in every example below so you do not have to debug it yourself.
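If you would rather bake the larger context into the model once instead of passing it on every request, Ollama Modelfiles support this with the PARAMETER directive. Here is a minimal sketch; the glm-ocr-16k name is just a label I chose:

```shell
# Create a variant of glm-ocr with num_ctx preset to 16384
cat > Modelfile <<'EOF'
FROM glm-ocr
PARAMETER num_ctx 16384
EOF

ollama create glm-ocr-16k -f Modelfile

# Now every request to glm-ocr-16k uses the larger context automatically
ollama run glm-ocr-16k "Text Recognition: /tmp/receipt.jpg"
```

This is especially handy for clients that have no way to pass Ollama-specific options.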
Test from the command line
The simplest way to test is with Ollama's built-in CLI. Drag an image file into your terminal after the prompt.
ollama run glm-ocr "Text Recognition: ./path/to/your/document.png"
For a quick test, download a sample image first.
curl -sL -o /tmp/receipt.jpg "https://upload.wikimedia.org/wikipedia/commons/0/0b/ReceiptSwiss.jpg"
ollama run glm-ocr "Text Recognition: /tmp/receipt.jpg"
The model should return the text content of the receipt, including items, prices, and totals.
Use the API with Python
For integration into your own applications, Ollama serves an API on port 11434. Here is a complete working example that sends an image and gets back the recognized text.
# File: test_ocr.py
import base64
import json
import urllib.request

def ocr_image(image_path, prompt="Text Recognition:"):
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    data = json.dumps({
        "model": "glm-ocr",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [img_b64]
            }
        ],
        "stream": False,
        "options": {"num_ctx": 16384}
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    resp = urllib.request.urlopen(req, timeout=120)
    result = json.loads(resp.read().decode())
    return result["message"]["content"]

if __name__ == "__main__":
    text = ocr_image("/tmp/receipt.jpg")
    print(text)
Run it with python3 test_ocr.py. On an M1 Pro, expect about 40-50 seconds for image processing and a few seconds for text generation. The num_ctx: 16384 option in the request is critical. Without it, the model crashes on any non-trivial image.
The script uses only standard library modules so there is nothing extra to install. If you prefer the requests library or the official OpenAI Python SDK, those work too since Ollama serves an OpenAI-compatible API.
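If you have a folder of documents rather than a single file, a thin wrapper around the ocr_image function above does the job. This is a sketch under my own naming (ocr_directory is not part of any API); the OCR callable is passed in as a parameter so you can swap in the requests or OpenAI SDK version, or stub it out in tests:

```python
from pathlib import Path

def ocr_directory(folder, ocr_fn, exts=(".png", ".jpg", ".jpeg")):
    """Run an OCR callable over every image in a folder.

    Returns a dict mapping filename -> recognized text. Files whose
    extension is not in `exts` are skipped.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in exts:
            results[path.name] = ocr_fn(str(path))
    return results
```

Called as ocr_directory("/tmp/scans", ocr_image), it processes images one at a time, which matches how Ollama serves a single local model anyway.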
Use the OpenAI-compatible API
Ollama also serves an OpenAI-compatible endpoint at http://localhost:11434/v1. This means you can use the OpenAI Python SDK or any tool that supports custom API base URLs.
# File: test_ocr_openai.py
import base64
from openai import OpenAI

client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1",
)

with open("/tmp/receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
                },
                {"type": "text", "text": "Text Recognition:"}
            ]
        }
    ],
)

print(response.choices[0].message.content)
This requires pip install openai but gives you the standard OpenAI interface. If you later move to a cloud-hosted model, you only change the base_url and api_key. One caveat: the OpenAI request schema has no field for Ollama's options, so you cannot pass num_ctx here directly. If you hit the context crash through this endpoint, bake num_ctx into a model variant with a Modelfile and point the SDK at that model instead.
Supported prompts
GLM-OCR is not a general-purpose vision model. It responds to specific prompt formats.
For document parsing, use one of these exact strings as the text content:
Text Recognition: extracts raw text from the image
Formula Recognition: extracts mathematical formulas as LaTeX
Table Recognition: extracts table structures
For structured information extraction, provide a JSON schema. The model fills in the values from the document.
prompt = """Please output the information in the image in the following JSON format:
{"name": "", "date": "", "total": "", "items": []}"""
result = ocr_image("/tmp/receipt.jpg", prompt=prompt)
print(result)
The model returns a JSON object matching your schema with values extracted from the image. This is particularly useful for invoices, ID cards, and forms where you know the structure upfront.
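In practice, models sometimes wrap the JSON in markdown fences or add a sentence around it, so it pays to pull out the object before parsing. A small helper along these lines works (parse_model_json is my own name, not part of any library):

```python
import json
import re

def parse_model_json(raw):
    """Extract and parse a JSON object from raw model output.

    Grabs the outermost {...} span so markdown fences or surrounding
    prose around the JSON do not break json.loads.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

With this in place, parse_model_json(result) gives you a plain dict you can validate against your schema before trusting the extracted values.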
Performance on Apple Silicon
On my M1 Pro MacBook, GLM-OCR processes a typical document image in about 40-50 seconds. Most of that time is spent encoding the image. Text generation is fast at around 60 tokens per second.
The model uses about 2.5GB of memory during inference. Any Mac with 8GB or more of unified memory will run it comfortably.
If speed is critical for production workloads, a cloud GPU will process images in 2-3 seconds instead of 40-50. But for development, testing, and low-volume use, running locally saves you from managing infrastructure entirely.
Running other models
Everything in this guide applies to any model in the Ollama library. To try a different OCR or vision model, just swap the model name.
ollama pull llama3.2-vision
ollama run llama3.2-vision "Describe this image: ./photo.jpg"
The API calls are identical. Change the model field in your requests and everything else stays the same.
Wrapping up
GLM-OCR runs locally on a MacBook with Ollama in about five minutes of setup. Install Ollama, pull the model, set num_ctx to 16384 so it does not crash on images, and you have a working OCR API on localhost. No cloud accounts, no Docker, no GPU drivers.
The model handles text, tables, formulas, and structured extraction well for English and Chinese documents. For other languages, you will want a different model since GLM-OCR is bilingual only.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija


