Ultimate Guide: Run GLM-OCR Locally on MacBook Fast


10th February 2026 · Updated 22nd February 2026 · Matija Žiberna


I spent an afternoon setting up GLM-OCR on RunPod Serverless with vLLM, custom Dockerfiles, CUDA version mismatches, and RunPod handler scripts. Then I realized the model is only 0.9B parameters and uses 2.5GB of memory. It runs on a MacBook.

If you just need document OCR for development, testing, or even light production use, you do not need cloud GPUs. This guide shows you how to go from zero to a working OCR API on your Mac in about five minutes. The same steps work for most models in the Ollama library.

Install Ollama

Ollama is a tool for running language models locally. It handles model downloads, quantization, and serves an API that is compatible with the OpenAI format. On macOS, install it with Homebrew.

brew install ollama

Start the Ollama service in the background so it runs automatically on login.

brew services start ollama

Ollama is now listening on http://localhost:11434. You can verify it is running with a quick health check.

curl http://localhost:11434/

You should see "Ollama is running" in the response.

Pull the GLM-OCR model

GLM-OCR is a 0.9B parameter vision model from Team GLM, designed specifically for document OCR. It handles text recognition, table extraction, formula parsing, and structured information extraction. The quantized version that Ollama downloads is about 2.2GB.

ollama pull glm-ocr

Once downloaded, confirm the model is available.

ollama list

You should see glm-ocr:latest in the output with a size of approximately 2.2GB.

The context size gotcha

This is the one thing that will trip you up. Ollama defaults to a context size of 4096 tokens, which is not enough for processing images. When GLM-OCR tries to encode an image with the default context, you get a cryptic crash.

GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed

The fix is to set num_ctx to at least 16384 when making requests. I will show this in every example below so you do not have to debug it yourself.
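If you would rather not pass num_ctx on every request, Ollama's Modelfile format lets you bake the parameter into a derived model. A minimal sketch; the model name glm-ocr-16k is just an illustrative choice, not anything official.

```shell
# Create a Modelfile that inherits glm-ocr and raises the context size.
cat > Modelfile <<'EOF'
FROM glm-ocr
PARAMETER num_ctx 16384
EOF

# Build the derived model; "glm-ocr-16k" is an arbitrary local name.
ollama create glm-ocr-16k -f Modelfile

# Requests against glm-ocr-16k now use the larger context by default.
ollama run glm-ocr-16k "Text Recognition: /tmp/receipt.jpg"
```

This is convenient for the CLI; API clients can still override num_ctx per request via options.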

Test from the command line

The simplest way to test is with Ollama's built-in CLI. Include the image path directly in the prompt; in an interactive session you can also drag an image file into the terminal to insert its path.

ollama run glm-ocr "Text Recognition: ./path/to/your/document.png"

For a quick test, download a sample image first.

curl -sL -o /tmp/receipt.jpg "https://upload.wikimedia.org/wikipedia/commons/0/0b/ReceiptSwiss.jpg"
ollama run glm-ocr "Text Recognition: /tmp/receipt.jpg"

The model should return the text content of the receipt, including items, prices, and totals.

Use the API with Python

For integration into your own applications, Ollama serves an API on port 11434. Here is a complete working example that sends an image and gets back the recognized text.

# File: test_ocr.py
import base64
import json
import urllib.request


def ocr_image(image_path, prompt="Text Recognition:"):
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    data = json.dumps({
        "model": "glm-ocr",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [img_b64]
            }
        ],
        "stream": False,
        "options": {"num_ctx": 16384}
    }).encode()

    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    resp = urllib.request.urlopen(req, timeout=120)
    result = json.loads(resp.read().decode())
    return result["message"]["content"]


if __name__ == "__main__":
    text = ocr_image("/tmp/receipt.jpg")
    print(text)

Run it with python3 test_ocr.py. On an M1 Pro, expect about 40-50 seconds for image processing and a few seconds for text generation. The num_ctx: 16384 option in the request is critical. Without it, the model crashes on any non-trivial image.

The script uses only standard library modules so there is nothing extra to install. If you prefer the requests library or the official OpenAI Python SDK, those work too since Ollama serves an OpenAI-compatible API.

Use the OpenAI-compatible API

Ollama also serves an OpenAI-compatible endpoint at http://localhost:11434/v1. This means you can use the OpenAI Python SDK or any tool that supports custom API base URLs.

# File: test_ocr_openai.py
import base64
from openai import OpenAI

client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1",
)

with open("/tmp/receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{img_b64}"
                    }
                },
                {
                    "type": "text",
                    "text": "Text Recognition:"
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

This requires pip install openai but gives you the standard OpenAI interface. If you later move to a cloud-hosted model, you only change the base_url and api_key.
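One way to make that switch painless is to resolve the endpoint from environment variables with local defaults. A small sketch; the variable names OCR_BASE_URL and OCR_API_KEY are my own convention, not a standard.

```python
import os


def openai_client_config():
    """Resolve API settings from the environment, defaulting to local Ollama.

    OCR_BASE_URL / OCR_API_KEY are illustrative names: set them to point
    the same code at a cloud-hosted OpenAI-compatible endpoint instead.
    """
    return {
        "base_url": os.environ.get("OCR_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("OCR_API_KEY", "ollama"),
    }


cfg = openai_client_config()
print(cfg["base_url"])
```

Pass the resulting dict straight to the OpenAI client: OpenAI(**openai_client_config()).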

Supported prompts

GLM-OCR is not a general-purpose vision model. It responds to specific prompt formats.

For document parsing, use one of these exact strings as the text content:

  • Text Recognition: extracts raw text from the image
  • Formula Recognition: extracts mathematical formulas as LaTeX
  • Table Recognition: extracts table structures

For structured information extraction, provide a JSON schema. The model fills in the values from the document.

prompt = """Please output the information in the image in the following JSON format:
{"name": "", "date": "", "total": "", "items": []}"""

result = ocr_image("/tmp/receipt.jpg", prompt=prompt)
print(result)

The model returns a JSON object matching your schema with values extracted from the image. This is particularly useful for invoices, ID cards, and forms where you know the structure upfront.
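In practice, models sometimes wrap the JSON in a markdown code fence, so it pays to parse the reply tolerantly. A minimal sketch, assuming the reply contains exactly one JSON object; the sample reply below is hypothetical.

```python
import json
import re


def parse_model_json(reply):
    """Extract the JSON object from a model reply, tolerating ```json fences."""
    # Grab everything from the first "{" to the last "}" and parse it.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


# Hypothetical reply illustrating the fenced case.
reply = '```json\n{"name": "Coop City", "total": "54.50"}\n```'
print(parse_model_json(reply)["total"])  # → 54.50
```

For production use you would also validate that the returned keys match the schema you asked for.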

Performance on Apple Silicon

On my M1 Pro MacBook, GLM-OCR processes a typical document image in about 40-50 seconds. Most of that time is spent encoding the image. Text generation is fast at around 60 tokens per second.

The model uses about 2.5GB of memory during inference. Any Mac with 8GB or more of unified memory will run it comfortably.

If speed is critical for production workloads, a cloud GPU will process images in 2-3 seconds instead of 40-50. But for development, testing, and low-volume use, running locally saves you from managing infrastructure entirely.

Running other models

Everything in this guide applies to any model in the Ollama library. To try a different OCR or vision model, just swap the model name.

ollama pull llama3.2-vision
ollama run llama3.2-vision "Describe this image: ./photo.jpg"

The API calls are identical. Change the model field in your requests and everything else stays the same.
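To make that concrete, you can factor the request body into a small builder where the model name is the only variable part. A sketch based on the /api/chat payload shape used earlier in this guide.

```python
def build_chat_payload(model, prompt, img_b64, num_ctx=16384):
    """Build an Ollama /api/chat request body; only the model field varies."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [img_b64]}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # keep the context fix for vision models
    }


payload = build_chat_payload("llama3.2-vision", "Describe this image:", "<b64>")
print(payload["model"])  # → llama3.2-vision
```

Swapping glm-ocr for llama3.2-vision is then a one-argument change in your own code.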

Wrapping up

GLM-OCR runs locally on a MacBook with Ollama in about five minutes of setup. Install Ollama, pull the model, set num_ctx to 16384 so it does not crash on images, and you have a working OCR API on localhost. No cloud accounts, no Docker, no GPU drivers.

The model handles text, tables, formulas, and structured extraction well for English and Chinese documents. For other languages, you will want a different model since GLM-OCR is bilingual only.

Let me know in the comments if you have questions, and subscribe for more practical development guides.

Thanks, Matija
