Ultimate Guide: Run GLM-OCR Locally on MacBook Fast
Step-by-step Ollama setup for GLM-OCR on macOS — pull the model, set num_ctx=16384, and run a local OpenAI‑compatible…

I spent an afternoon setting up GLM-OCR on RunPod Serverless with vLLM, custom Dockerfiles, CUDA version mismatches, and RunPod handler scripts. Then I realized the model is only 0.9B parameters and uses 2.5GB of memory. It runs on a MacBook.
If you just need document OCR for development, testing, or even light production use, you do not need cloud GPUs. This guide shows you how to go from zero to a working OCR API on your Mac in about five minutes. The same steps work for most models in the Ollama library.
Install Ollama
Ollama is a tool for running language models locally. It handles model downloads, quantization, and serves an API that is compatible with the OpenAI format. On macOS, install it with Homebrew.
brew install ollama
Start the Ollama service in the background so it runs automatically on login.
brew services start ollama
Ollama is now listening on http://localhost:11434. You can verify it is running with a quick health check.
curl http://localhost:11434/
You should see "Ollama is running" in the response.
Pull the GLM-OCR model
GLM-OCR is a 0.9B parameter vision model from Team GLM, designed specifically for document OCR. It handles text recognition, table extraction, formula parsing, and structured information extraction. The quantized version that Ollama downloads is about 2.2GB.
ollama pull glm-ocr
Once downloaded, confirm the model is available.
ollama list
You should see glm-ocr:latest in the output with a size of approximately 2.2GB.
The context size gotcha
This is the one thing that will trip you up. Ollama defaults to a context size of 4096 tokens, which is not enough for processing images. When GLM-OCR tries to encode an image with the default context, you get a cryptic crash.
GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed
The fix is to set num_ctx to at least 16384 when making requests. I will show this in every example below so you do not have to debug it yourself.
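If you would rather bake the larger context into the model once instead of passing it on every request, Ollama Modelfiles support this with the PARAMETER directive. Here is a minimal sketch; the glm-ocr-16k name is just a label I chose:

```shell
# Create a variant of glm-ocr with num_ctx preset to 16384
cat > Modelfile <<'EOF'
FROM glm-ocr
PARAMETER num_ctx 16384
EOF

ollama create glm-ocr-16k -f Modelfile

# Now every request to glm-ocr-16k uses the larger context automatically
ollama run glm-ocr-16k "Text Recognition: /tmp/receipt.jpg"
```

This is especially handy for clients that have no way to pass Ollama-specific options.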
Test from the command line
The simplest way to test is with Ollama's built-in CLI. Drag an image file into your terminal after the prompt.
ollama run glm-ocr "Text Recognition: ./path/to/your/document.png"
For a quick test, download a sample image first.
curl -sL -o /tmp/receipt.jpg "https://upload.wikimedia.org/wikipedia/commons/0/0b/ReceiptSwiss.jpg"
ollama run glm-ocr "Text Recognition: /tmp/receipt.jpg"
The model should return the text content of the receipt, including items, prices, and totals.
Use the API with Python
For integration into your own applications, Ollama serves an API on port 11434. Here is a complete working example that sends an image and gets back the recognized text.
# File: test_ocr.py
import base64
import json
import urllib.request

def ocr_image(image_path, prompt="Text Recognition:"):
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    data = json.dumps({
        "model": "glm-ocr",
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [img_b64]
            }
        ],
        "stream": False,
        "options": {"num_ctx": 16384}
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    resp = urllib.request.urlopen(req, timeout=120)
    result = json.loads(resp.read().decode())
    return result["message"]["content"]

if __name__ == "__main__":
    text = ocr_image("/tmp/receipt.jpg")
    print(text)
Run it with python3 test_ocr.py. On an M1 Pro, expect about 40-50 seconds for image processing and a few seconds for text generation. The num_ctx: 16384 option in the request is critical. Without it, the model crashes on any non-trivial image.
The script uses only standard library modules so there is nothing extra to install. If you prefer the requests library or the official OpenAI Python SDK, those work too since Ollama serves an OpenAI-compatible API.
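If you have a folder of documents rather than a single file, a thin wrapper around the ocr_image function above does the job. This is a sketch under my own naming (ocr_directory is not part of any API); the OCR callable is passed in as a parameter so you can swap in the requests or OpenAI SDK version, or stub it out in tests:

```python
from pathlib import Path

def ocr_directory(folder, ocr_fn, exts=(".png", ".jpg", ".jpeg")):
    """Run an OCR callable over every image in a folder.

    Returns a dict mapping filename -> recognized text. Files whose
    extension is not in `exts` are skipped.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in exts:
            results[path.name] = ocr_fn(str(path))
    return results
```

Called as ocr_directory("/tmp/scans", ocr_image), it processes images one at a time, which matches how Ollama serves a single local model anyway.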
Use the OpenAI-compatible API
Ollama also serves an OpenAI-compatible endpoint at http://localhost:11434/v1. This means you can use the OpenAI Python SDK or any tool that supports custom API base URLs.
# File: test_ocr_openai.py
import base64
from openai import OpenAI

client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1",
)

with open("/tmp/receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
                },
                {"type": "text", "text": "Text Recognition:"}
            ]
        }
    ],
)

print(response.choices[0].message.content)
This requires pip install openai but gives you the standard OpenAI interface. If you later move to a cloud-hosted model, you only change the base_url and api_key. One caveat: the OpenAI request schema has no field for Ollama's options, so you cannot pass num_ctx here directly. If you hit the context crash through this endpoint, bake num_ctx into a model variant with a Modelfile and point the SDK at that model instead.
Supported prompts
GLM-OCR is not a general-purpose vision model. It responds to specific prompt formats.
For document parsing, use one of these exact strings as the text content:
Text Recognition: extracts raw text from the image
Formula Recognition: extracts mathematical formulas as LaTeX
Table Recognition: extracts table structures
For structured information extraction, provide a JSON schema. The model fills in the values from the document.
prompt = """Please output the information in the image in the following JSON format:
{"name": "", "date": "", "total": "", "items": []}"""
result = ocr_image("/tmp/receipt.jpg", prompt=prompt)
print(result)
The model returns a JSON object matching your schema with values extracted from the image. This is particularly useful for invoices, ID cards, and forms where you know the structure upfront.
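In practice, models sometimes wrap the JSON in markdown fences or add a sentence around it, so it pays to pull out the object before parsing. A small helper along these lines works (parse_model_json is my own name, not part of any library):

```python
import json
import re

def parse_model_json(raw):
    """Extract and parse a JSON object from raw model output.

    Grabs the outermost {...} span so markdown fences or surrounding
    prose around the JSON do not break json.loads.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

With this in place, parse_model_json(result) gives you a plain dict you can validate against your schema before trusting the extracted values.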
Performance on Apple Silicon
On my M1 Pro MacBook, GLM-OCR processes a typical document image in about 40-50 seconds. Most of that time is spent encoding the image. Text generation is fast at around 60 tokens per second.
The model uses about 2.5GB of memory during inference. Any Mac with 8GB or more of unified memory will run it comfortably.
If speed is critical for production workloads, a cloud GPU will process images in 2-3 seconds instead of 40-50. But for development, testing, and low-volume use, running locally saves you from managing infrastructure entirely.
Running other models
Everything in this guide applies to any model in the Ollama library. To try a different OCR or vision model, just swap the model name.
ollama pull llama3.2-vision
ollama run llama3.2-vision "Describe this image: ./photo.jpg"
The API calls are identical. Change the model field in your requests and everything else stays the same.
Wrapping up
GLM-OCR runs locally on a MacBook with Ollama in about five minutes of setup. Install Ollama, pull the model, set num_ctx to 16384 so it does not crash on images, and you have a working OCR API on localhost. No cloud accounts, no Docker, no GPU drivers.
The model handles text, tables, formulas, and structured extraction well for English and Chinese documents. For other languages, you will want a different model since GLM-OCR is bilingual only.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija


