OCR Fine-Tuning

Fine-tune OCR models on Apple Silicon. mlx-tune supports both dedicated OCR architectures and general VLMs adapted for document understanding — all with native LoRA, built-in evaluation metrics, and CER-based GRPO training.

Two tracks for OCR

mlx-tune provides two complementary approaches to OCR fine-tuning, depending on your accuracy and latency requirements.

Dedicated OCR Models

Purpose-built for text recognition. These models (DeepSeek-OCR, GLM-OCR, DOTS-OCR) are small, fast, and optimized for document transcription. They achieve high accuracy on structured text (invoices, receipts, printed documents) while running efficiently on Apple Silicon.

General VLM → OCR

Large vision-language models like Qwen3.5, Qwen2.5-VL, and Pixtral can be fine-tuned for OCR tasks using the same VLM pipeline. This approach excels on complex layouts, handwriting, and documents requiring reasoning (e.g., form understanding, table extraction). Use FastVisionModel from the VLM track for these models.

What makes mlx-tune OCR special

  • Apple Silicon native — No CUDA required. Train and evaluate OCR models entirely on your Mac using the MLX framework.
  • LoRA fine-tuning — Efficient adapter-based training. Fine-tune dedicated OCR models with as little as 8 GB unified memory.
  • Built-in evaluation — CER, WER, and exact match metrics computed automatically during training and on demand.
  • GRPO with OCR rewards — Train with character error rate-based reward functions to directly optimize transcription quality.
  • Batch transcription — Process entire document sets efficiently with batch_transcribe().

Supported OCR Models

Dedicated OCR models supported by FastOCRModel, plus general VLMs that can be fine-tuned for OCR via the VLM track.

Dedicated OCR Models

Model           Parameters  Type       Quantization
DeepSeek-OCR    0.9B        Dedicated  4-bit, bf16
DeepSeek-OCR-2  1B          Dedicated  4-bit, bf16
GLM-OCR         0.9B        Dedicated  4-bit, bf16
DOTS-OCR        varies      Dedicated  4-bit, bf16

Derived OCR Models

Model          Parameters  Type         Quantization
olmOCR-2       7B          VLM-derived  4-bit, 8-bit
LightOnOCR-1B  1B          VLM-derived  4-bit, bf16

General VLMs for OCR

Model       Parameters  Type         Quantization
Qwen3.5     0.8B–32B    General VLM  4-bit, 8-bit, bf16
Qwen2.5-VL  3B–72B      General VLM  4-bit, 8-bit, bf16
Pixtral     12B         General VLM  4-bit, 8-bit

Which track to use?

Use FastOCRModel for dedicated and derived OCR models listed above. For general VLMs (Qwen3.5, Qwen2.5-VL, Pixtral), use FastVisionModel from the VLM track — they share the same training pipeline.

FastOCRModel

mlx_tune.ocr

FastOCRModel.from_pretrained()

FastOCRModel.from_pretrained(model_name, max_seq_length=4096, load_in_4bit=False, ...) → Tuple[OCRModelWrapper, Processor]

Load a dedicated OCR model from HuggingFace. Returns a model wrapper and a processor that handles both image preprocessing and text tokenization.

Parameter       Type  Description
model_name      str   HuggingFace model ID (e.g., "deepseek-ai/DeepSeek-OCR") or local path
max_seq_length  int   Maximum sequence length. OCR outputs can be long; 4096 is recommended
load_in_4bit    bool  Load model with 4-bit quantization (reduces memory by ~75%)

FastOCRModel.get_peft_model()

FastOCRModel.get_peft_model(model, finetune_vision_layers=False, finetune_language_layers=True, r=16, lora_alpha=16, ...) → OCRModelWrapper

Add LoRA adapters to the OCR model. By default, only the language layers are fine-tuned — the vision encoder is kept frozen since dedicated OCR models already have strong visual features.

Parameter                 Type  Default  Description
finetune_vision_layers    bool  False    Apply LoRA to vision encoder. Usually not needed for dedicated OCR models
finetune_language_layers  bool  True     Apply LoRA to language model layers
r                         int   16       LoRA rank
lora_alpha                int   16       LoRA scaling factor. Recommended: equal to r

Why vision_layers=False by default?

Dedicated OCR models are pre-trained with vision encoders specifically optimized for document images. Fine-tuning only the language layers is usually sufficient and significantly faster. Enable finetune_vision_layers=True if you are working with unusual image formats (e.g., handwritten notes, degraded scans).

FastOCRModel.for_training()

FastOCRModel.for_training(model)

Enable training mode. Required before starting any training loop.

FastOCRModel.for_inference()

FastOCRModel.for_inference(model)

Enable inference mode. Activates KV caching and disables dropout. Always call before generating or evaluating.

OCRModelWrapper

mlx_tune.ocr

Wrapper returned by FastOCRModel.from_pretrained(). Provides OCR-specific methods for transcription, batch processing, evaluation, and model persistence.

model.transcribe()

model.transcribe(image, prompt="Transcribe this image.", max_tokens=2048, temperature=0.0) → str

Transcribe text from a single image. Returns the extracted text as a string.

Parameter    Type              Description
image        str or PIL.Image  Path to an image file, or a PIL Image object
prompt       str               Instruction prompt for the OCR model
max_tokens   int               Maximum number of tokens to generate
temperature  float             Sampling temperature (0.0 = greedy, recommended for OCR)

model.batch_transcribe()

model.batch_transcribe(images, prompt="Transcribe this image.", max_tokens=2048) → list[str]

Transcribe text from a batch of images. Processes images sequentially and returns a list of transcription strings.

Parameter   Type                          Description
images      list[str] or list[PIL.Image]  List of image paths or PIL Image objects
prompt      str                           Instruction prompt applied to all images
max_tokens  int                           Maximum tokens per transcription

model.evaluate()

model.evaluate(images, references, prompt="Transcribe this image.") → dict

Evaluate model transcription quality against ground truth references. Returns a dictionary with CER, WER, and exact match scores.

Parameter   Type                          Description
images      list[str] or list[PIL.Image]  Images to transcribe
references  list[str]                     Ground truth text for each image
prompt      str                           Instruction prompt

Returns:

{
    "cer": 0.032,          # Character Error Rate (lower is better)
    "wer": 0.087,          # Word Error Rate (lower is better)
    "exact_match": 0.85,   # Fraction of perfect transcriptions
    "num_samples": 100,
}

model.save_pretrained()

model.save_pretrained(output_dir)

Save LoRA adapters to disk. Writes adapters.safetensors, adapter_config.json, and config.json.

model.load_adapter()

model.load_adapter(adapter_path)

Load previously saved LoRA adapters into the model.

model.save_pretrained_merged()

model.save_pretrained_merged(output_dir, processor)

Fuse LoRA weights into the base model and save the full merged model.

Training

mlx_tune.ocr

OCRSFTTrainer

OCRSFTTrainer(model, tokenizer, data_collator, train_dataset, args=None)

Supervised fine-tuning trainer for OCR models. Handles forward pass, loss computation, and gradient updates with OCR-optimized defaults.

Parameter      Type             Description
model          OCRModelWrapper  Model with LoRA adapters configured
tokenizer      Processor        Processor from FastOCRModel.from_pretrained()
data_collator  OCRDataCollator  Data collator for image/text batching
train_dataset  Dataset          HuggingFace dataset with image and text fields
args           OCRSFTConfig     Training configuration

trainer.train() — Start training. Returns training statistics.

OCRSFTConfig

OCRSFTConfig(learning_rate=5e-5, max_length=4096, max_steps=100, per_device_train_batch_size=1, output_dir="./ocr_outputs", train_on_completions=True, gradient_accumulation_steps=4, logging_steps=5, save_steps=50, ...)

Training configuration for OCR fine-tuning.

Parameter                    Default          Description
learning_rate                5e-5             Peak learning rate. Lower than the VLM default since OCR models converge quickly
max_length                   4096             Maximum sequence length for OCR outputs (long documents need high values)
max_steps                    100              Total training steps
per_device_train_batch_size  1                Batch size (forced to 1, same as VLM)
output_dir                   "./ocr_outputs"  Directory for checkpoints and logs
train_on_completions         True             Compute loss only on transcription tokens, not the prompt
gradient_accumulation_steps  4                Steps to accumulate gradients before updating
logging_steps                5                Log metrics every N steps
save_steps                   50               Save checkpoint every N steps

OCRDataCollator

OCRDataCollator(model, processor)

Data collator for OCR tasks. Handles image preprocessing, prompt formatting, and token preparation for training.

SFT usage example

from mlx_tune.ocr import FastOCRModel, OCRSFTTrainer, OCRSFTConfig, OCRDataCollator
from datasets import load_dataset

# Load a dedicated OCR model
model, processor = FastOCRModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    max_seq_length=4096,
)

# Add LoRA (language layers only by default)
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)

# Load your OCR dataset (image + text pairs)
dataset = load_dataset("your-ocr-dataset", split="train[:200]")

# Train
FastOCRModel.for_training(model)
collator = OCRDataCollator(model, processor)
trainer = OCRSFTTrainer(
    model=model, tokenizer=processor,
    data_collator=collator, train_dataset=dataset,
    args=OCRSFTConfig(output_dir="./ocr_output", max_steps=60),
)
trainer.train()

# Evaluate
FastOCRModel.for_inference(model)
metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}")

OCRGRPOTrainer

OCRGRPOTrainer(model, train_dataset, processor, reward_fn, args=None)

Group Relative Policy Optimization (GRPO) trainer for OCR models. Generates multiple transcriptions per image, scores them with character error rate-based reward functions, and updates the model to favor more accurate transcriptions.

Parameter      Type             Description
model          OCRModelWrapper  Model with LoRA adapters from FastOCRModel.get_peft_model()
train_dataset  list[dict]       List of dicts with image (path or PIL Image) and text (ground truth) keys
processor      Processor        Processor from FastOCRModel.from_pretrained()
reward_fn      Callable         Function (transcription, ground_truth) → float that scores each transcription
args           OCRGRPOConfig    Training configuration

trainer.train() — Start GRPO training. Returns training statistics.

OCRGRPOConfig

OCRGRPOConfig(beta=0.04, num_generations=4, temperature=0.7, max_completion_length=2048, output_dir="./ocr_grpo_outputs", learning_rate=1e-6, max_steps=-1, logging_steps=1, save_steps=100)

Configuration for OCR GRPO training.

Parameter              Default               Description
beta                   0.04                  KL penalty coefficient
num_generations        4                     Transcriptions generated per image for advantage estimation
temperature            0.7                   Sampling temperature for generation
max_completion_length  2048                  Maximum tokens per transcription
output_dir             "./ocr_grpo_outputs"  Directory for checkpoints
learning_rate          1e-6                  Learning rate (lower than SFT)
max_steps              -1                    -1 trains for one full epoch
logging_steps          1                     Log every N steps
save_steps             100                   Save checkpoint every N steps

How OCR GRPO works

For each document image, the trainer generates num_generations transcription attempts at non-zero temperature. Each transcription is scored by the reward_fn (typically based on CER against ground truth). The model is then updated via policy gradients to favor more accurate transcriptions while a KL penalty keeps it close to the reference policy.
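The group-relative scoring step can be sketched in a few lines. This is an illustrative sketch of the idea, not mlx-tune's internal code, and the helper name group_relative_advantages is hypothetical: each generation's reward (e.g. from cer_reward) is compared against its own group's mean.

```python
import statistics

def group_relative_advantages(rewards):
    """Score each transcription attempt relative to its group.

    rewards: per-generation rewards for one image (e.g. 1 - CER).
    Returns mean-centered, std-normalized advantages; a group where
    every attempt scores the same yields all-zero advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four attempts at one receipt: the most accurate transcription gets
# the largest positive advantage, the worst the most negative.
adv = group_relative_advantages([0.95, 0.80, 0.80, 0.45])
```

Transcriptions with positive advantage are reinforced; the KL penalty (beta) keeps the updated policy from drifting far from the reference model.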

GRPO usage example

from mlx_tune.ocr import FastOCRModel, OCRGRPOTrainer, OCRGRPOConfig, cer_reward

# Load and configure model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)

# Prepare dataset: list of dicts with image and ground truth text
ocr_data = [
    {"image": "receipt_001.png", "text": "Total: $42.50\nTax: $3.19"},
    {"image": "receipt_002.png", "text": "Subtotal: $18.00\nTip: $3.60"},
    # ...
]

# Train with CER-based rewards
FastOCRModel.for_training(model)
trainer = OCRGRPOTrainer(
    model=model,
    train_dataset=ocr_data,
    processor=processor,
    reward_fn=cer_reward,
    args=OCRGRPOConfig(num_generations=4, max_steps=20),
)
result = trainer.train()

Evaluation Metrics

mlx-tune provides three standard OCR evaluation metrics, computed by model.evaluate() or available as standalone functions.

Character Error Rate (CER)

The edit distance between predicted and reference text, normalized by reference length. The primary metric for OCR quality.

CER = edit_distance(prediction, reference) / len(reference)

# Example:
# prediction: "Helo World"
# reference:  "Hello World"
# edit_distance = 1 (missing 'l')
# CER = 1 / 11 = 0.091

Range: 0.0 (perfect) to 1.0+ (CER can exceed 1.0 if the prediction is much longer than the reference).
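For reference, the formula above can be computed with the classic Levenshtein dynamic program. This is an illustrative sketch, not mlx-tune's internal metric code, which may batch or normalize differently:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate; can exceed 1.0 for overlong predictions."""
    return edit_distance(prediction, reference) / len(reference)

print(round(cer("Helo World", "Hello World"), 3))  # 0.091
```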

Word Error Rate (WER)

The edit distance computed at the word level. More interpretable for natural language but less granular than CER.

WER = edit_distance(prediction_words, reference_words) / len(reference_words)

# Example:
# prediction: "Hello Wrold"
# reference:  "Hello World"
# word edit_distance = 1 ("Wrold" != "World")
# WER = 1 / 2 = 0.50
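WER is the same dynamic program applied to word sequences instead of characters. A sketch, assuming simple whitespace tokenization (mlx-tune's exact tokenization may differ):

```python
def word_edit_distance(pred_words, ref_words):
    """Levenshtein distance over word sequences."""
    prev = list(range(len(ref_words) + 1))
    for i, pw in enumerate(pred_words, 1):
        curr = [i]
        for j, rw in enumerate(ref_words, 1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (pw != rw)))  # substitution
        prev = curr
    return prev[-1]

def wer(prediction: str, reference: str) -> float:
    """Word Error Rate over whitespace-split tokens."""
    ref_words = reference.split()
    return word_edit_distance(prediction.split(), ref_words) / len(ref_words)

print(wer("Hello Wrold", "Hello World"))  # 0.5
```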

Exact Match

The fraction of samples where the prediction matches the reference exactly (after whitespace normalization).

exact_match = sum(pred.strip() == ref.strip() for pred, ref in pairs) / len(pairs)

# Example with 4 samples:
# 3 exact matches out of 4
# exact_match = 0.75

Reward Functions

Built-in reward functions for OCR GRPO training. These can be used directly with OCRGRPOTrainer or combined into custom reward functions.

cer_reward

cer_reward(transcription, ground_truth) → float

Returns 1.0 - CER, clamped to [0.0, 1.0]. A perfect transcription scores 1.0; a completely wrong one scores 0.0.

from mlx_tune.ocr import cer_reward

score = cer_reward("Hello World", "Hello World")  # 1.0
score = cer_reward("Helo World", "Hello World")   # 0.909
score = cer_reward("xyz", "Hello World")          # 0.0 (CER is 1.0, so the reward clamps to 0.0)

exact_match_reward

exact_match_reward(transcription, ground_truth) → float

Returns 1.0 if the transcription exactly matches the ground truth (after whitespace normalization), 0.0 otherwise. A strict binary reward.

from mlx_tune.ocr import exact_match_reward

score = exact_match_reward("Hello World", "Hello World")   # 1.0
score = exact_match_reward("Hello World ", "Hello World")  # 1.0 (whitespace normalized)
score = exact_match_reward("Helo World", "Hello World")    # 0.0

combined_ocr_reward

combined_ocr_reward(transcription, ground_truth, cer_weight=0.7, exact_weight=0.3) → float

Weighted combination of CER reward and exact match reward. Balances continuous improvement (CER) with the incentive for perfect transcriptions (exact match).

from mlx_tune.ocr import combined_ocr_reward

# Combine CER (70%) and exact match (30%)
score = combined_ocr_reward("Hello World", "Hello World")   # 1.0
score = combined_ocr_reward("Helo World", "Hello World")     # 0.636

# Custom weights
score = combined_ocr_reward("Helo World", "Hello World",
                            cer_weight=0.5, exact_weight=0.5)  # 0.455

Custom reward function

You can write your own reward function for domain-specific OCR tasks.

import re

from mlx_tune.ocr import cer_reward

def receipt_reward(transcription, ground_truth):
    """Reward that prioritizes getting dollar amounts correct."""
    # Extract dollar amounts from both strings
    pred_amounts = set(re.findall(r'\$[\d.]+', transcription))
    true_amounts = set(re.findall(r'\$[\d.]+', ground_truth))

    if not true_amounts:
        return cer_reward(transcription, ground_truth)

    # 60% weight on amount accuracy, 40% on overall CER
    amount_score = len(pred_amounts & true_amounts) / len(true_amounts)
    cer_score = cer_reward(transcription, ground_truth)  # already 1 - CER, clamped to [0, 1]
    return 0.6 * amount_score + 0.4 * cer_score

Save, load, and merge

After training, save adapters for later use, reload them, or merge LoRA weights into the base model.

Save adapters

# Save LoRA adapters only (small files)
model.save_pretrained("./ocr_adapters")

Load adapters

# Load adapters into a fresh model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")

Merge LoRA into base model

# Fuse LoRA weights and save full model
model.save_pretrained_merged("./ocr_merged", processor)

Evaluate after loading

# Evaluate a saved model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")
FastOCRModel.for_inference(model)

metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}, "
      f"Exact Match: {metrics['exact_match']:.1%}")

Working examples

Complete scripts you can run directly.

33 — DeepSeek-OCR Fine-Tuning

Fine-tune DeepSeek-OCR on document images with LoRA. Includes evaluation with CER/WER metrics.

OCRSFTDeepSeek

View source →

34 — GLM-OCR Fine-Tuning

Fine-tune GLM-OCR for specialized document types. Receipt and invoice transcription with custom prompts.

OCRSFTGLM

View source →

35 — OCR GRPO Training

Train an OCR model with GRPO using CER-based rewards. Directly optimizes transcription accuracy via reinforcement learning.

OCRGRPORL

View source →

36 — olmOCR-2 Fine-Tuning

Fine-tune the 7B olmOCR-2 model for complex document understanding. Tables, forms, and multi-column layouts.

OCRSFT7B

View source →

37 — OCR Evaluation Pipeline

Complete evaluation pipeline: load a fine-tuned model, run batch transcription, compute CER/WER/exact match, and compare against baseline.

OCREvaluationMetrics

View source →

Best practices

Use greedy decoding for OCR

Set temperature=0.0 for transcription and evaluation. Non-zero temperature introduces randomness that reduces OCR accuracy. Only use non-zero temperature during GRPO training where diversity is needed for advantage estimation.

Increase max_length for long documents

The default max_length=4096 is sufficient for most single-page documents. For multi-page or dense documents, increase it to 8192 or higher, and watch your evaluation metrics for signs of truncation.

Dedicated vs. general VLMs

Dedicated OCR models (DeepSeek-OCR, GLM-OCR) are smaller and faster but focused on text extraction. General VLMs (Qwen3.5, Qwen2.5-VL) are larger but can handle complex tasks like table understanding, form field extraction, and document reasoning. Choose based on your task complexity.

Image preprocessing

OCR models work best with clean, well-lit images. For degraded scans, consider preprocessing with contrast enhancement or deskewing before training and inference. The processor handles resizing automatically.

Memory requirements

Dedicated OCR models (0.9B–1B) run comfortably on 8 GB unified memory. The 7B olmOCR-2 in 4-bit quantization needs 16 GB+. General VLMs follow the same memory guidelines as the VLM track.
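Those figures follow from a back-of-envelope estimate of weight memory alone (parameter count times bits per weight, divided by 8). It ignores activations, KV cache, and optimizer state, so treat it as a lower bound:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# A 0.9B dedicated OCR model: bf16 vs 4-bit weights
print(weight_memory_gb(0.9e9, 16))  # 1.8 GB
print(weight_memory_gb(0.9e9, 4))   # 0.45 GB, ~75% smaller

# 7B olmOCR-2 in 4-bit: ~3.5 GB of weights before activations and
# KV cache, hence the 16 GB+ guideline above.
print(weight_memory_gb(7e9, 4))     # 3.5 GB
```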

Batch size must be 1

Like VLM training, OCR training is forced to batch_size=1 since each document image produces a different number of vision tokens. Use gradient_accumulation_steps to simulate larger effective batch sizes.