OCR Fine-Tuning
Fine-tune OCR models on Apple Silicon. mlx-tune supports both dedicated OCR architectures and general VLMs adapted for document understanding — all with native LoRA, built-in evaluation metrics, and CER-based GRPO training.
Two tracks for OCR
mlx-tune provides two complementary approaches to OCR fine-tuning, depending on your accuracy and latency requirements.
Dedicated OCR models
Purpose-built for text recognition. These models (DeepSeek-OCR, GLM-OCR, DOTS-OCR) are small, fast, and optimized for document transcription. They achieve high accuracy on structured text (invoices, receipts, printed documents) while running efficiently on Apple Silicon.
General VLMs
Large vision-language models like Qwen3.5, Qwen2.5-VL, and Pixtral can be fine-tuned for OCR tasks using the same VLM pipeline. This approach excels on complex layouts, handwriting, and documents requiring reasoning (e.g., form understanding, table extraction). Use FastVisionModel from the VLM track for these models.
What makes mlx-tune OCR special
- Apple Silicon native — No CUDA required. Train and evaluate OCR models entirely on your Mac using the MLX framework.
- LoRA fine-tuning — Efficient adapter-based training. Fine-tune dedicated OCR models with as little as 8 GB unified memory.
- Built-in evaluation — CER, WER, and exact match metrics computed automatically during training and on demand.
- GRPO with OCR rewards — Train with character error rate-based reward functions to directly optimize transcription quality.
- Batch transcription — Process entire document sets efficiently with batch_transcribe().
Supported OCR Models
Dedicated OCR models supported by FastOCRModel, plus general VLMs that can be fine-tuned for OCR via the VLM track.
Dedicated OCR Models
| Model | Parameters | Type | Quantization |
|---|---|---|---|
| DeepSeek-OCR | 0.9B | Dedicated | 4-bit, bf16 |
| DeepSeek-OCR-2 | 1B | Dedicated | 4-bit, bf16 |
| GLM-OCR | 0.9B | Dedicated | 4-bit, bf16 |
| DOTS-OCR | varies | Dedicated | 4-bit, bf16 |
Derived OCR Models
| Model | Parameters | Type | Quantization |
|---|---|---|---|
| olmOCR-2 | 7B | VLM-derived | 4-bit, 8-bit |
| LightOnOCR-1B | 1B | VLM-derived | 4-bit, bf16 |
General VLMs for OCR
| Model | Parameters | Type | Quantization |
|---|---|---|---|
| Qwen3.5 | 0.8B–32B | General VLM | 4-bit, 8-bit, bf16 |
| Qwen2.5-VL | 3B–72B | General VLM | 4-bit, 8-bit, bf16 |
| Pixtral | 12B | General VLM | 4-bit, 8-bit |
Use FastOCRModel for dedicated and derived OCR models listed above. For general VLMs (Qwen3.5, Qwen2.5-VL, Pixtral), use FastVisionModel from the VLM track — they share the same training pipeline.
FastOCRModel
mlx_tune.ocr.FastOCRModel.from_pretrained()
Load a dedicated OCR model from HuggingFace. Returns a model wrapper and a processor that handles both image preprocessing and text tokenization.
| Parameter | Type | Description |
|---|---|---|
| model_name | str | HuggingFace model ID (e.g., "deepseek-ai/DeepSeek-OCR") or local path |
| max_seq_length | int | Maximum sequence length. OCR outputs can be long; 4096 is recommended |
| load_in_4bit | bool | Load model with 4-bit quantization (reduces memory by ~75%) |
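A minimal load call using the parameters above (the model ID is taken from the supported list):

from mlx_tune.ocr import FastOCRModel

# Load a dedicated OCR model in 4-bit to cut memory use by roughly 75%
model, processor = FastOCRModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    max_seq_length=4096,   # OCR outputs can be long
    load_in_4bit=True,
)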
FastOCRModel.get_peft_model()
Add LoRA adapters to the OCR model. By default, only the language layers are fine-tuned — the vision encoder is kept frozen since dedicated OCR models already have strong visual features.
| Parameter | Type | Default | Description |
|---|---|---|---|
| finetune_vision_layers | bool | False | Apply LoRA to vision encoder. Usually not needed for dedicated OCR models |
| finetune_language_layers | bool | True | Apply LoRA to language model layers |
| r | int | 16 | LoRA rank |
| lora_alpha | int | 16 | LoRA scaling factor. Recommended: equal to r |
Dedicated OCR models are pre-trained with vision encoders specifically optimized for document images. Fine-tuning only the language layers is usually sufficient and significantly faster. Enable finetune_vision_layers=True if you are working with unusual image formats (e.g., handwritten notes, degraded scans).
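For example, the defaults above written out explicitly (continuing from the model loaded via from_pretrained()):

model = FastOCRModel.get_peft_model(
    model,
    finetune_language_layers=True,   # LoRA on the language model (the default)
    finetune_vision_layers=False,    # set True for handwriting or degraded scans
    r=16,
    lora_alpha=16,                   # recommended: equal to r
)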
FastOCRModel.for_training()
Enable training mode. Required before starting any training loop.
FastOCRModel.for_inference()
Enable inference mode. Activates KV caching and disables dropout. Always call before generating or evaluating.
OCRModelWrapper
mlx_tune.ocr.OCRModelWrapper
Wrapper returned by FastOCRModel.from_pretrained(). Provides OCR-specific methods for transcription, batch processing, evaluation, and model persistence.
model.transcribe()
Transcribe text from a single image. Returns the extracted text as a string.
| Parameter | Type | Description |
|---|---|---|
| image | str or PIL.Image | Path to an image file, or a PIL Image object |
| prompt | str | Instruction prompt for the OCR model |
| max_tokens | int | Maximum number of tokens to generate |
| temperature | float | Sampling temperature (0.0 = greedy, recommended for OCR) |
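A minimal transcription call (the file name and prompt string are illustrative):

text = model.transcribe(
    "invoice_001.png",
    prompt="Transcribe all text in this document.",
    max_tokens=1024,
    temperature=0.0,   # greedy decoding, recommended for OCR
)
print(text)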
model.batch_transcribe()
Transcribe text from a batch of images. Processes images sequentially and returns a list of transcription strings.
| Parameter | Type | Description |
|---|---|---|
| images | list[str] or list[PIL.Image] | List of image paths or PIL Image objects |
| prompt | str | Instruction prompt applied to all images |
| max_tokens | int | Maximum tokens per transcription |
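A sketch of batch usage (paths are illustrative):

paths = ["page_01.png", "page_02.png", "page_03.png"]
texts = model.batch_transcribe(
    paths,
    prompt="Transcribe all text in this document.",
    max_tokens=2048,
)
# Results come back in input order, one string per image
for path, text in zip(paths, texts):
    print(path, "->", text[:60])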
model.evaluate()
Evaluate model transcription quality against ground truth references. Returns a dictionary with CER, WER, and exact match scores.
| Parameter | Type | Description |
|---|---|---|
| images | list[str] or list[PIL.Image] | Images to transcribe |
| references | list[str] | Ground truth text for each image |
| prompt | str | Instruction prompt |
Returns:
{
"cer": 0.032, # Character Error Rate (lower is better)
"wer": 0.087, # Word Error Rate (lower is better)
"exact_match": 0.85, # Fraction of perfect transcriptions
"num_samples": 100,
}
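For example (paths and reference strings are illustrative):

metrics = model.evaluate(
    images=["doc_01.png", "doc_02.png"],
    references=["Invoice #1001\nTotal: $42.50", "Receipt #2002\nTotal: $18.00"],
    prompt="Transcribe all text in this document.",
)
print(f"CER: {metrics['cer']:.3f} on {metrics['num_samples']} samples")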
model.save_pretrained()
Save LoRA adapters to disk. Writes adapters.safetensors, adapter_config.json, and config.json.
model.load_adapter()
Load previously saved LoRA adapters into the model.
model.save_pretrained_merged()
Fuse LoRA weights into the base model and save the full merged model.
Training
mlx_tune.ocr.OCRSFTTrainer
Supervised fine-tuning trainer for OCR models. Handles forward pass, loss computation, and gradient updates with OCR-optimized defaults.
| Parameter | Type | Description |
|---|---|---|
| model | OCRModelWrapper | Model with LoRA adapters configured |
| tokenizer | Processor | Processor from FastOCRModel.from_pretrained() |
| data_collator | OCRDataCollator | Data collator for image/text batching |
| train_dataset | Dataset | HuggingFace dataset with image and text fields |
| args | OCRSFTConfig | Training configuration |
OCRSFTConfig
Training configuration for OCR fine-tuning.
| Parameter | Default | Description |
|---|---|---|
| learning_rate | 5e-5 | Peak learning rate. Lower than VLM default since OCR models converge quickly |
| max_length | 4096 | Maximum sequence length for OCR outputs (long documents need high values) |
| max_steps | 100 | Total training steps |
| per_device_train_batch_size | 1 | Batch size (forced to 1, same as VLM) |
| output_dir | "./ocr_outputs" | Directory for checkpoints and logs |
| train_on_completions | True | Compute loss only on transcription tokens, not the prompt |
| gradient_accumulation_steps | 4 | Steps to accumulate gradients before updating |
| logging_steps | 5 | Log metrics every N steps |
| save_steps | 50 | Save checkpoint every N steps |
OCRDataCollator
Data collator for OCR tasks. Handles image preprocessing, prompt formatting, and token preparation for training.
SFT usage example
from mlx_tune.ocr import FastOCRModel, OCRSFTTrainer, OCRSFTConfig, OCRDataCollator
from datasets import load_dataset
# Load a dedicated OCR model
model, processor = FastOCRModel.from_pretrained(
"deepseek-ai/DeepSeek-OCR",
max_seq_length=4096,
)
# Add LoRA (language layers only by default)
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)
# Load your OCR dataset (image + text pairs)
dataset = load_dataset("your-ocr-dataset", split="train[:200]")
# Train
FastOCRModel.for_training(model)
collator = OCRDataCollator(model, processor)
trainer = OCRSFTTrainer(
model=model, tokenizer=processor,
data_collator=collator, train_dataset=dataset,
args=OCRSFTConfig(output_dir="./ocr_output", max_steps=60),
)
trainer.train()
# Evaluate
FastOCRModel.for_inference(model)
metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}")
OCRGRPOTrainer
Group Relative Policy Optimization (GRPO) trainer for OCR models. Generates multiple transcriptions per image, scores them with character error rate-based reward functions, and updates the model to favor more accurate transcriptions.
| Parameter | Type | Description |
|---|---|---|
| model | OCRModelWrapper | Model with LoRA adapters from FastOCRModel.get_peft_model() |
| train_dataset | list[dict] | List of dicts with image (path or PIL Image) and text (ground truth) keys |
| processor | Processor | Processor from FastOCRModel.from_pretrained() |
| reward_fn | Callable | Function (transcription, ground_truth) → float that scores each transcription |
| args | OCRGRPOConfig | Training configuration |
OCRGRPOConfig
Configuration for OCR GRPO training.
| Parameter | Default | Description |
|---|---|---|
| beta | 0.04 | KL penalty coefficient |
| num_generations | 4 | Transcriptions generated per image for advantage estimation |
| temperature | 0.7 | Sampling temperature for generation |
| max_completion_length | 2048 | Maximum tokens per transcription |
| output_dir | "./ocr_grpo_outputs" | Directory for checkpoints |
| learning_rate | 1e-6 | Learning rate (lower than SFT) |
| max_steps | -1 | -1 trains for one full epoch |
| logging_steps | 1 | Log every N steps |
| save_steps | 100 | Save checkpoint every N steps |
For each document image, the trainer generates num_generations transcription attempts at non-zero temperature. Each transcription is scored by the reward_fn (typically based on CER against ground truth). The model is then updated via policy gradients to favor more accurate transcriptions while a KL penalty keeps it close to the reference policy.
GRPO usage example
from mlx_tune.ocr import FastOCRModel, OCRGRPOTrainer, OCRGRPOConfig, cer_reward
# Load and configure model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)
# Prepare dataset: list of dicts with image and ground truth text
ocr_data = [
{"image": "receipt_001.png", "text": "Total: $42.50\nTax: $3.19"},
{"image": "receipt_002.png", "text": "Subtotal: $18.00\nTip: $3.60"},
# ...
]
# Train with CER-based rewards
FastOCRModel.for_training(model)
trainer = OCRGRPOTrainer(
model=model,
train_dataset=ocr_data,
processor=processor,
reward_fn=cer_reward,
args=OCRGRPOConfig(num_generations=4, max_steps=20),
)
result = trainer.train()
Evaluation Metrics
mlx-tune provides three standard OCR evaluation metrics, computed by model.evaluate() or available as standalone functions.
Character Error Rate (CER)
The edit distance between predicted and reference text, normalized by reference length. The primary metric for OCR quality.
CER = edit_distance(prediction, reference) / len(reference)
# Example:
# prediction: "Helo World"
# reference: "Hello World"
# edit_distance = 1 (missing 'l')
# CER = 1 / 11 = 0.091
Range: 0.0 (perfect) to 1.0+ (CER can exceed 1.0 if the prediction is much longer than the reference).
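For reference, a minimal Python sketch of the computation. mlx-tune computes this internally; edit_distance here is a plain Levenshtein implementation written for illustration.

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming; works on any sequence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (x != y),    # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction, reference):
    return edit_distance(prediction, reference) / len(reference)

assert round(cer("Helo World", "Hello World"), 3) == 0.091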
Word Error Rate (WER)
The edit distance computed at the word level. More interpretable for natural language but less granular than CER.
WER = edit_distance(prediction_words, reference_words) / len(reference_words)
# Example:
# prediction: "Hello Wrold"
# reference: "Hello World"
# word edit_distance = 1 ("Wrold" != "World")
# WER = 1 / 2 = 0.50
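Since the edit_distance sketch above works on any sequence, WER is the same computation applied to whitespace-split words:

def wer(prediction, reference):
    ref_words = reference.split()
    return edit_distance(prediction.split(), ref_words) / len(ref_words)

assert wer("Hello Wrold", "Hello World") == 0.5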
Exact Match
The fraction of samples where the prediction matches the reference exactly (after whitespace normalization).
exact_match = sum(pred.strip() == ref.strip() for pred, ref in pairs) / len(pairs)
# Example with 4 samples:
# 3 exact matches out of 4
# exact_match = 0.75
Reward Functions
Built-in reward functions for OCR GRPO training. These can be used directly with OCRGRPOTrainer or combined into custom reward functions.
cer_reward
Returns 1.0 - CER, clamped to [0.0, 1.0]. A perfect transcription scores 1.0; a completely wrong one scores 0.0.
from mlx_tune.ocr import cer_reward
score = cer_reward("Hello World", "Hello World") # 1.0
score = cer_reward("Helo World", "Hello World") # 0.909
score = cer_reward("xyz", "Hello World")         # 0.0 (no characters match, so CER = 1.0)
exact_match_reward
Returns 1.0 if the transcription exactly matches the ground truth (after whitespace normalization), 0.0 otherwise. A strict binary reward.
from mlx_tune.ocr import exact_match_reward
score = exact_match_reward("Hello World", "Hello World") # 1.0
score = exact_match_reward("Hello World ", "Hello World") # 1.0 (whitespace normalized)
score = exact_match_reward("Helo World", "Hello World") # 0.0
combined_ocr_reward
Weighted combination of CER reward and exact match reward. Balances continuous improvement (CER) with the incentive for perfect transcriptions (exact match).
from mlx_tune.ocr import combined_ocr_reward
# Combine CER (70%) and exact match (30%)
score = combined_ocr_reward("Hello World", "Hello World") # 1.0
score = combined_ocr_reward("Helo World", "Hello World") # 0.636
# Custom weights
score = combined_ocr_reward("Helo World", "Hello World",
cer_weight=0.5, exact_weight=0.5) # 0.455
Custom reward function
You can write your own reward function for domain-specific OCR tasks.
import re

# cer_reward is documented above; cer (the raw metric) is assumed to be
# exported alongside the standalone metric functions
from mlx_tune.ocr import cer, cer_reward
def receipt_reward(transcription, ground_truth):
"""Reward that prioritizes getting dollar amounts correct."""
# Extract dollar amounts from both strings
pred_amounts = set(re.findall(r'\$[\d.]+', transcription))
true_amounts = set(re.findall(r'\$[\d.]+', ground_truth))
if not true_amounts:
return cer_reward(transcription, ground_truth)
# 60% weight on amount accuracy, 40% on overall CER
amount_score = len(pred_amounts & true_amounts) / len(true_amounts)
cer_score = max(0.0, 1.0 - cer(transcription, ground_truth))
return 0.6 * amount_score + 0.4 * cer_score
Save, load, and merge
After training, save adapters for later use, reload them, or merge LoRA weights into the base model.
Save adapters
# Save LoRA adapters only (small files)
model.save_pretrained("./ocr_adapters")
Load adapters
# Load adapters into a fresh model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")
Merge LoRA into base model
# Fuse LoRA weights and save full model
model.save_pretrained_merged("./ocr_merged", processor)
Evaluate after loading
# Evaluate a saved model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")
FastOCRModel.for_inference(model)
metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}, "
f"Exact Match: {metrics['exact_match']:.1%}")
Working examples
Complete scripts you can run directly.
33 — DeepSeek-OCR Fine-Tuning
Fine-tune DeepSeek-OCR on document images with LoRA. Includes evaluation with CER/WER metrics.
34 — GLM-OCR Fine-Tuning
Fine-tune GLM-OCR for specialized document types. Receipt and invoice transcription with custom prompts.
35 — OCR GRPO Training
Train an OCR model with GRPO using CER-based rewards. Directly optimizes transcription accuracy via reinforcement learning.
36 — olmOCR-2 Fine-Tuning
Fine-tune the 7B olmOCR-2 model for complex document understanding. Tables, forms, and multi-column layouts.
37 — OCR Evaluation Pipeline
Complete evaluation pipeline: load a fine-tuned model, run batch transcription, compute CER/WER/exact match, and compare against baseline.
Best practices
Set temperature=0.0 for transcription and evaluation. Non-zero temperature introduces randomness that reduces OCR accuracy. Only use non-zero temperature during GRPO training where diversity is needed for advantage estimation.
Default max_length=4096 is sufficient for most single-page documents. For multi-page or dense documents, increase to 8192 or higher. Monitor for truncation in your evaluation metrics.
Dedicated OCR models (DeepSeek-OCR, GLM-OCR) are smaller and faster but focused on text extraction. General VLMs (Qwen3.5, Qwen2.5-VL) are larger but can handle complex tasks like table understanding, form field extraction, and document reasoning. Choose based on your task complexity.
OCR models work best with clean, well-lit images. For degraded scans, consider preprocessing with contrast enhancement or deskewing before training and inference. The processor handles resizing automatically.
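One possible preprocessing pass using Pillow (illustrative, not part of mlx-tune):

from PIL import Image, ImageEnhance, ImageOps

def preprocess_scan(path):
    img = Image.open(path).convert("L")             # grayscale
    img = ImageOps.autocontrast(img)                # stretch contrast
    img = ImageEnhance.Sharpness(img).enhance(1.5)  # mild sharpening
    return img.convert("RGB")                       # back to RGB for the processor

text = model.transcribe(
    preprocess_scan("degraded_scan.png"),
    prompt="Transcribe all text in this document.",
    temperature=0.0,
)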
Dedicated OCR models (0.9B–1B) run comfortably on 8 GB unified memory. The 7B olmOCR-2 in 4-bit quantization needs 16 GB+. General VLMs follow the same memory guidelines as the VLM track.
Like VLM training, OCR training is forced to batch_size=1 since each document image produces a different number of vision tokens. Use gradient_accumulation_steps to simulate larger effective batch sizes.
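For example, a sketch of an effective batch size of 8 via the OCRSFTConfig fields documented above:

args = OCRSFTConfig(
    per_device_train_batch_size=1,   # forced to 1 by variable vision-token counts
    gradient_accumulation_steps=8,   # gradients accumulate over 8 documents per update
)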