OCR Fine-Tuning

Fine-tune OCR models on Apple Silicon. mlx-tune supports both dedicated OCR architectures and general VLMs adapted for document understanding — all with native LoRA, built-in evaluation metrics, and CER-based GRPO training.

Two tracks for OCR

mlx-tune provides two complementary approaches to OCR fine-tuning, depending on your accuracy and latency requirements.

Dedicated OCR Models

Purpose-built for text recognition. These models (DeepSeek-OCR, GLM-OCR, DOTS-OCR) are small, fast, and optimized for document transcription. They achieve high accuracy on structured text (invoices, receipts, printed documents) while running efficiently on Apple Silicon.

General VLM → OCR

Large vision-language models like Qwen3.5, Qwen2.5-VL, and Pixtral can be fine-tuned for OCR tasks using the same VLM pipeline. This approach excels on complex layouts, handwriting, and documents requiring reasoning (e.g., form understanding, table extraction). Use FastVisionModel from the VLM track for these models.

What makes mlx-tune OCR special

  • Apple Silicon native — No CUDA required. Train and evaluate OCR models entirely on your Mac using the MLX framework.
  • LoRA fine-tuning — Efficient adapter-based training. Fine-tune dedicated OCR models with as little as 8 GB unified memory.
  • Built-in evaluation — CER, WER, and exact match metrics computed automatically during training and on demand.
  • GRPO with OCR rewards — Train with character error rate-based reward functions to directly optimize transcription quality.
  • Batch transcription — Process entire document sets efficiently with batch_transcribe().

Supported OCR Models

Dedicated OCR models supported by FastOCRModel, plus general VLMs that can be fine-tuned for OCR via the VLM track.

Dedicated OCR Models

Model           Parameters  Type       Quantization
DeepSeek-OCR    0.9B        Dedicated  4-bit, bf16
DeepSeek-OCR-2  1B          Dedicated  4-bit, bf16
GLM-OCR         0.9B        Dedicated  4-bit, bf16
DOTS-OCR        varies      Dedicated  4-bit, bf16

Derived OCR Models

Model          Parameters  Type         Quantization
olmOCR-2       7B          VLM-derived  4-bit, 8-bit
LightOnOCR-1B  1B          VLM-derived  4-bit, bf16

General VLMs for OCR

Model       Parameters  Type         Quantization
Qwen3.5     0.8B–32B    General VLM  4-bit, 8-bit, bf16
Qwen2.5-VL  3B–72B      General VLM  4-bit, 8-bit, bf16
Pixtral     12B         General VLM  4-bit, 8-bit

Which track to use?

Use FastOCRModel for dedicated and derived OCR models listed above. For general VLMs (Qwen3.5, Qwen2.5-VL, Pixtral), use FastVisionModel from the VLM track — they share the same training pipeline.

FastOCRModel

mlx_tune.ocr

FastOCRModel.from_pretrained()

FastOCRModel.from_pretrained(model_name, max_seq_length=4096, load_in_4bit=False, ...) → Tuple[OCRModelWrapper, Processor]

Load a dedicated OCR model from HuggingFace. Returns a model wrapper and a processor that handles both image preprocessing and text tokenization.

Parameter       Type  Description
model_name      str   HuggingFace model ID (e.g., "deepseek-ai/DeepSeek-OCR") or local path
max_seq_length  int   Maximum sequence length. OCR outputs can be long; 4096 is recommended
load_in_4bit    bool  Load model with 4-bit quantization (reduces memory by ~75%)

FastOCRModel.get_peft_model()

FastOCRModel.get_peft_model(model, finetune_vision_layers=False, finetune_language_layers=True, r=16, lora_alpha=16, ...) → OCRModelWrapper

Add LoRA adapters to the OCR model. By default, only the language layers are fine-tuned — the vision encoder is kept frozen since dedicated OCR models already have strong visual features.

Parameter                 Type  Default  Description
finetune_vision_layers    bool  False    Apply LoRA to vision encoder. Usually not needed for dedicated OCR models
finetune_language_layers  bool  True     Apply LoRA to language model layers
r                         int   16       LoRA rank
lora_alpha                int   16       LoRA scaling factor. Recommended: equal to r

Why vision_layers=False by default?

Dedicated OCR models are pre-trained with vision encoders specifically optimized for document images. Fine-tuning only the language layers is usually sufficient and significantly faster. Enable finetune_vision_layers=True if you are working with unusual image formats (e.g., handwritten notes, degraded scans).

FastOCRModel.for_training()

FastOCRModel.for_training(model)

Enable training mode. Required before starting any training loop.

FastOCRModel.for_inference()

FastOCRModel.for_inference(model)

Enable inference mode. Activates KV caching and disables dropout. Always call before generating or evaluating.

OCRModelWrapper

mlx_tune.ocr

Wrapper returned by FastOCRModel.from_pretrained(). Provides OCR-specific methods for transcription, batch processing, evaluation, and model persistence.

model.transcribe()

model.transcribe(image, prompt="Transcribe this image.", max_tokens=2048, temperature=0.0) → str

Transcribe text from a single image. Returns the extracted text as a string.

Parameter    Type              Description
image        str or PIL.Image  Path to an image file, or a PIL Image object
prompt       str               Instruction prompt for the OCR model
max_tokens   int               Maximum number of tokens to generate
temperature  float             Sampling temperature (0.0 = greedy, recommended for OCR)

model.batch_transcribe()

model.batch_transcribe(images, prompt="Transcribe this image.", max_tokens=2048) → list[str]

Transcribe text from a batch of images. Processes images sequentially and returns a list of transcription strings.

Parameter   Type                          Description
images      list[str] or list[PIL.Image]  List of image paths or PIL Image objects
prompt      str                           Instruction prompt applied to all images
max_tokens  int                           Maximum tokens per transcription

model.evaluate()

model.evaluate(images, references, prompt="Transcribe this image.") → dict

Evaluate model transcription quality against ground truth references. Returns a dictionary with CER, WER, and exact match scores.

Parameter   Type                          Description
images      list[str] or list[PIL.Image]  Images to transcribe
references  list[str]                     Ground truth text for each image
prompt      str                           Instruction prompt

Returns:

{
    "cer": 0.032,          # Character Error Rate (lower is better)
    "wer": 0.087,          # Word Error Rate (lower is better)
    "exact_match": 0.85,   # Fraction of perfect transcriptions
    "num_samples": 100,
}

model.save_pretrained()

model.save_pretrained(output_dir)

Save LoRA adapters to disk. Writes adapters.safetensors, adapter_config.json, and config.json.

model.load_adapter()

model.load_adapter(adapter_path)

Load previously saved LoRA adapters into the model.

model.save_pretrained_merged()

model.save_pretrained_merged(output_dir, processor)

Fuse LoRA weights into the base model and save the full merged model.

Training

mlx_tune.ocr

OCRSFTTrainer

OCRSFTTrainer(model, tokenizer, data_collator, train_dataset, args=None)

Supervised fine-tuning trainer for OCR models. Handles forward pass, loss computation, and gradient updates with OCR-optimized defaults.

Parameter      Type             Description
model          OCRModelWrapper  Model with LoRA adapters configured
tokenizer      Processor        Processor from FastOCRModel.from_pretrained()
data_collator  OCRDataCollator  Data collator for image/text batching
train_dataset  Dataset          HuggingFace dataset with image and text fields
args           OCRSFTConfig     Training configuration

trainer.train() — Start training. Returns training statistics.

OCRSFTConfig

OCRSFTConfig(learning_rate=5e-5, max_length=4096, max_steps=100, per_device_train_batch_size=1, output_dir="./ocr_outputs", train_on_completions=True, gradient_accumulation_steps=4, logging_steps=5, save_steps=50, ...)

Training configuration for OCR fine-tuning.

Parameter                    Default          Description
learning_rate                5e-5             Peak learning rate. Lower than the VLM default since OCR models converge quickly
max_length                   4096             Maximum sequence length for OCR outputs (long documents need high values)
max_steps                    100              Total training steps
per_device_train_batch_size  1                Batch size (forced to 1, same as VLM)
output_dir                   "./ocr_outputs"  Directory for checkpoints and logs
train_on_completions         True             Compute loss only on transcription tokens, not the prompt
gradient_accumulation_steps  4                Steps to accumulate gradients before updating
logging_steps                5                Log metrics every N steps
save_steps                   50               Save checkpoint every N steps

OCRDataCollator

OCRDataCollator(model, processor)

Data collator for OCR tasks. Handles image preprocessing, prompt formatting, and token preparation for training.

SFT usage example

from mlx_tune.ocr import FastOCRModel, OCRSFTTrainer, OCRSFTConfig, OCRDataCollator
from datasets import load_dataset

# Load a dedicated OCR model
model, processor = FastOCRModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    max_seq_length=4096,
)

# Add LoRA (language layers only by default)
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)

# Load your OCR dataset (image + text pairs)
dataset = load_dataset("your-ocr-dataset", split="train[:200]")

# Train
FastOCRModel.for_training(model)
collator = OCRDataCollator(model, processor)
trainer = OCRSFTTrainer(
    model=model, tokenizer=processor,
    data_collator=collator, train_dataset=dataset,
    args=OCRSFTConfig(output_dir="./ocr_output", max_steps=60),
)
trainer.train()

# Evaluate
FastOCRModel.for_inference(model)
metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}")

OCRGRPOTrainer

OCRGRPOTrainer(model, train_dataset, processor, reward_fn, args=None)

Group Relative Policy Optimization (GRPO) trainer for OCR models. Generates multiple transcriptions per image, scores them with character error rate-based reward functions, and updates the model to favor more accurate transcriptions.

Parameter      Type             Description
model          OCRModelWrapper  Model with LoRA adapters from FastOCRModel.get_peft_model()
train_dataset  list[dict]       List of dicts with image (path or PIL Image) and text (ground truth) keys
processor      Processor        Processor from FastOCRModel.from_pretrained()
reward_fn      Callable         Function (transcription, ground_truth) → float that scores each transcription
args           OCRGRPOConfig    Training configuration

trainer.train() — Start GRPO training. Returns training statistics.

OCRGRPOConfig

OCRGRPOConfig(beta=0.04, num_generations=4, temperature=0.7, max_completion_length=2048, output_dir="./ocr_grpo_outputs", learning_rate=1e-6, max_steps=-1, logging_steps=1, save_steps=100)

Configuration for OCR GRPO training.

Parameter              Default               Description
beta                   0.04                  KL penalty coefficient
num_generations        4                     Transcriptions generated per image for advantage estimation
temperature            0.7                   Sampling temperature for generation
max_completion_length  2048                  Maximum tokens per transcription
output_dir             "./ocr_grpo_outputs"  Directory for checkpoints
learning_rate          1e-6                  Learning rate (lower than SFT)
max_steps              -1                    -1 trains for one full epoch
logging_steps          1                     Log every N steps
save_steps             100                   Save checkpoint every N steps

How OCR GRPO works

For each document image, the trainer generates num_generations transcription attempts at non-zero temperature. Each transcription is scored by the reward_fn (typically based on CER against ground truth). The model is then updated via policy gradients to favor more accurate transcriptions while a KL penalty keeps it close to the reference policy.
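The group-relative scoring step can be sketched in a few lines. This is an illustrative sketch of the idea, not mlx-tune's internal code, and the helper name group_relative_advantages is hypothetical: each generation's reward (e.g. from cer_reward) is compared against its own group's mean.

```python
import statistics

def group_relative_advantages(rewards):
    """Score each transcription attempt relative to its group.

    rewards: per-generation rewards for one image (e.g. 1 - CER).
    Returns mean-centered, std-normalized advantages; a group where
    every attempt scores the same yields all-zero advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four attempts at one receipt: the most accurate transcription gets
# the largest positive advantage, the worst the most negative.
adv = group_relative_advantages([0.95, 0.80, 0.80, 0.45])
```

Transcriptions with positive advantage are reinforced; the KL penalty (beta) keeps the updated policy from drifting far from the reference model.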

GRPO usage example

from mlx_tune.ocr import FastOCRModel, OCRGRPOTrainer, OCRGRPOConfig, cer_reward

# Load and configure model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model = FastOCRModel.get_peft_model(model, r=16, lora_alpha=16)

# Prepare dataset: list of dicts with image and ground truth text
ocr_data = [
    {"image": "receipt_001.png", "text": "Total: $42.50\nTax: $3.19"},
    {"image": "receipt_002.png", "text": "Subtotal: $18.00\nTip: $3.60"},
    # ...
]

# Train with CER-based rewards
FastOCRModel.for_training(model)
trainer = OCRGRPOTrainer(
    model=model,
    train_dataset=ocr_data,
    processor=processor,
    reward_fn=cer_reward,
    args=OCRGRPOConfig(num_generations=4, max_steps=20),
)
result = trainer.train()

Evaluation Metrics

mlx-tune provides three standard OCR evaluation metrics, computed by model.evaluate() or available as standalone functions.

Character Error Rate (CER)

The edit distance between predicted and reference text, normalized by reference length. The primary metric for OCR quality.

CER = edit_distance(prediction, reference) / len(reference)

# Example:
# prediction: "Helo World"
# reference:  "Hello World"
# edit_distance = 1 (missing 'l')
# CER = 1 / 11 = 0.091

Range: 0.0 (perfect) to 1.0+ (CER can exceed 1.0 if the prediction is much longer than the reference).
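For reference, the formula above can be computed with the classic Levenshtein dynamic program. This is an illustrative sketch, not mlx-tune's internal metric code, which may batch or normalize differently:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate; can exceed 1.0 for overlong predictions."""
    return edit_distance(prediction, reference) / len(reference)

print(round(cer("Helo World", "Hello World"), 3))  # 0.091
```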

Word Error Rate (WER)

The edit distance computed at the word level. More interpretable for natural language but less granular than CER.

WER = edit_distance(prediction_words, reference_words) / len(reference_words)

# Example:
# prediction: "Hello Wrold"
# reference:  "Hello World"
# word edit_distance = 1 ("Wrold" != "World")
# WER = 1 / 2 = 0.50
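WER is the same dynamic program applied to word sequences instead of characters. A sketch, assuming simple whitespace tokenization (mlx-tune's exact tokenization may differ):

```python
def word_edit_distance(pred_words, ref_words):
    """Levenshtein distance over word sequences."""
    prev = list(range(len(ref_words) + 1))
    for i, pw in enumerate(pred_words, 1):
        curr = [i]
        for j, rw in enumerate(ref_words, 1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (pw != rw)))  # substitution
        prev = curr
    return prev[-1]

def wer(prediction: str, reference: str) -> float:
    """Word Error Rate over whitespace-split tokens."""
    ref_words = reference.split()
    return word_edit_distance(prediction.split(), ref_words) / len(ref_words)

print(wer("Hello Wrold", "Hello World"))  # 0.5
```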

Exact Match

The fraction of samples where the prediction matches the reference exactly (after whitespace normalization).

exact_match = sum(pred.strip() == ref.strip() for pred, ref in pairs) / len(pairs)

# Example with 4 samples:
# 3 exact matches out of 4
# exact_match = 0.75

Reward Functions

Built-in reward functions for OCR GRPO training. These can be used directly with OCRGRPOTrainer or combined into custom reward functions.

cer_reward

cer_reward(transcription, ground_truth) → float

Returns 1.0 - CER, clamped to [0.0, 1.0]. A perfect transcription scores 1.0; a completely wrong one scores 0.0.

from mlx_tune.ocr import cer_reward

score = cer_reward("Hello World", "Hello World")  # 1.0
score = cer_reward("Helo World", "Hello World")   # 0.909
score = cer_reward("xyz", "Hello World")          # 0.0 (CER is 1.0, so the reward clamps to 0.0)

exact_match_reward

exact_match_reward(transcription, ground_truth) → float

Returns 1.0 if the transcription exactly matches the ground truth (after whitespace normalization), 0.0 otherwise. A strict binary reward.

from mlx_tune.ocr import exact_match_reward

score = exact_match_reward("Hello World", "Hello World")   # 1.0
score = exact_match_reward("Hello World ", "Hello World")  # 1.0 (whitespace normalized)
score = exact_match_reward("Helo World", "Hello World")    # 0.0

combined_ocr_reward

combined_ocr_reward(transcription, ground_truth, cer_weight=0.7, exact_weight=0.3) → float

Weighted combination of CER reward and exact match reward. Balances continuous improvement (CER) with the incentive for perfect transcriptions (exact match).

from mlx_tune.ocr import combined_ocr_reward

# Combine CER (70%) and exact match (30%)
score = combined_ocr_reward("Hello World", "Hello World")   # 1.0
score = combined_ocr_reward("Helo World", "Hello World")     # 0.636

# Custom weights
score = combined_ocr_reward("Helo World", "Hello World",
                            cer_weight=0.5, exact_weight=0.5)  # 0.455

Custom reward function

You can write your own reward function for domain-specific OCR tasks.

import re

from mlx_tune.ocr import cer_reward

def receipt_reward(transcription, ground_truth):
    """Reward that prioritizes getting dollar amounts correct."""
    # Extract dollar amounts from both strings
    pred_amounts = set(re.findall(r'\$[\d.]+', transcription))
    true_amounts = set(re.findall(r'\$[\d.]+', ground_truth))

    if not true_amounts:
        return cer_reward(transcription, ground_truth)

    # 60% weight on amount accuracy, 40% on overall CER
    amount_score = len(pred_amounts & true_amounts) / len(true_amounts)
    cer_score = cer_reward(transcription, ground_truth)  # already 1 - CER, clamped to [0, 1]
    return 0.6 * amount_score + 0.4 * cer_score

Save, load, and merge

After training, save adapters for later use, reload them, or merge LoRA weights into the base model.

Save adapters

# Save LoRA adapters only (small files)
model.save_pretrained("./ocr_adapters")

Load adapters

# Load adapters into a fresh model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")

Merge LoRA into base model

# Fuse LoRA weights and save full model
model.save_pretrained_merged("./ocr_merged", processor)

Evaluate after loading

# Evaluate a saved model
model, processor = FastOCRModel.from_pretrained("deepseek-ai/DeepSeek-OCR")
model.load_adapter("./ocr_adapters")
FastOCRModel.for_inference(model)

metrics = model.evaluate(test_images, test_references)
print(f"CER: {metrics['cer']:.3f}, WER: {metrics['wer']:.3f}, "
      f"Exact Match: {metrics['exact_match']:.1%}")

Working examples

Complete scripts you can run directly.

33 — DeepSeek-OCR Fine-Tuning

Fine-tune DeepSeek-OCR on document images with LoRA. Includes evaluation with CER/WER metrics.

OCRSFTDeepSeek

View source →

34 — GLM-OCR Fine-Tuning

Fine-tune GLM-OCR for specialized document types. Receipt and invoice transcription with custom prompts.

OCRSFTGLM

View source →

35 — OCR GRPO Training

Train an OCR model with GRPO using CER-based rewards. Directly optimizes transcription accuracy via reinforcement learning.

OCRGRPORL

View source →

36 — olmOCR-2 Fine-Tuning

Fine-tune the 7B olmOCR-2 model for complex document understanding. Tables, forms, and multi-column layouts.

OCRSFT7B

View source →

37 — OCR Evaluation Pipeline

Complete evaluation pipeline: load a fine-tuned model, run batch transcription, compute CER/WER/exact match, and compare against baseline.

OCREvaluationMetrics

View source →

Best practices

Use greedy decoding for OCR

Set temperature=0.0 for transcription and evaluation. Non-zero temperature introduces randomness that reduces OCR accuracy. Only use non-zero temperature during GRPO training where diversity is needed for advantage estimation.

Increase max_length for long documents

The default max_length=4096 is sufficient for most single-page documents. For multi-page or dense documents, increase it to 8192 or higher, and watch your evaluation metrics for signs of truncation.

Dedicated vs. general VLMs

Dedicated OCR models (DeepSeek-OCR, GLM-OCR) are smaller and faster but focused on text extraction. General VLMs (Qwen3.5, Qwen2.5-VL) are larger but can handle complex tasks like table understanding, form field extraction, and document reasoning. Choose based on your task complexity.

Image preprocessing

OCR models work best with clean, well-lit images. For degraded scans, consider preprocessing with contrast enhancement or deskewing before training and inference. The processor handles resizing automatically.

Memory requirements

Dedicated OCR models (0.9B–1B) run comfortably on 8 GB unified memory. The 7B olmOCR-2 in 4-bit quantization needs 16 GB+. General VLMs follow the same memory guidelines as the VLM track.
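Those figures follow from a back-of-envelope estimate of weight memory alone (parameter count times bits per weight, divided by 8). It ignores activations, KV cache, and optimizer state, so treat it as a lower bound:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# A 0.9B dedicated OCR model: bf16 vs 4-bit weights
print(weight_memory_gb(0.9e9, 16))  # 1.8 GB
print(weight_memory_gb(0.9e9, 4))   # 0.45 GB, ~75% smaller

# 7B olmOCR-2 in 4-bit: ~3.5 GB of weights before activations and
# KV cache, hence the 16 GB+ guideline above.
print(weight_memory_gb(7e9, 4))     # 3.5 GB
```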

Batch size must be 1

Like VLM training, OCR training is forced to batch_size=1 since each document image produces a different number of vision tokens. Use gradient_accumulation_steps to simulate larger effective batch sizes.