Performance

Training speed and memory you can expect, plus the knobs you can use when defaults don’t fit your workload.

What to expect on M4 Pro 48 GB

These numbers come from our own runs on an M4 Pro 48 GB Mac. They’re rough indicators, not benchmarks — treat them as “starting expectations.” Your numbers will move with model size, quantization, context length, batch size, dataset shape, and macOS version. All rows below use LoRA rank 8–16.

| Workload | Model | Context | Batch | Speed | Peak GB | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SFT | Qwen3-0.6B-bf16 | 1024 | 1 | ~2.2 it/s | 2.6 | 16 LoRA layers (q, v), no checkpointing |
| SFT | Qwen3-0.6B-bf16 | 4096 | 1 | ~0.28 it/s | 18 | full LoRA, no checkpointing |
| DPO | Qwen3.5-0.8B-MLX-4bit | 512 | 1 | ~540 ms/sample | ~1 | shares the prompt forward between chosen & rejected |
| DPO | Qwen3.5-0.8B-MLX-4bit | 512 | 2 | ~810 ms/sample | ~1.2 | batched (prompt-share applies only at batch size 1) |
| ORPO | Qwen3.5-0.8B-MLX-4bit | 512 | 1 | ~430 ms/sample | ~1 | shares the prompt forward between chosen & rejected |
| ORPO | Qwen3-0.6B-bf16 | 4096 | 1 | ~0.09 it/s | 9.2 | set use_gradient_checkpointing="unsloth" |
| ORPO | Qwen3-0.6B-bf16 | 2048 | 1 | ~0.41 it/s | 16.8 | no checkpointing needed |
| GRPO | Qwen3-0.6B-bf16 | 4096 | 1 | ~0.11 it/s | 5.3 | 4 generations × 256 tokens per step |
| GRPO | Qwen3.5-0.8B-4bit | ~128–512 | 1 | ~4.8 s/iter | 15.9 | 4 generations × 128 tokens per step |
| Embedding | MiniLM-L6-v2-bf16 (22M) | 128 | 32 | ~25 ms/step | <0.5 | InfoNCE contrastive |

How to read this

it/s counts full training steps per second, where each step includes forward, backward, optimizer, and per-step overhead. ms/sample divides wall time by the total number of samples processed across the run, which is the fair number when comparing batch sizes. Peak GB is the high-water mark over the whole run as reported by mx.get_peak_memory(), not the steady-state working set.
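To compare a row reported in it/s with one reported in ms/sample, the two units can be converted once the batch size is known. The sketch below is plain arithmetic and not part of the library; the helper names are made up for illustration, and it assumes one iteration processes exactly one batch (no gradient accumulation).

```python
def ms_per_sample(wall_seconds: float, total_samples: int) -> float:
    """Wall time divided by samples processed, in milliseconds."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return 1000.0 * wall_seconds / total_samples


def ms_per_sample_from_its(it_per_s: float, batch_size: int) -> float:
    """Convert iterations per second into milliseconds per sample.

    Assumes each iteration consumes exactly `batch_size` samples.
    """
    if it_per_s <= 0 or batch_size <= 0:
        raise ValueError("it_per_s and batch_size must be positive")
    return 1000.0 / (it_per_s * batch_size)


if __name__ == "__main__":
    # The 1024-context SFT row: ~2.2 it/s at batch size 1.
    print(round(ms_per_sample_from_its(2.2, 1)), "ms/sample")
```

Under that assumption, the 1024-context SFT row works out to roughly 455 ms/sample, which puts it on the same scale as the DPO and ORPO rows.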

Scaling down to smaller Macs

As a rule of thumb, halving the available memory means halving the context length you can comfortably train at, for the same model class. If you’re on 24 GB, the M4 Pro 48 GB rows above should run at roughly half the maximum context; on 16 GB, around a quarter. Below 16 GB, stick to 4-bit models in the 0.5–1B range at context length 1024 or shorter.
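The rule of thumb can be written down as a small lookup. This is only a sketch of the guidance in the paragraph above, not a guarantee that a run will fit; the function name and the tier table are illustrative, and memory sizes between tiers are rounded down to the nearest listed tier.

```python
# Fraction of the 48 GB reference context suggested for each memory tier,
# taken directly from the rule of thumb above.
CONTEXT_FRACTION_BY_MEMORY_GB = [(48, 1.0), (24, 0.5), (16, 0.25)]


def suggested_max_context(reference_context: int, memory_gb: int) -> int:
    """Scale a context length measured on 48 GB to a smaller machine."""
    if memory_gb < 16:
        # Below 16 GB the guidance is context length 1024 or shorter.
        return min(reference_context // 4, 1024)
    for tier_gb, fraction in CONTEXT_FRACTION_BY_MEMORY_GB:
        if memory_gb >= tier_gb:
            return int(reference_context * fraction)
    raise AssertionError("unreachable: tiers cover memory_gb >= 16")


if __name__ == "__main__":
    for gb in (48, 24, 16, 8):
        print(gb, "GB ->", suggested_max_context(4096, gb))
```

Treat the result as a starting point: model size and quantization still move the real limit.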

A note on DPO/ORPO at batch size 1

At per_device_train_batch_size=1, the trainer forwards the prompt once and reuses the result for both the chosen and rejected branches. The saving scales with how much of the sequence is prompt vs. response — the rows above are at roughly equal prompt and response length. With very short prompts the speedup is modest; with very long prompts (multi-turn or reasoning data), it gets larger.
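A rough way to see why the saving depends on the prompt/response split is to count the tokens that go through a forward pass. The sketch below is a simplified cost model, not the trainer's implementation: it assumes chosen and rejected responses have the same length and that cost is proportional to token count, which ignores the attention the response tokens still pay over the prompt. The function names are illustrative.

```python
def forwarded_tokens(prompt_len: int, response_len: int, share_prompt: bool) -> int:
    """Tokens pushed through the model for one chosen/rejected pair."""
    if share_prompt:
        # Prompt once, then each response branch separately.
        return prompt_len + 2 * response_len
    # Prompt is recomputed for both the chosen and the rejected sequence.
    return 2 * (prompt_len + response_len)


def prompt_share_saving(prompt_len: int, response_len: int) -> float:
    """Fraction of forwarded tokens avoided by sharing the prompt."""
    baseline = forwarded_tokens(prompt_len, response_len, share_prompt=False)
    shared = forwarded_tokens(prompt_len, response_len, share_prompt=True)
    return 1.0 - shared / baseline


if __name__ == "__main__":
    for prompt, response in [(32, 480), (256, 256), (480, 32)]:
        print(prompt, response, f"{prompt_share_saving(prompt, response):.1%}")
```

In this model the saving is 25% of forwarded tokens at equal prompt and response length and approaches 50% as the prompt dominates. Measured wall-time savings will differ.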

Performance knobs

Most things are already on by default. These are the levers worth knowing about when defaults don’t fit.

Gradient checkpointing

Trades roughly 2× backward time for about half the activation memory. Off by default because most short-context runs don’t need it. Turn it on when you hit OOM at long context or with a large model:

model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",  # opt in for ctx >= 4096 or 7B+
)

Every trainer now respects this flag. Previously only SFT did; for DPO, ORPO, KTO, SimPO, GRPO, CPT, and the audio/embedding/VLM trainers it was silently a no-op.
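To get a feel for the trade-off before turning the flag on, the "roughly 2× backward time for about half the activation memory" rule can be applied to your own measurements. The helper below is a back-of-the-envelope estimate built only from that rule; the name and the split into activation memory versus everything else are assumptions for illustration.

```python
def with_checkpointing(forward_s: float, backward_s: float,
                       activation_gb: float, other_gb: float) -> tuple[float, float]:
    """Estimate (step seconds, peak GB) after enabling checkpointing.

    Applies the rule of thumb: backward time doubles, activation memory
    halves. Weights, optimizer state, and other memory are left unchanged.
    """
    step_s = forward_s + 2.0 * backward_s
    peak_gb = other_gb + 0.5 * activation_gb
    return step_s, peak_gb


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the table above.
    print(with_checkpointing(forward_s=1.0, backward_s=2.0,
                             activation_gb=12.0, other_gb=4.0))
```

If the estimated peak still does not fit, shortening the context is the next lever to try.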

SFT validation cost

The default is light — 5 validation batches per evaluation pass, evaluated every max(save_steps, 200) steps. Set val_batches=0 to skip evaluation entirely on benchmarking or short runs:

SFTConfig(
    val_batches=5,        # default; set 0 to skip eval
    steps_per_eval=200,   # default = max(save_steps, 200)
)
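The evaluation overhead for a run follows from these two settings. The sketch below assumes an evaluation pass runs at every multiple of steps_per_eval and nowhere else, which is an assumption about scheduling rather than something the text confirms; the function name is illustrative.

```python
def eval_schedule(total_steps: int, save_steps: int,
                  val_batches: int = 5) -> tuple[int, int, int]:
    """Return (steps_per_eval, evaluation passes, total validation batches)."""
    steps_per_eval = max(save_steps, 200)  # documented default
    if val_batches == 0:
        # val_batches=0 skips evaluation entirely.
        return steps_per_eval, 0, 0
    passes = total_steps // steps_per_eval
    return steps_per_eval, passes, passes * val_batches


if __name__ == "__main__":
    print(eval_schedule(total_steps=1000, save_steps=100))
```

For a 1000-step run with save_steps=100, this gives 5 passes and 25 validation batches in total, which is small next to 1000 training steps.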

DPO reference model

DPO now uses a frozen reference policy by default: the chosen/rejected log-probabilities of the base model are computed once at the start and reused for the whole run. This matches the standard DPO formulation. If you want the older behaviour (using the policy itself with stop_gradient as a stand-in reference), opt out by setting precompute_ref_logprobs=False:

DPOConfig(
    precompute_ref_logprobs=True,   # default; False for legacy stop-grad reference
)
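For orientation, here is the textbook DPO loss for a single preference pair, with the reference log-probabilities passed in as plain numbers the way a precomputed reference would supply them. This is a sketch of the standard formula, not the trainer's code; the function name is illustrative and beta=0.1 is a common choice, not necessarily this library's default.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)) for one pair of sequence log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # softplus(-x) equals -log(sigmoid(x)); this form avoids overflow.
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)


if __name__ == "__main__":
    # Frozen reference: the policy has moved toward the chosen response.
    print(dpo_loss(-10.0, -14.0, ref_chosen=-12.0, ref_rejected=-12.0))
    # Legacy stand-in: reference values equal the policy's own values.
    print(dpo_loss(-10.0, -14.0, ref_chosen=-10.0, ref_rejected=-14.0))
```

With the legacy stand-in reference the margin is always zero, so the reported loss value sits at log 2 (about 0.693) on every step even though gradients still flow. With a frozen reference the loss value tracks how far the policy has moved from the base model.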

Environment variables

Two escape hatches for debugging or A/B testing:

| Variable | Effect |
| --- | --- |
| MLX_TUNE_DISABLE_COMPILE=1 | Disable @mx.compile globally and run trainers eagerly. Useful for narrowing down a compile-related issue or comparing eager vs. compiled wall time. |
| MLX_TUNE_BUCKET_SIZE=N | Override the default 64-token padding bucket used by collators. Set to 1 to effectively disable bucketing. |
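To illustrate what a padding bucket does, the sketch below rounds a sequence length up to the next multiple of the bucket size. It assumes round-up-to-multiple semantics, and it reads the variable in an illustrative way; the helper names are made up and this is not the library's collator code.

```python
import os


def bucket_size(default: int = 64) -> int:
    """Read MLX_TUNE_BUCKET_SIZE, falling back to the documented default."""
    raw = os.environ.get("MLX_TUNE_BUCKET_SIZE")
    return int(raw) if raw else default


def padded_length(seq_len: int, bucket: int) -> int:
    """Round seq_len up to the next multiple of bucket."""
    if seq_len < 0 or bucket < 1:
        raise ValueError("seq_len must be >= 0 and bucket must be >= 1")
    return -(-seq_len // bucket) * bucket  # ceiling division


if __name__ == "__main__":
    size = bucket_size()
    for n in (1, 100, 128, 129):
        print(n, "->", padded_length(n, size))
```

With a bucket of 1 every length maps to itself, which is why that setting effectively disables bucketing.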

What’s automatic

You don’t need to think about any of this — it’s already happening underneath. Mentioned here so you know what you’re getting: