Unsloth → MLX-Tune Translation Guide

Everything you learned from Unsloth tutorials works on your Mac. This page shows exactly what to change—and what stays the same.

Three steps to run on Mac

Take any Unsloth tutorial and make it run on Apple Silicon.

1. Pick an Unsloth Notebook

Find a tutorial in Unsloth’s docs or Colab notebooks. Any SFT, DPO, ORPO, or GRPO example will work.

2. Change Imports & Model

Swap the import lines and use mlx-community/ model names instead of unsloth/ models.

3. Run on Your Mac

Execute locally, iterate fast, then scale up to CUDA with the original Unsloth when ready.

What to change in your imports

Every import maps one-to-one. Replace the form on the left with the form on the right.

Unsloth / TRL → MLX-Tune

from unsloth import FastLanguageModel → from mlx_tune import FastLanguageModel
from trl import SFTTrainer, SFTConfig → from mlx_tune import SFTTrainer, SFTConfig
from trl import DPOTrainer, DPOConfig → from mlx_tune import DPOTrainer, DPOConfig
from trl import ORPOTrainer, ORPOConfig → from mlx_tune import ORPOTrainer, ORPOConfig
from trl import GRPOTrainer, GRPOConfig → from mlx_tune import GRPOTrainer, GRPOConfig
from unsloth import FastVisionModel → from mlx_tune import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator → from mlx_tune import UnslothVisionDataCollator
from unsloth import get_chat_template → from mlx_tune import get_chat_template
from unsloth import train_on_responses_only → from mlx_tune import train_on_responses_only
from unsloth import to_sharegpt → from mlx_tune import to_sharegpt
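Because the mapping is one-to-one, the swap can be scripted. A minimal sketch in plain Python (the `IMPORT_MAP` dict and `translate_imports` helper are illustrative conveniences, not part of either library):

```python
# Illustrative helper: rewrite Unsloth/TRL import lines to their
# MLX-Tune equivalents using the one-to-one mapping above.
IMPORT_MAP = {
    "from unsloth import FastLanguageModel": "from mlx_tune import FastLanguageModel",
    "from trl import SFTTrainer, SFTConfig": "from mlx_tune import SFTTrainer, SFTConfig",
    "from trl import DPOTrainer, DPOConfig": "from mlx_tune import DPOTrainer, DPOConfig",
    "from trl import ORPOTrainer, ORPOConfig": "from mlx_tune import ORPOTrainer, ORPOConfig",
    "from trl import GRPOTrainer, GRPOConfig": "from mlx_tune import GRPOTrainer, GRPOConfig",
    "from unsloth import FastVisionModel": "from mlx_tune import FastVisionModel",
    "from unsloth.trainer import UnslothVisionDataCollator": "from mlx_tune import UnslothVisionDataCollator",
    "from unsloth import get_chat_template": "from mlx_tune import get_chat_template",
    "from unsloth import train_on_responses_only": "from mlx_tune import train_on_responses_only",
    "from unsloth import to_sharegpt": "from mlx_tune import to_sharegpt",
}

def translate_imports(source: str) -> str:
    """Replace known Unsloth/TRL import lines; leave everything else alone.

    Imports are assumed to be top-level (leading whitespace is not preserved
    on replaced lines).
    """
    lines = [IMPORT_MAP.get(line.strip(), line) for line in source.splitlines()]
    return "\n".join(lines)
```

Running a notebook's source through `translate_imports` handles step 2's import half in one pass; the model name still needs swapping by hand.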

HuggingFace model mapping

Unsloth uses models from the unsloth/ org on HuggingFace. MLX-Tune uses pre-converted models from the mlx-community/ org, which are optimized for Apple’s MLX framework.

Unsloth Model → MLX-Tune Equivalent

unsloth/Meta-Llama-3.1-8B-bnb-4bit → mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
unsloth/Qwen2.5-7B-bnb-4bit → mlx-community/Qwen2.5-7B-Instruct-4bit
unsloth/gemma-2-9b-it-bnb-4bit → mlx-community/gemma-2-9b-it-4bit
unsloth/Phi-4-bnb-4bit → mlx-community/Phi-4-4bit
unsloth/mistral-7b-v0.3-bnb-4bit → mlx-community/Mistral-7B-Instruct-v0.3-4bit
Qwen/Qwen3.5-0.8B → mlx-community/Qwen3.5-0.8B-bf16
Tip

Find MLX models at huggingface.co/mlx-community. Most popular models are available in 4-bit and 8-bit quantizations.
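The lookup can be expressed as a small dict if you are converting several scripts. A sketch (the `MODEL_MAP` and `to_mlx_model` names are illustrative, covering only a subset of the table):

```python
# Illustrative lookup from unsloth/ model IDs to mlx-community/ equivalents.
# Extend with the rows from the table above as needed.
MODEL_MAP = {
    "unsloth/Qwen2.5-7B-bnb-4bit": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "unsloth/gemma-2-9b-it-bnb-4bit": "mlx-community/gemma-2-9b-it-4bit",
    "unsloth/Phi-4-bnb-4bit": "mlx-community/Phi-4-4bit",
}

def to_mlx_model(name: str) -> str:
    """Return the mlx-community equivalent, or the name unchanged if unmapped."""
    return MODEL_MAP.get(name, name)
```

Falling back to the input unchanged means already-converted mlx-community/ names pass through untouched.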

What changes in training config

Most parameters are identical. A few CUDA-specific options are either replaced or no longer needed.

Parameter                     Unsloth         MLX-Tune     Notes
per_device_train_batch_size   Same            Same
gradient_accumulation_steps   Same            Same
learning_rate                 Same            Same
max_steps                     Same            Same
optim                         "adamw_8bit"    "adam"       MLX uses standard Adam
fp16 / bf16                   True            Not needed   MLX handles precision automatically
device_map                    "auto"          Not needed   No device mapping on Apple Silicon
dataset_num_proc              2               Not needed   Single-process on Mac
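The table can be applied mechanically to a config expressed as a dict. An illustrative sketch (the `translate_config` helper and `CUDA_ONLY` set are not part of MLX-Tune):

```python
# Illustrative sketch: translate an Unsloth-style training config dict to
# MLX-Tune by renaming the optimizer and dropping CUDA-only options.
CUDA_ONLY = {"fp16", "bf16", "device_map", "dataset_num_proc"}

def translate_config(cfg: dict) -> dict:
    out = {k: v for k, v in cfg.items() if k not in CUDA_ONLY}
    if out.get("optim") == "adamw_8bit":
        out["optim"] = "adam"  # MLX uses standard Adam
    return out
```

Everything the function does not touch (batch size, accumulation steps, learning rate, max steps) passes through unchanged, matching the table.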

Actual Unsloth notebook → MLX-Tune

Here’s Unsloth’s Qwen3.5 Vision notebook converted to MLX-Tune, shown change by change. Only a handful of lines differ.

qwen35_vision_finetuning.py — Unsloth → MLX-Tune changes

@@ Imports @@
- from unsloth import FastVisionModel
- from unsloth.trainer import UnslothVisionDataCollator
- from trl import SFTTrainer, SFTConfig
- import torch
+ from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
+ from mlx_tune.vlm import VLMSFTConfig

@@ Model Loading @@
  model, tokenizer = FastVisionModel.from_pretrained(
-     "unsloth/Qwen3.5-0.8B",
+     "mlx-community/Qwen3.5-0.8B-bf16",
      load_in_4bit = False,
      use_gradient_checkpointing = "unsloth",
  )

@@ LoRA Setup — identical! @@
  model = FastVisionModel.get_peft_model(
      model,
      finetune_vision_layers = True,
      finetune_language_layers = True,
      r = 16, lora_alpha = 16,
  )

@@ Dataset — identical! @@
  dataset = load_dataset("unsloth/LaTeX_OCR", split="train")
  converted_dataset = [convert_to_conversation(s) for s in dataset]

@@ Training @@
  FastVisionModel.for_training(model)
- trainer = SFTTrainer(
+ trainer = VLMSFTTrainer(
      model = model,
      tokenizer = tokenizer,
      data_collator = UnslothVisionDataCollator(model, tokenizer),
      train_dataset = converted_dataset,
-     args = SFTConfig(
+     args = VLMSFTConfig(
-         per_device_train_batch_size = 2,
+         per_device_train_batch_size = 1,  # Must be 1 for VLM
          gradient_accumulation_steps = 4,
          warmup_steps = 5,
          max_steps = 30,
          learning_rate = 2e-4,
-         optim = "adamw_8bit",
+         optim = "adam",
          weight_decay = 0.001,
          lr_scheduler_type = "linear",
          seed = 3407,
          output_dir = "outputs",
      ),
  )
  trainer.train()

@@ Saving — identical! @@
  model.save_pretrained("qwen_lora")

Summary

Out of ~40 lines of training code, only 8 lines change: imports (2), model name (1), trainer class (1), config class (1), batch size (1), optimizer (1), and removing torch import (1). Everything else — LoRA config, dataset prep, data collator, save — is identical.

Complete SFT script comparison

A full training script, side by side. The only differences are the imports, the model name, the optimizer, and the removed fp16/bf16 flags.

Unsloth (CUDA)
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj",
                     "v_proj", "o_proj"],
)

dataset = load_dataset(
    "yahma/alpaca-cleaned", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=100,
        optim="adamw_8bit",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
    ),
)
trainer.train()
model.save_pretrained("lora_model")
MLX-Tune (Apple Silicon)
from mlx_tune import FastLanguageModel
from mlx_tune import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj",
                     "v_proj", "o_proj"],
)

dataset = load_dataset(
    "yahma/alpaca-cleaned", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=100,
        optim="adam",
    ),
)
trainer.train()
model.save_pretrained("lora_model")

VLM fine-tuning translation

Vision-Language Model fine-tuning follows the same pattern, with a few MLX-Tune-specific additions.

Import changes

Unsloth (CUDA)
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
MLX-Tune (Apple Silicon)
from mlx_tune import FastVisionModel
from mlx_tune import UnslothVisionDataCollator
from mlx_tune.vlm import VLMSFTTrainer, VLMSFTConfig

Key differences for VLM

Aspect          Unsloth                      MLX-Tune
Trainer class   SFTTrainer                   VLMSFTTrainer
Config class    SFTConfig                    VLMSFTConfig
Batch size      Flexible                     Must be 1
Data collator   UnslothVisionDataCollator    UnslothVisionDataCollator
Important

VLM training in MLX-Tune must use per_device_train_batch_size=1. Images produce variable numbers of vision tokens, so batching is not supported. Use gradient_accumulation_steps to simulate larger effective batch sizes.
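With the per-device batch size pinned to 1, the effective batch size is just the accumulation count. A quick sanity check (the `effective_batch_size` helper is illustrative, not an MLX-Tune API):

```python
# Per-device batch size is pinned to 1 for VLM training in MLX-Tune, so the
# effective batch size is set entirely by gradient accumulation.
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    return per_device_train_batch_size * gradient_accumulation_steps

# e.g. to mimic an effective batch of 8 with the mandatory batch size of 1:
print(effective_batch_size(1, 8))  # 8
```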

Everything else is identical

The vast majority of your Unsloth code works without any changes.

  • get_peft_model() parameters — r, lora_alpha, target_modules, lora_dropout, and all other LoRA settings
  • Dataset formats — Alpaca, ShareGPT, and ChatML are auto-detected and converted
  • Chat templates — get_chat_template() with the same template names (llama-3, qwen2.5, gemma, phi-4, mistral, etc.)
  • Response-only training — train_on_responses_only() with the same instruction_part and response_part parameters
  • Save methods — save_pretrained(), save_pretrained_merged(), and save_pretrained_gguf()
  • LoRA configuration — adapter loading and saving is fully compatible
  • Dataset utilities — to_sharegpt(), apply_column_mapping(), and HFDatasetConfig
  • Inference — FastLanguageModel.for_inference(model) and streaming generation

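Dataset-format auto-detection can be pictured as a key check on a single example. This detector is an illustrative sketch of the idea, not MLX-Tune's actual implementation:

```python
def detect_format(example: dict) -> str:
    """Guess a dataset schema from one example's keys (illustrative only)."""
    if "conversations" in example:
        return "sharegpt"  # [{"from": ..., "value": ...}, ...]
    if "messages" in example:
        return "chatml"    # [{"role": ..., "content": ...}, ...]
    if {"instruction", "output"} <= example.keys():
        return "alpaca"    # instruction / input / output columns
    return "unknown"
```

Because detection keys off column names, the same Alpaca or ShareGPT dataset you used with Unsloth loads without any reshaping.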
The Bottom Line

If you can fine-tune with Unsloth, you can fine-tune with MLX-Tune. Change the imports, swap the model name, drop the CUDA-specific config—and you’re training on your Mac.