Unsloth → MLX-Tune Translation Guide

Everything you learned from Unsloth tutorials works on your Mac. This page shows exactly what to change—and what stays the same.

Three steps to run on Mac

Take any Unsloth tutorial and make it run on Apple Silicon.

1. Pick an Unsloth Notebook

Find a tutorial in Unsloth’s docs or Colab notebooks. Any SFT, DPO, ORPO, or GRPO example will work.

2. Change Imports & Model

Swap the import lines and use mlx-community/ model names instead of unsloth/ models.

3. Run on Your Mac

Execute locally, iterate fast, then scale up to CUDA with the original Unsloth when ready.

What to change in your imports

Every import maps one-to-one. Replace the form on the left with the form on the right.

Unsloth / TRL → MLX-Tune

from unsloth import FastLanguageModel → from mlx_tune import FastLanguageModel
from trl import SFTTrainer, SFTConfig → from mlx_tune import SFTTrainer, SFTConfig
from trl import DPOTrainer, DPOConfig → from mlx_tune import DPOTrainer, DPOConfig
from trl import ORPOTrainer, ORPOConfig → from mlx_tune import ORPOTrainer, ORPOConfig
from trl import GRPOTrainer, GRPOConfig → from mlx_tune import GRPOTrainer, GRPOConfig
from unsloth import FastVisionModel → from mlx_tune import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator → from mlx_tune import UnslothVisionDataCollator
from unsloth import get_chat_template → from mlx_tune import get_chat_template
from unsloth import train_on_responses_only → from mlx_tune import train_on_responses_only
from unsloth import to_sharegpt → from mlx_tune import to_sharegpt
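Because the mapping is one-to-one, the swap can be scripted. A minimal sketch in plain Python (the `IMPORT_MAP` dict and `translate_imports` helper are illustrative conveniences, not part of either library):

```python
# Illustrative helper: rewrite Unsloth/TRL import lines to their
# MLX-Tune equivalents using the one-to-one mapping above.
IMPORT_MAP = {
    "from unsloth import FastLanguageModel": "from mlx_tune import FastLanguageModel",
    "from trl import SFTTrainer, SFTConfig": "from mlx_tune import SFTTrainer, SFTConfig",
    "from trl import DPOTrainer, DPOConfig": "from mlx_tune import DPOTrainer, DPOConfig",
    "from trl import ORPOTrainer, ORPOConfig": "from mlx_tune import ORPOTrainer, ORPOConfig",
    "from trl import GRPOTrainer, GRPOConfig": "from mlx_tune import GRPOTrainer, GRPOConfig",
    "from unsloth import FastVisionModel": "from mlx_tune import FastVisionModel",
    "from unsloth.trainer import UnslothVisionDataCollator": "from mlx_tune import UnslothVisionDataCollator",
    "from unsloth import get_chat_template": "from mlx_tune import get_chat_template",
    "from unsloth import train_on_responses_only": "from mlx_tune import train_on_responses_only",
    "from unsloth import to_sharegpt": "from mlx_tune import to_sharegpt",
}

def translate_imports(source: str) -> str:
    """Replace known Unsloth/TRL import lines; leave everything else alone.

    Imports are assumed to be top-level (leading whitespace is not preserved
    on replaced lines).
    """
    lines = [IMPORT_MAP.get(line.strip(), line) for line in source.splitlines()]
    return "\n".join(lines)
```

Running a notebook's source through `translate_imports` handles step 2's import half in one pass; the model name still needs swapping by hand.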

HuggingFace model mapping

Unsloth uses models from the unsloth/ org on HuggingFace. MLX-Tune uses pre-converted models from the mlx-community/ org, which are optimized for Apple’s MLX framework.

Unsloth Model → MLX-Tune Equivalent

unsloth/Meta-Llama-3.1-8B-bnb-4bit → mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
unsloth/Qwen2.5-7B-bnb-4bit → mlx-community/Qwen2.5-7B-Instruct-4bit
unsloth/gemma-2-9b-it-bnb-4bit → mlx-community/gemma-2-9b-it-4bit
unsloth/Phi-4-bnb-4bit → mlx-community/Phi-4-4bit
unsloth/mistral-7b-v0.3-bnb-4bit → mlx-community/Mistral-7B-Instruct-v0.3-4bit
Qwen/Qwen3.5-0.8B → mlx-community/Qwen3.5-0.8B-bf16
Tip

Find MLX models at huggingface.co/mlx-community. Most popular models are available in 4-bit and 8-bit quantizations.
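The lookup can be expressed as a small dict if you are converting several scripts. A sketch (the `MODEL_MAP` and `to_mlx_model` names are illustrative, covering only a subset of the table):

```python
# Illustrative lookup from unsloth/ model IDs to mlx-community/ equivalents.
# Extend with the rows from the table above as needed.
MODEL_MAP = {
    "unsloth/Qwen2.5-7B-bnb-4bit": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "unsloth/gemma-2-9b-it-bnb-4bit": "mlx-community/gemma-2-9b-it-4bit",
    "unsloth/Phi-4-bnb-4bit": "mlx-community/Phi-4-4bit",
}

def to_mlx_model(name: str) -> str:
    """Return the mlx-community equivalent, or the name unchanged if unmapped."""
    return MODEL_MAP.get(name, name)
```

Falling back to the input unchanged means already-converted mlx-community/ names pass through untouched.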

What changes in training config

Most parameters are identical. A few CUDA-specific options are either replaced or no longer needed.

Parameter                     Unsloth         MLX-Tune     Notes
per_device_train_batch_size   Same            Same
gradient_accumulation_steps   Same            Same
learning_rate                 Same            Same
max_steps                     Same            Same
optim                         "adamw_8bit"    "adam"       MLX uses standard Adam
fp16 / bf16                   True            Not needed   MLX handles precision automatically
device_map                    "auto"          Not needed   No device mapping on Apple Silicon
dataset_num_proc              2               Not needed   Single-process on Mac
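The table can be applied mechanically to a config expressed as a dict. An illustrative sketch (the `translate_config` helper and `CUDA_ONLY` set are not part of MLX-Tune):

```python
# Illustrative sketch: translate an Unsloth-style training config dict to
# MLX-Tune by renaming the optimizer and dropping CUDA-only options.
CUDA_ONLY = {"fp16", "bf16", "device_map", "dataset_num_proc"}

def translate_config(cfg: dict) -> dict:
    out = {k: v for k, v in cfg.items() if k not in CUDA_ONLY}
    if out.get("optim") == "adamw_8bit":
        out["optim"] = "adam"  # MLX uses standard Adam
    return out
```

Everything the function does not touch (batch size, accumulation steps, learning rate, max steps) passes through unchanged, matching the table.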

Actual Unsloth notebook → MLX-Tune

Here’s Unsloth’s Qwen3.5 Vision notebook converted to MLX-Tune, shown change by change. Only a handful of lines differ.

qwen35_vision_finetuning.py — Unsloth → MLX-Tune changes

@@ Imports @@
- from unsloth import FastVisionModel
- from unsloth.trainer import UnslothVisionDataCollator
- from trl import SFTTrainer, SFTConfig
- import torch
+ from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
+ from mlx_tune.vlm import VLMSFTConfig

@@ Model Loading @@
  model, tokenizer = FastVisionModel.from_pretrained(
-     "unsloth/Qwen3.5-0.8B",
+     "mlx-community/Qwen3.5-0.8B-bf16",
      load_in_4bit = False,
      use_gradient_checkpointing = "unsloth",
  )

@@ LoRA Setup — identical! @@
  model = FastVisionModel.get_peft_model(
      model,
      finetune_vision_layers = True,
      finetune_language_layers = True,
      r = 16, lora_alpha = 16,
  )

@@ Dataset — identical! @@
  dataset = load_dataset("unsloth/LaTeX_OCR", split="train")
  converted_dataset = [convert_to_conversation(s) for s in dataset]

@@ Training @@
  FastVisionModel.for_training(model)
- trainer = SFTTrainer(
+ trainer = VLMSFTTrainer(
      model = model,
      tokenizer = tokenizer,
      data_collator = UnslothVisionDataCollator(model, tokenizer),
      train_dataset = converted_dataset,
-     args = SFTConfig(
+     args = VLMSFTConfig(
-         per_device_train_batch_size = 2,
+         per_device_train_batch_size = 1,  # Must be 1 for VLM
          gradient_accumulation_steps = 4,
          warmup_steps = 5,
          max_steps = 30,
          learning_rate = 2e-4,
-         optim = "adamw_8bit",
+         optim = "adam",
          weight_decay = 0.001,
          lr_scheduler_type = "linear",
          seed = 3407,
          output_dir = "outputs",
      ),
  )
  trainer.train()

@@ Saving — identical! @@
  model.save_pretrained("qwen_lora")

Summary

Out of ~40 lines of training code, only 8 lines change: imports (2), model name (1), trainer class (1), config class (1), batch size (1), optimizer (1), and removing torch import (1). Everything else — LoRA config, dataset prep, data collator, save — is identical.

Complete SFT script comparison

A full training script, side by side. The only differences are the imports, the model name, the optimizer, and the removed fp16/bf16 flags.

Unsloth (CUDA)
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj",
                     "v_proj", "o_proj"],
)

dataset = load_dataset(
    "yahma/alpaca-cleaned", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=100,
        optim="adamw_8bit",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
    ),
)
trainer.train()
model.save_pretrained("lora_model")
MLX-Tune (Apple Silicon)
from mlx_tune import FastLanguageModel
from mlx_tune import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj",
                     "v_proj", "o_proj"],
)

dataset = load_dataset(
    "yahma/alpaca-cleaned", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=100,
        optim="adam",
    ),
)
trainer.train()
model.save_pretrained("lora_model")

VLM fine-tuning translation

Vision-Language Model fine-tuning follows the same pattern, with a few MLX-Tune-specific additions.

Import changes

Unsloth (CUDA)
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
MLX-Tune (Apple Silicon)
from mlx_tune import FastVisionModel
from mlx_tune import UnslothVisionDataCollator
from mlx_tune.vlm import VLMSFTTrainer, VLMSFTConfig

Key differences for VLM

Aspect          Unsloth                      MLX-Tune
Trainer class   SFTTrainer                   VLMSFTTrainer
Config class    SFTConfig                    VLMSFTConfig
Batch size      Flexible                     Must be 1
Data collator   UnslothVisionDataCollator    UnslothVisionDataCollator
Important

VLM training in MLX-Tune must use per_device_train_batch_size=1. Images produce variable numbers of vision tokens, so batching is not supported. Use gradient_accumulation_steps to simulate larger effective batch sizes.
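With the per-device batch size pinned to 1, the effective batch size is just the accumulation count. A quick sanity check (the `effective_batch_size` helper is illustrative, not an MLX-Tune API):

```python
# Per-device batch size is pinned to 1 for VLM training in MLX-Tune, so the
# effective batch size is set entirely by gradient accumulation.
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    return per_device_train_batch_size * gradient_accumulation_steps

# e.g. to mimic an effective batch of 8 with the mandatory batch size of 1:
print(effective_batch_size(1, 8))  # 8
```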

Everything else is identical

The vast majority of your Unsloth code works without any changes.

  • get_peft_model() parameters — r, lora_alpha, target_modules, lora_dropout, and all other LoRA settings
  • Dataset formats — Alpaca, ShareGPT, and ChatML are auto-detected and converted
  • Chat templates — get_chat_template() with the same template names (llama-3, qwen2.5, gemma, phi-4, mistral, etc.)
  • Response-only training — train_on_responses_only() with the same instruction_part and response_part parameters
  • Save methods — save_pretrained(), save_pretrained_merged(), and save_pretrained_gguf()
  • LoRA configuration — adapter loading and saving is fully compatible
  • Dataset utilities — to_sharegpt(), apply_column_mapping(), and HFDatasetConfig
  • Inference — FastLanguageModel.for_inference(model) and streaming generation

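Dataset-format auto-detection can be pictured as a key check on a single example. This detector is an illustrative sketch of the idea, not MLX-Tune's actual implementation:

```python
def detect_format(example: dict) -> str:
    """Guess a dataset schema from one example's keys (illustrative only)."""
    if "conversations" in example:
        return "sharegpt"  # [{"from": ..., "value": ...}, ...]
    if "messages" in example:
        return "chatml"    # [{"role": ..., "content": ...}, ...]
    if {"instruction", "output"} <= example.keys():
        return "alpaca"    # instruction / input / output columns
    return "unknown"
```

Because detection keys off column names, the same Alpaca or ShareGPT dataset you used with Unsloth loads without any reshaping.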
The Bottom Line

If you can fine-tune with Unsloth, you can fine-tune with MLX-Tune. Change the imports, swap the model name, drop the CUDA-specific config—and you’re training on your Mac.