LLM Fine-Tuning
Fine-tune language models on Apple Silicon with LoRA. SFT, DPO, ORPO, GRPO, KTO, SimPO — all natively on MLX.
Full SFT pipeline in 20 lines
Load a model, add LoRA, train, and save — all with the same API you already know from Unsloth.
```python
from mlx_tune import FastLanguageModel, SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Llama-3.2-1B-Instruct-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")

trainer = SFTTrainer(
    model=model, train_dataset=dataset, tokenizer=tokenizer,
    args=SFTConfig(output_dir="outputs", max_steps=50, learning_rate=2e-4),
)
trainer.train()
model.save_pretrained("lora_model")
```
FastLanguageModel
`mlx_tune.model`
Main entry point for loading and configuring language models. Mirrors Unsloth’s FastLanguageModel API.
FastLanguageModel.from_pretrained()
Load a pretrained language model from HuggingFace or a local path.
| Parameter | Type | Description |
|---|---|---|
model_name | str | HuggingFace model ID (e.g., "mlx-community/Llama-3.2-1B-Instruct-4bit") or local path |
max_seq_length | int | Maximum sequence length for training and inference |
load_in_4bit | bool | Load model with 4-bit quantization (QLoRA) |
load_in_8bit | bool | Load model with 8-bit quantization |
use_gradient_checkpointing | str \| bool | Gradient checkpointing mode (accepted for Unsloth compatibility) |
FastLanguageModel.get_peft_model()
Add LoRA adapters to the model for parameter-efficient fine-tuning.
| Parameter | Type | Description |
|---|---|---|
model | MLXModelWrapper | Model returned by from_pretrained() |
r | int | LoRA rank (higher = more parameters, better quality) |
target_modules | list[str] | Modules to apply LoRA to. Default: ["q_proj", "k_proj", "v_proj", "o_proj"] |
lora_alpha | int | LoRA scaling factor. Recommended: equal to r |
lora_dropout | float | Dropout for LoRA layers |
bias | str | Bias mode: "none", "all", or "lora_only" |
use_rslora | bool | Use rank-stabilized LoRA scaling |
FastLanguageModel.for_training()
Enable training mode: disables KV caching, enables dropout.
FastLanguageModel.for_inference()
Enable inference mode: activates KV caching, disables dropout. Always call before generating.
FastLanguageModel.convert()
Convert a HuggingFace model to MLX format. Optionally quantize and upload.
| Parameter | Type | Description |
|---|---|---|
hf_model | str | HuggingFace model ID (e.g., "meta-llama/Llama-3-8B") |
output_dir | str | Local directory to save the converted model |
quantize | bool | Whether to quantize during conversion |
q_bits | int | Quantization bits (4 or 8) |
dtype | str, optional | Target dtype (e.g., "float16") |
upload_repo | str, optional | HuggingFace repo ID to upload the converted model |
MLXModelWrapper
`mlx_tune.model`
Wrapper providing Unsloth-compatible methods on MLX models. Returned by FastLanguageModel.from_pretrained().
model.generate()
Generate text from a prompt. Call FastLanguageModel.for_inference(model) first.
| Parameter | Type | Description |
|---|---|---|
prompt | str | Input text or formatted chat prompt |
max_tokens | int | Maximum number of tokens to generate |
temperature | float | Sampling temperature (0.0 = greedy) |
top_p | float | Nucleus sampling probability |
min_p | float | Minimum probability filter (recommended: 0.1) |
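A short sketch contrasting greedy and sampled decoding with the parameters above (the wrapper function is illustrative, not part of the API):

```python
def compare_decoding(model, prompt: str):
    """Greedy vs. sampled generation; assumes the mlx_tune API above."""
    from mlx_tune import FastLanguageModel

    FastLanguageModel.for_inference(model)  # enable KV caching first
    # temperature=0.0 picks the most likely token at every step
    greedy = model.generate(prompt, max_tokens=64, temperature=0.0)
    # Sampled decoding with nucleus (top_p) and min-p filtering
    sampled = model.generate(
        prompt, max_tokens=64, temperature=0.7, top_p=0.9, min_p=0.1
    )
    return greedy, sampled
```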
model.save_pretrained()
Save LoRA adapters only. Writes adapters.safetensors and adapter_config.json.
model.save_pretrained_merged()
Fuse LoRA weights into the base model and save the full merged model.
model.save_pretrained_gguf()
Export to GGUF format for use with Ollama, llama.cpp, and other inference engines.
GGUF export works with non-quantized models only; the inability to export quantized (4-bit) models is an mlx-lm limitation, not an mlx-tune bug. Load the model with load_in_4bit=False before exporting.
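To work around the limitation, reload the base model without quantization and export from there; a sketch (the helper name `export_gguf` is illustrative):

```python
def export_gguf(repo: str, out_dir: str = "gguf") -> None:
    """Reload a model without 4-bit quantization, then export to GGUF."""
    from mlx_tune import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        repo,
        max_seq_length=2048,
        load_in_4bit=False,  # GGUF export requires a non-quantized model
    )
    model.save_pretrained_gguf(out_dir, tokenizer, quantization_method="q4_k_m")
```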
model.push_to_hub()
Push the model or adapters to HuggingFace Hub.
model.stream_generate()
Stream tokens as they are generated instead of waiting for the full completion.
SFT Training
`mlx_tune.sft_trainer`
SFTTrainer
Supervised fine-tuning trainer. API-compatible with TRL’s SFTTrainer. Automatically detects dataset format (Alpaca, ShareGPT, ChatML) and converts to the correct training format.
| Parameter | Type | Description |
|---|---|---|
model | MLXModelWrapper | Model with LoRA adapters configured |
train_dataset | Dataset | HuggingFace dataset or list of dicts |
tokenizer | Tokenizer | Tokenizer from from_pretrained() |
eval_dataset | Dataset, optional | Evaluation dataset |
args | SFTConfig | Training configuration |
SFTConfig
Training configuration. Compatible with TRL’s SFTConfig parameters.
| Parameter | Default | Description |
|---|---|---|
output_dir | "outputs" | Directory for checkpoints and logs |
per_device_train_batch_size | 2 | Batch size per device |
gradient_accumulation_steps | 4 | Number of gradient accumulation steps |
learning_rate | 2e-4 | Peak learning rate |
max_steps | -1 | Total training steps (-1 = use epochs) |
max_seq_length | 2048 | Maximum sequence length |
optim | "adam" | Optimizer (use "adam" for MLX) |
warmup_steps | 5 | Linear warmup steps |
lr_scheduler_type | "linear" | LR scheduler: linear, cosine, constant |
logging_steps | 1 | Log metrics every N steps |
save_steps | 500 | Save checkpoint every N steps |
weight_decay | 0.01 | Weight decay for regularization |
seed | 3407 | Random seed for reproducibility |
RL Trainers
`mlx_tune.rl_trainers`
All RL trainers use proper loss implementations with full log-probability computation, not thin wrappers around SFT.
DPOTrainer
Direct Preference Optimization. Uses proper DPO loss with log-probability computation over chosen/rejected pairs.
DPOConfig
| Parameter | Default | Description |
|---|---|---|
beta | 0.1 | KL penalty coefficient (higher = more conservative) |
learning_rate | 5e-7 | Lower than SFT to avoid reward hacking |
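A sketch of the preference-pair dataset format and trainer wiring, assuming DPOTrainer/DPOConfig accept TRL-style constructor arguments (the wiring function is illustrative):

```python
# Preference pairs in the standard (prompt, chosen, rejected) format
preference_data = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
]

def train_dpo(model, tokenizer):
    """Illustrative wiring using the DPOConfig defaults above."""
    from mlx_tune import DPOTrainer, DPOConfig

    trainer = DPOTrainer(
        model=model, tokenizer=tokenizer, train_dataset=preference_data,
        args=DPOConfig(beta=0.1, learning_rate=5e-7, max_steps=50),
    )
    trainer.train()
```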
ORPOTrainer
Odds Ratio Preference Optimization. Combines SFT loss with odds-ratio preference alignment.
GRPOTrainer
Group Relative Policy Optimization (DeepSeek R1 style). Generates multiple completions per prompt and optimizes based on relative rewards.
GRPOConfig
| Parameter | Default | Description |
|---|---|---|
num_generations | 4 | Number of completions generated per prompt |
beta | 0.04 | KL divergence coefficient |
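GRPO needs one or more reward functions to score the generated completions. The sketch below uses the TRL-style convention (a list of completion strings in, one float per completion out); the `reward_funcs` argument name is an assumption borrowed from TRL:

```python
import re

def format_reward(completions, **kwargs):
    """Score 1.0 for completions that wrap their answer in <answer> tags."""
    return [
        1.0 if re.search(r"<answer>.*?</answer>", text, re.S) else 0.0
        for text in completions
    ]

def train_grpo(model, tokenizer, prompts):
    """Illustrative wiring using the GRPOConfig defaults above."""
    from mlx_tune import GRPOTrainer, GRPOConfig

    trainer = GRPOTrainer(
        model=model, tokenizer=tokenizer, train_dataset=prompts,
        reward_funcs=[format_reward],
        args=GRPOConfig(num_generations=4, beta=0.04, max_steps=50),
    )
    trainer.train()
```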
KTOTrainer
Kahneman-Tversky Optimization. Works with binary feedback (desirable/undesirable) instead of paired preferences. Supports both TRL format (prompt+completion+label) and legacy format (text+label).
KTOConfig
| Parameter | Default | Description |
|---|---|---|
beta | 0.1 | Temperature coefficient for KL penalty |
desirable_weight | 1.0 | Weight for desirable (positive) examples |
undesirable_weight | 1.0 | Weight for undesirable (negative) examples |
learning_rate | 5e-7 | Learning rate |
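A sketch of the TRL-format binary-feedback dataset (prompt + completion + label) and trainer wiring; the wiring function is illustrative:

```python
# Binary feedback: one example per row, label marks desirability
kto_data = [
    {"prompt": "Explain LoRA briefly.",
     "completion": "LoRA adds low-rank adapter matrices to frozen weights.",
     "label": True},   # desirable
    {"prompt": "Explain LoRA briefly.",
     "completion": "I don't know.",
     "label": False},  # undesirable
]

def train_kto(model, tokenizer):
    """Illustrative wiring using the KTOConfig defaults above."""
    from mlx_tune import KTOTrainer, KTOConfig

    trainer = KTOTrainer(
        model=model, tokenizer=tokenizer, train_dataset=kto_data,
        args=KTOConfig(beta=0.1, desirable_weight=1.0,
                       undesirable_weight=1.0, learning_rate=5e-7),
    )
    trainer.train()
```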
SimPOTrainer
Simple Preference Optimization. No reference model needed — uses length-normalized log probabilities as implicit rewards. More memory efficient than DPO.
SimPOConfig
| Parameter | Default | Description |
|---|---|---|
beta | 2.0 | Temperature coefficient (typically higher than DPO) |
gamma | 0.5 | Target reward margin between chosen and rejected |
learning_rate | 5e-7 | Learning rate |
Continual Pretraining
`mlx_tune.cpt_trainer`
Continual pretraining (CPT) lets you adapt a pretrained model to a new domain or language using raw text. Unlike SFT, loss is computed on all tokens (not just responses). Supports both LoRA-based CPT and full-weight training.
- Language adaptation — Teach a model a new language with raw text corpora
- Domain knowledge — Inject domain-specific knowledge (medical, legal, scientific)
- Code capabilities — Extend a model’s programming language coverage
CPTTrainer
Continual pretraining trainer. Trains on raw text with loss on all tokens. Optionally trains embed_tokens and lm_head layers with a separate (decoupled) learning rate.
| Parameter | Type | Description |
|---|---|---|
model | MLXModelWrapper | Model with LoRA adapters (or base model for full-weight CPT) |
train_dataset | Dataset | Raw text dataset (each sample has a "text" field) |
tokenizer | Tokenizer | Tokenizer from from_pretrained() |
args | CPTConfig | Training configuration |
CPTConfig
Configuration for continual pretraining. Extends SFTConfig with CPT-specific options.
| Parameter | Default | Description |
|---|---|---|
output_dir | "outputs" | Directory for checkpoints and logs |
learning_rate | 5e-5 | Learning rate for LoRA / main parameters |
include_embeddings | True | Auto-add embed_tokens + lm_head to target modules and unfreeze them |
embedding_learning_rate | lr/5 | Decoupled learning rate for embed_tokens/lm_head (defaults to learning_rate / 5) |
per_device_train_batch_size | 2 | Batch size per device |
max_steps | -1 | Total training steps (-1 = use epochs) |
max_seq_length | 2048 | Maximum sequence length for chunking raw text |
LoRA CPT Example
```python
from mlx_tune import FastLanguageModel, CPTTrainer, CPTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "mlx-community/SmolLM2-360M-Instruct",  # Use base model for CPT
    max_seq_length=2048,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)

# Raw text dataset — loss on ALL tokens (no chat template)
dataset = [{"text": "Your domain-specific text here..."}, ...]

trainer = CPTTrainer(
    model=model, train_dataset=dataset, tokenizer=tokenizer,
    args=CPTConfig(
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,  # 10x smaller for embeddings
        include_embeddings=True,       # Auto-adds embed_tokens + lm_head
        max_steps=100,
    ),
)
trainer.train()
```
Examples
- Example 43 — Language adaptation (teach a new language)
- Example 44 — Domain knowledge injection
- Example 45 — Code capabilities extension
- Example 46 — LFM2 + Continual Pretraining
MoE Fine-Tuning
Mixture of Experts
mlx-tune automatically detects MoE architectures and applies per-expert LoRA via LoRASwitchLinear. No special API is needed: use the same FastLanguageModel and SFTTrainer as for dense models.
When you pass target_modules=["gate_proj", ...], mlx-tune inspects the model’s actual layer structure and resolves paths dynamically:
- Expert layers: `mlp.switch_mlp.gate_proj` → `LoRASwitchLinear` (per-expert LoRA)
- Shared experts: `mlp.shared_expert.gate_proj` → `LoRALinear`
- Dense layers: `mlp.gate_proj` → `LoRALinear` (mixed architectures)
- Router: `mlp.gate` is automatically excluded (not fine-tuned)
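Because no special API is needed, an MoE fine-tune looks exactly like the dense case. A sketch using one of the supported model IDs from the table below (the `up_proj`/`down_proj` targets for the MLP layers are an assumption, not confirmed by this page):

```python
def load_moe_for_finetuning():
    """Load an MoE model and attach LoRA with the same dense-model API."""
    from mlx_tune import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        "mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit",
        max_seq_length=2048,
        load_in_4bit=True,
    )
    # Expert layers get per-expert LoRA (LoRASwitchLinear) automatically;
    # the router (mlp.gate) is excluded from fine-tuning.
    model = FastLanguageModel.get_peft_model(
        model, r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )
    return model, tokenizer
```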
Supported MoE Models
| Model | Total / Active Params | MLX Model ID |
|---|---|---|
| Arcee Trinity-Nano (AFMoE) | 6B / 1B | mlx-community/Trinity-Nano-Preview-4bit — 128 experts + 1 shared, sigmoid routing, gated attention |
| Gemma 4 26B-A4B | 26B / ~4B | mlx-community/gemma-4-26b-a4b-it-4bit (VLM path) |
| Qwen3.5-35B-A3B | 35B / 3B | mlx-community/Qwen3.5-35B-A3B-4bit |
| Qwen3-30B-A3B | 30B / 3B | mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit |
| Phi-3.5-MoE | 42B / 6.6B | mlx-community/Phi-3.5-MoE-instruct-4bit |
| Mixtral-8x7B | 46B / 12B | mlx-community/Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx |
| ...and all other MoE architectures supported by mlx-lm (39+ total) | ||
Embedding Models
`mlx_tune.embeddings`
Fine-tune sentence embedding models for semantic search using contrastive learning (InfoNCE loss). Supports BERT, ModernBERT, Qwen3-Embedding, Microsoft Harrier, and other sentence-transformers compatible architectures.
The architecture is auto-detected from the model’s config.json model_type, and LoRA targets are resolved per architecture:
- BERT / XLM-RoBERTa: `attention.self.{query, key, value}`
- ModernBERT: `attn.Wqkv` (fused QKV)
- Qwen3 / Gemma3 / Harrier: `self_attn.{q_proj, k_proj, v_proj, o_proj}`
Supported Models
| Model | Architecture | Pooling | HuggingFace ID |
|---|---|---|---|
| all-MiniLM-L6-v2 | BERT | Mean | mlx-community/all-MiniLM-L6-v2-bf16 |
| Qwen3-Embedding-0.6B | Qwen3 | Last Token | mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ |
| Harrier 0.6B | Qwen3 | Last Token | microsoft/harrier-oss-v1-0.6b |
| Harrier 270M | Gemma3 | Last Token | microsoft/harrier-oss-v1-270m |
| ...and other models supported by mlx-embeddings (BERT, XLM-RoBERTa, ModernBERT, etc.) | |||
FastEmbeddingModel
Load a sentence embedding model. Architecture and LoRA targets are auto-detected.
| Parameter | Type | Description |
|---|---|---|
model_name | str | HuggingFace model ID or local path |
max_seq_length | int | Maximum token length (default: 512) |
pooling_strategy | str | "mean", "cls", or "last_token" (use "last_token" for decoder-based models) |
load_in_4bit | bool | Load in 4-bit quantization |
FastEmbeddingModel.get_peft_model()
Apply LoRA adapters. Targets are auto-detected from the model architecture.
| Parameter | Type | Description |
|---|---|---|
r | int | LoRA rank (default: 16) |
target_modules | list | Override auto-detected targets (optional) |
lora_alpha | int | LoRA scaling factor (default: 16) |
lora_dropout | float | Dropout rate (default: 0.0) |
```python
from mlx_tune import FastEmbeddingModel

# BERT-based (mean pooling)
model, tokenizer = FastEmbeddingModel.from_pretrained(
    "mlx-community/all-MiniLM-L6-v2-bf16",
)

# Decoder-based (last-token pooling)
model, tokenizer = FastEmbeddingModel.from_pretrained(
    "microsoft/harrier-oss-v1-0.6b",
    pooling_strategy="last_token",
)

model = FastEmbeddingModel.get_peft_model(model, r=8, lora_alpha=16)
```
EmbeddingSFTTrainer
Train embedding models with contrastive learning. Supports InfoNCE (in-batch negatives), cosine embedding, and triplet loss.
| Config Parameter | Default | Description |
|---|---|---|
loss_type | "infonce" | "infonce", "cosine", or "triplet" |
temperature | 0.05 | InfoNCE temperature (lower = sharper) |
per_device_train_batch_size | 32 | Batch size (larger = more in-batch negatives) |
learning_rate | 2e-5 | Optimizer learning rate |
max_steps | — | Total training steps |
max_seq_length | 512 | Maximum token length |
normalize_embeddings | True | L2-normalize embeddings before loss |
```python
from mlx_tune import EmbeddingSFTTrainer, EmbeddingSFTConfig, EmbeddingDataCollator

train_data = [
    {"anchor": "What is LoRA?", "positive": "Low-rank adaptation for efficient fine-tuning."},
    # ...
]

trainer = EmbeddingSFTTrainer(
    model=model, tokenizer=tokenizer,
    data_collator=EmbeddingDataCollator(model, tokenizer),
    train_dataset=train_data,
    args=EmbeddingSFTConfig(
        loss_type="infonce", temperature=0.05,
        per_device_train_batch_size=10, max_steps=30,
    ),
)
trainer.train()

# Encode and compare
embeddings = model.encode(["query text", "document text"])
similarity = (embeddings[0] * embeddings[1]).sum().item()
```
Examples
- Example 27 — BERT/MiniLM embedding fine-tuning
- Example 28 — Qwen3-Embedding fine-tuning (4-bit)
- Example 31 — Microsoft Harrier 0.6B (cross-lingual search)
- Example 32 — Microsoft Harrier 270M (code/doc search)
Chat Templates
`mlx_tune.chat_templates`
get_chat_template()
Apply a chat template to the tokenizer. Supports 15 model families with "auto" detection from model name.
| Template | Aliases |
|---|---|
llama-3 | llama3, llama-3.1, llama-3.2 |
gemma | gemma-2, gemma2 |
qwen-2.5 | qwen25, qwen2.5 |
qwen-3 | qwen3 |
phi-3 | phi3 |
phi-4 | phi4 |
mistral-7b | mistral |
deepseek | deepseek-v2 |
command-r | cohere |
llama-2 | llama2 |
neural-chat | |
solar | |
tulu-2 | |
zephyr | |
alpaca |
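A minimal sketch, assuming get_chat_template follows the Unsloth-style signature `get_chat_template(tokenizer, chat_template=...)` (not confirmed by this page):

```python
def apply_template(tokenizer, template: str = "llama-3"):
    """Attach a named chat template to the tokenizer."""
    from mlx_tune import get_chat_template

    tokenizer = get_chat_template(tokenizer, chat_template=template)
    # "auto" would instead detect the template from the model name
    return tokenizer
```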
train_on_responses_only()
Modify the trainer to compute loss only on assistant response tokens (prompt tokens are masked). This significantly improves training quality.
```python
from mlx_tune import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```
Dataset Utilities
Helpers for detecting and converting dataset formats: "alpaca", "sharegpt", or "chatml".
Save, merge, and export
After training, you have several options for saving and deploying your model.
Save LoRA adapters

```python
model.save_pretrained("./lora_model")
```

Merge LoRA into base model

```python
model.save_pretrained_merged("./merged", tokenizer)
```

Export to GGUF

```python
model.save_pretrained_gguf("./gguf", tokenizer, quantization_method="q4_k_m")
```

Convert HuggingFace model to MLX

```python
FastLanguageModel.convert("meta-llama/Llama-3-8B", quantize=True)
```

Push to HuggingFace Hub

```python
model.push_to_hub("username/my-model")
```
GGUF export works with non-quantized models only. If you loaded with load_in_4bit=True, you must reload the base model without quantization before exporting to GGUF. This is an mlx-lm limitation.
Example scripts
Ready-to-run examples covering the full LLM fine-tuning workflow.
| Example | Description |
|---|---|
| 01 – 03 | Basics: model loading, LoRA configuration, inference |
| 04 – 07 | SFT training: dataset preparation, training loop, saving |
| 08 | Full Unsloth-compatible SFT pipeline (recommended starting point) |
| 09 | RL training overview: all 5 methods in one script |
| 21 | DPO — E2E preference tuning with Qwen3.5 |
| 22 | GRPO — Reasoning training with custom reward functions (DeepSeek R1 style) |
| 23 | ORPO — Combined SFT + preference alignment (no reference model) |
| 24 | KTO — Binary feedback training (no paired preferences needed) |
| 25 | SimPO — Simple preference optimization (no reference model) |
| 29 | MoE — Qwen3.5-35B-A3B fine-tuning (35B total, 3B active) |
| 30 | MoE — Phi-3.5-MoE fine-tuning (42B total, Microsoft) |
| 54 | Trinity-Nano (AFMoE) — instruction SFT on the 6B/1B-active MoE with 128 experts + 1 shared |
| 55 | Trinity-Nano (AFMoE) — GRPO reasoning with <reasoning>/<answer> reward shaping |
| 56 | Trinity-Nano (AFMoE) — continual pretraining with decoupled embedding LR |
| 41 | LFM2 — Liquid Foundation Model SFT fine-tuning (hybrid gated-conv + GQA) |
| 42 | LFM2 — Thinking/reasoning fine-tuning |
| 43 | CPT — Language adaptation with continual pretraining |
| 44 | CPT — Domain knowledge injection (medical/legal/scientific) |
| 45 | CPT — Code capabilities extension |
| 46 | CPT + LFM2 — Continual pretraining on Liquid Foundation Model |
Browse all examples on GitHub · See the Examples page for code snippets.