Troubleshooting

Common issues and their solutions.

Model Not Found

Error: Model 'xyz' not found

MLX-Tune uses models from the mlx-community/ organization on HuggingFace. CUDA-specific models (with -bnb-4bit suffix from unsloth/) won’t work.

# Don't use CUDA-specific models
# model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"  # Won't work!

# Use MLX community models instead
model_name = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
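If you build model names programmatically, a quick sanity check can catch CUDA-only checkpoints before a download fails. This is a hypothetical helper, not an MLX-Tune API; the naming heuristics below are assumptions based on the conventions described above:

```python
def looks_mlx_compatible(model_name: str) -> bool:
    # bitsandbytes (-bnb-4bit) checkpoints are CUDA-specific and won't load
    if model_name.endswith("-bnb-4bit"):
        return False
    # MLX-Tune expects models from the mlx-community organization
    return model_name.startswith("mlx-community/")

print(looks_mlx_compatible("mlx-community/Llama-3.2-1B-Instruct-4bit"))  # True
print(looks_mlx_compatible("unsloth/Meta-Llama-3.1-8B-bnb-4bit"))        # False
```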
Tip: Browse available models at huggingface.co/mlx-community.

Out of Memory

Symptom: Process gets killed or system becomes unresponsive during model loading or training.

Recommended models by RAM

RAM     | Recommended size | Example model
--------|------------------|------------------------------------------
16 GB   | 1B–3B, 4-bit     | mlx-community/Llama-3.2-1B-Instruct-4bit
32 GB   | Up to 7B, 4-bit  | mlx-community/Llama-3.2-7B-Instruct-4bit
48 GB   | 7B–13B, 4-bit    | mlx-community/Llama-3.1-13B-Instruct-4bit
64 GB+  | 13B+ or 8-bit    | mlx-community/Llama-3.1-70B-Instruct-4bit
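The sizes in the table follow from simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8, before KV cache and activation overhead. A back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a measured value):

```python
def approx_weight_memory_gb(n_params: float, bits: int, overhead: float = 1.2) -> float:
    # Weights only, padded by an assumed ~20% for KV cache and activations
    return n_params * bits / 8 / 1e9 * overhead

print(round(approx_weight_memory_gb(7e9, 4), 1))   # ~4.2 GB: fits comfortably in 32 GB
print(round(approx_weight_memory_gb(70e9, 4), 1))  # ~42.0 GB: needs a 64 GB+ machine
```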

Solutions

1. Pick a smaller or more aggressively quantized model (see the table above).
2. Reduce max_seq_length to lower activation and KV-cache memory.
3. Keep the batch size small and use gradient_accumulation_steps to reach your target effective batch size.
4. Close other memory-hungry applications before training.

Slow Generation

Symptom: Text generation is slower than expected.

Solutions

1. Always enable inference mode before generating:

# Always do this before inference!
FastLanguageModel.for_inference(model)

# Then generate
from mlx_lm import generate
response = generate(model.model, tokenizer,
    prompt=prompt, max_tokens=100)

GGUF Export from Quantized Models

GGUF export (save_pretrained_gguf) doesn’t work with quantized (4-bit) base models. This is a known mlx-lm limitation, not an MLX-Tune bug.

What works

Operation                                     | Status
----------------------------------------------|-------------
Training with quantized models (QLoRA)        | Works
Saving adapters (save_pretrained)             | Works
Saving merged model (save_pretrained_merged)  | Works
Inference with trained model                  | Works
GGUF export from quantized base               | Doesn’t work

Workaround 1: Use a non-quantized base model

# Use fp16 model instead of 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Llama-3.2-1B-Instruct",  # NOT -4bit
    max_seq_length=2048,
    load_in_4bit=False,  # Train in fp16
)
# Train normally, then export to GGUF
model.save_pretrained_gguf("model", tokenizer)  # Works!

Workaround 2: Dequantize during export

model.save_pretrained_gguf("model", tokenizer, dequantize=True)
# Then re-quantize with llama.cpp:
# ./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

Workaround 3: Skip GGUF entirely

If you only need the model for MLX/Python inference, use save_pretrained_merged() instead. No GGUF conversion needed.

VLM Issues

Batch size must be 1

VLM training requires per_device_train_batch_size=1 because images produce variable numbers of vision tokens. The VLMSFTTrainer enforces this automatically. Use gradient_accumulation_steps to simulate larger batch sizes.
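For example, to reach an effective batch of 8 while keeping the required per-device batch of 1 (the argument names here mirror the HF TrainingArguments style; treat the exact trainer signature as an assumption):

```python
config = {
    "per_device_train_batch_size": 1,  # required: images yield variable vision-token counts
    "gradient_accumulation_steps": 8,  # accumulate 8 forward passes before each update
}

# The optimizer performs one update per 8 samples, i.e. an effective batch of 8
effective_batch = config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
print(effective_batch)  # 8
```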

Think tags in output

Qwen3.5 models may produce <think>...</think> tags in generated text. MLX-Tune’s generate() method strips these automatically.
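If you call mlx_lm's generate directly and bypass MLX-Tune's wrapper, you can strip the tags yourself; a minimal sketch using the standard library:

```python
import re

def strip_think_tags(text: str) -> str:
    # Drop <think>...</think> blocks (non-greedy, spans newlines), then trim whitespace
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think_tags("<think>Let me reason...</think>The capital is Paris."))
# The capital is Paris.
```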

Image format

Images should be PIL Image objects. The UnslothVisionDataCollator handles conversion from datasets automatically.

Text-only VLM training

Qwen3.5 can be fine-tuned on text-only data without images. See example 11.

Getting Help