# Troubleshooting

Common issues and their solutions.
## Model Not Found

```
Error: Model 'xyz' not found
```
MLX-Tune uses models from the `mlx-community/` organization on Hugging Face. CUDA-specific models (such as those with a `-bnb-4bit` suffix from `unsloth/`) won't work.
```python
# Don't use CUDA-specific models
# model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"  # Won't work!

# Use MLX community models instead
model_name = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
```
Browse available models at huggingface.co/mlx-community.
## Out of Memory

Symptom: The process gets killed or the system becomes unresponsive during model loading or training.
### Recommended models by RAM

| RAM | Recommended Size | Example Model |
|---|---|---|
| 16 GB | 1B–3B, 4-bit | mlx-community/Llama-3.2-1B-Instruct-4bit |
| 32 GB | Up to 7B, 4-bit | mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| 48 GB | 7B–13B, 4-bit | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
| 64 GB+ | 13B+ or 8-bit | mlx-community/Meta-Llama-3.1-70B-Instruct-4bit |
### Solutions

- Use a smaller model
- Use 4-bit quantization (`load_in_4bit=True`)
- Reduce `max_seq_length`
- Close other applications (browsers and IDEs consume significant RAM)
- Keep macOS up to date for the latest MLX improvements
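To pick a model that fits, a back-of-the-envelope estimate is parameter count times bytes per parameter (about 0.5 bytes at 4-bit, 2 bytes at fp16), times an overhead multiplier for activations and cache. A minimal sketch; the `overhead_factor` value is an illustrative assumption, not an MLX-Tune API:

```python
def estimate_weight_gb(num_params_billion: float, bits: int = 4,
                       overhead_factor: float = 1.2) -> float:
    """Rough memory estimate: params * bytes/param * overhead multiplier.

    overhead_factor (assumed here) loosely covers activations and cache;
    real usage varies with sequence length and batch size.
    """
    bytes_per_param = bits / 8
    return num_params_billion * 1e9 * bytes_per_param * overhead_factor / (1024 ** 3)

# A 7B model at 4-bit needs roughly 3.9 GB for weights plus overhead
print(round(estimate_weight_gb(7, bits=4), 1))  # → 3.9
```

By this estimate a 7B model at 4-bit fits comfortably in 32 GB, while the same model at fp16 needs around four times as much, which matches the table above.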
## Slow Generation

Symptom: Text generation is slower than expected.

### Solutions

1. Always enable inference mode before generating:
```python
# Always do this before inference!
FastLanguageModel.for_inference(model)

# Then generate
from mlx_lm import generate

response = generate(model.model, tokenizer,
                    prompt=prompt, max_tokens=100)
```
2. Use 4-bit quantized models (faster than fp16 for inference)
3. Reduce `max_tokens` in generation calls
4. Keep macOS updated for the latest MLX optimizations
5. Close memory-heavy applications to free unified memory bandwidth
## GGUF Export from Quantized Models

GGUF export (`save_pretrained_gguf`) doesn't work with quantized (4-bit) base models. This is a known mlx-lm limitation, not an MLX-Tune bug.
### What works

| Feature | Status |
|---|---|
| Training with quantized models (QLoRA) | Works |
| Saving adapters (`save_pretrained`) | Works |
| Saving merged model (`save_pretrained_merged`) | Works |
| Inference with trained model | Works |
| GGUF export from quantized base | Doesn't work |
### Workaround 1: Use a non-quantized base model

```python
# Use an fp16 model instead of 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlx-community/Llama-3.2-1B-Instruct",  # NOT -4bit
    max_seq_length=2048,
    load_in_4bit=False,  # Train in fp16
)

# Train normally, then export to GGUF
model.save_pretrained_gguf("model", tokenizer)  # Works!
```
### Workaround 2: Dequantize during export

```python
model.save_pretrained_gguf("model", tokenizer, dequantize=True)

# Then re-quantize with llama.cpp:
# ./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M
```
### Workaround 3: Skip GGUF entirely

If you only need the model for MLX/Python inference, use `save_pretrained_merged()` instead; no GGUF conversion is needed.
## VLM Issues

### Batch size must be 1

VLM training requires `per_device_train_batch_size=1` because images produce variable numbers of vision tokens. The `VLMSFTTrainer` enforces this automatically. Use `gradient_accumulation_steps` to simulate larger batch sizes.
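Gradient accumulation keeps the per-step memory footprint of a batch size of 1 while matching the update statistics of a larger batch: gradients are summed over several forward/backward passes before each optimizer step. The effective batch size is simply the product (illustrative arithmetic, not an MLX-Tune API; the value 8 is an example):

```python
per_device_train_batch_size = 1   # fixed for VLM training
gradient_accumulation_steps = 8   # example value

# Gradients accumulate over 8 micro-batches before each optimizer
# step, so each update behaves like a batch of 8 samples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # → 8
```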
### Think tags in output

Qwen3.5 models may produce `<think>...</think>` tags in generated text. MLX-Tune's `generate()` method strips these automatically.
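If you call the lower-level mlx-lm `generate()` directly, you can strip the tags yourself. A minimal sketch of the idea (MLX-Tune's actual implementation may differ):

```python
import re

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> blocks (tags included) from model output."""
    # DOTALL lets .*? span newlines inside the think block
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think_tags("<think>step 1\nstep 2</think>The answer is 4."))
# → The answer is 4.
```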
### Image format

Images should be PIL `Image` objects. The `UnslothVisionDataCollator` handles conversion from datasets automatically.
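If your raw data stores images as file paths or bytes rather than PIL objects, a small normalization helper can coerce them before training. The `ensure_pil` function below is a hypothetical helper for illustration, not part of MLX-Tune:

```python
from io import BytesIO

from PIL import Image


def ensure_pil(image):
    """Coerce a path, raw bytes, or PIL Image into an RGB PIL Image.

    Hypothetical helper: normalizes common dataset image representations.
    """
    if isinstance(image, Image.Image):
        return image.convert("RGB")
    if isinstance(image, (bytes, bytearray)):
        return Image.open(BytesIO(image)).convert("RGB")
    if isinstance(image, str):
        return Image.open(image).convert("RGB")
    raise TypeError(f"Unsupported image type: {type(image)!r}")
```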
### Text-only VLM training

Qwen3.5 can be fine-tuned on text-only data without images. See example 11.
## Getting Help

- Check this troubleshooting page first
- Browse the examples for working code
- Open an issue on GitHub
- MLX documentation: ml-explore.github.io/mlx
- mlx-lm issues: github.com/ml-explore/mlx-lm
- mlx-vlm issues: github.com/Blaizzy/mlx-vlm