ComfyUI on Apple Silicon Macs is slower than on comparable CUDA hardware, but it is often slower than it needs to be because of configuration issues rather than hardware limitations. The four fixes in this guide address the most common performance problems: the wrong attention mode, a missing memory limit flag, precision fallbacks, and CPU fallback triggers. Real generation times across M-series chips are included so you know what to expect after applying the fixes.
How MPS Backend Works in ComfyUI
MPS (Metal Performance Shaders) is Apple's GPU compute framework, which PyTorch targets through its MPS backend. ComfyUI uses MPS automatically when no CUDA GPU is detected. The backend maps PyTorch operations to Metal compute kernels, which run on the GPU of M-series chips and work out of unified memory.
The key architectural difference from CUDA: Apple Silicon has no separate VRAM. The CPU and GPU share one unified memory pool, so there is no host-to-device transfer bottleneck, but the GPU also competes with the OS and other apps for the same memory. The --mac-max-memory flag addresses this directly.
| Chip | GPU cores | Unified memory | Flux schnell time | Config |
|---|---|---|---|---|
| M1 | 7 GPU cores | 8 GB | Not recommended | Insufficient memory for Flux |
| M1 Pro | 14 GPU cores | 16 GB | ~4.5 min | fp16, cross-attention |
| M2 Pro | 19 GPU cores | 16 GB | ~3.2 min | fp16, cross-attention |
| M3 Pro | 18 GPU cores | 18 GB | ~2.1 min | fp16, cross-attention |
| M4 Pro | 20 GPU cores | 24 GB | ~1.2 min | fp16, cross-attention |
| M4 Max | 40 GPU cores | 48 GB | ~35 sec | fp16, cross-attention |
Fix 1: Use the Correct Attention Mode
This is the most impactful fix for MPS speed. ComfyUI's default attention mode does not always select the optimal implementation for MPS. The --use-pytorch-cross-attention flag makes ComfyUI use PyTorch's built-in attention function (scaled dot-product attention in PyTorch 2.x), which is better optimized for MPS on Apple Silicon than the split attention fallback.
# Start ComfyUI with the correct attention mode for Apple Silicon
python main.py --use-pytorch-cross-attention --listen
# If you are running from the ComfyUI launcher app, edit:
# ~/Library/Application Support/ComfyUI/user/default/comfy.settings.json
# Or add a start script with the flag
# Verify it is working by checking the startup log for a line such as:
# "Using pytorch cross attention" (the exact wording varies between ComfyUI versions)
On M3 and M4 chips, this flag alone can cut generation time by 30-50% compared to the default attention mode. On M1 and M2, the improvement is smaller (15-25%) because their MPS implementations are less mature for cross-attention operations.
Fix 2: Set the Memory Limit Flag
On Apple Silicon, PyTorch's MPS backend defaults to a conservative share of the unified memory pool. Once that share is used up, tensors are offloaded unnecessarily, and the resulting copies are slow even though the CPU and GPU share the same physical RAM.
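The --mac-max-memory flag tells PyTorch to use more of the unified memory pool. The value can be a fraction of total memory (0.0 to 1.0) or an absolute amount, which the examples below give in megabytes. Flag names and units can change between ComfyUI versions, so confirm both with python main.py --help before relying on them. For a Mac with 16 GB of unified memory, a 12 GB limit (75% of the total) leaves enough for the OS while letting PyTorch use most of the available memory.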
# 16 GB Mac: use 12 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 12288 --listen
# 12288 MB = 12 GB
# 24 GB Mac: use 18 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 18432 --listen
# 36 GB Mac: use 28 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 28672 --listen
# 48 GB Mac: use 40 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 40960 --listen
Reserve at least 4 GB for the OS and background processes. Going above 90% of total memory causes the OS to swap aggressively, which is worse than the conservative default. 70-80% of total memory is the recommended range.
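The sizing rule above (roughly 75% of total memory, with at least 4 GB held back for the OS) can be sketched as a small helper. This is an illustration only and not part of ComfyUI: the function name mac_memory_limit_mb and its defaults are hypothetical, and it assumes the limit is expressed in megabytes, as in the example commands.

```python
def mac_memory_limit_mb(total_gb: float, fraction: float = 0.75,
                        reserve_gb: float = 4.0) -> int:
    """Suggest a memory limit in MB for a Mac with total_gb of unified memory.

    Hypothetical helper: takes the smaller of `fraction` of total memory and
    `total - reserve_gb`, so the OS always keeps at least `reserve_gb`.
    """
    if total_gb <= reserve_gb:
        raise ValueError("total memory does not exceed the OS reserve")
    by_fraction = total_gb * fraction      # e.g. 75% of the pool
    by_reserve = total_gb - reserve_gb     # everything except the OS reserve
    return int(min(by_fraction, by_reserve) * 1024)


if __name__ == "__main__":
    for total in (8, 16, 24, 36, 48):
        print(f"{total} GB Mac -> limit of {mac_memory_limit_mb(total)} MB")
```

For 16 GB and 24 GB the helper reproduces the 12288 and 18432 values used in the commands. For 36 GB and 48 GB it returns slightly lower numbers than the hand-picked examples because it applies a flat 75%.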
Fix 3: Force FP16 Precision
ComfyUI may fall back to fp32 on MPS in some configurations, particularly with custom nodes or older PyTorch builds. FP32 doubles memory usage and is significantly slower on MPS than fp16. Force fp16 explicitly:
# Force fp16 precision
python main.py --use-pytorch-cross-attention --force-fp16 --listen
# Verify in the startup log that the model dtype is reported as float16, for example:
# "Using fp16 precision" or "Model dtype: torch.float16" (the exact wording varies between ComfyUI versions)
# Check current PyTorch MPS support status
python3 -c "
import torch
print('MPS available:', torch.backends.mps.is_available())
print('MPS built:', torch.backends.mps.is_built())
print('Torch version:', torch.__version__)
"
Note: bf16 is not supported on all M-series chips via MPS. M2 and later have native bf16 support in MPS, but M1 does not. If you are on M1 and ComfyUI reports a bf16 error, use fp16 (--force-fp16) instead. On M2+, bf16 is worth testing as it can provide better numerical stability than fp16 without the speed penalty of fp32.
Fix 4: Avoid CPU Fallback Triggers
Certain operations in custom nodes or non-standard workflow configurations cause PyTorch to fall back from MPS to CPU for those operations. CPU operations on an M-series chip are not much slower than MPS for small tensors, but each fallback forces a synchronization and a hand-off of data between the GPU and CPU execution contexts, which can add significant latency.
Operations that commonly trigger CPU fallback on MPS:
- torch.nonzero() - used by some masking nodes
- Certain scatter operations in ControlNet nodes
- F.interpolate with mode="bicubic" and align_corners in older PyTorch versions
- Any node that calls .cpu() or .numpy() explicitly on tensors
# Enable MPS fallback to CPU (prevents crashes but is slower)
# This environment variable allows CPU fallback instead of erroring
export PYTORCH_ENABLE_MPS_FALLBACK=1
python main.py --use-pytorch-cross-attention --listen
# To identify which operations are falling back:
# With PYTORCH_ENABLE_MPS_FALLBACK=1 set, PyTorch prints a warning to the console the first time an unsupported operator falls back to the CPU, naming the operator. The warning appears while a workflow runs, not at startup.
PYTORCH_ENABLE_MPS_FALLBACK=1 prevents crashes but does not remove the performance penalty. Use it as a diagnostic tool to keep ComfyUI running, then identify and replace the nodes causing fallbacks. If a core ComfyUI operation (not a custom node) requires CPU fallback, update to the latest ComfyUI version - MPS compatibility has improved significantly with each release.
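To find which operators are falling back, one option is to save the console output to a file and scan it for PyTorch's fallback warning. The sketch below assumes the warning contains the phrase "The operator '<name>' is not currently supported on the MPS backend". That wording is an assumption and may differ between PyTorch versions, so adjust the pattern to match what your log shows. The function name find_fallback_ops and the operator aten::example_op are hypothetical.

```python
import re
from collections import Counter

# Assumed wording of PyTorch's fallback warning; adjust if your version differs.
FALLBACK_PATTERN = re.compile(
    r"The operator '([^']+)' is not currently supported on the MPS backend"
)


def find_fallback_ops(log_text: str) -> Counter:
    """Count how often each operator name appears in fallback warnings."""
    return Counter(FALLBACK_PATTERN.findall(log_text))


if __name__ == "__main__":
    # Placeholder log text; 'aten::example_op' is not a real operator name.
    sample_log = (
        "UserWarning: The operator 'aten::example_op' is not currently "
        "supported on the MPS backend and will fall back to run on the CPU.\n"
        "Prompt executed\n"
    )
    for op, count in find_fallback_ops(sample_log).most_common():
        print(f"{op}: {count} warning(s)")
```

In practice you would read the saved log with open(path).read() and pass the text to find_fallback_ops. The operators it lists point to the nodes worth replacing.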
Comparing MPS vs CUDA for Production Use
Honest assessment: MPS is viable for development and low-volume generation (under 50 images per day) but is not competitive with CUDA for high-throughput production use. An M4 Pro generates Flux schnell images in approximately 1.2 minutes; an RTX 4080 generates the same image in approximately 8 seconds.
Where Mac with MPS makes sense:
- Iterative workflow development - test locally, deploy to CUDA in production
- Light production use where the hardware cost of a dedicated GPU server is not justified
- Scenarios where data privacy requires on-device inference (customer data that cannot leave the machine)
Where to use CUDA instead:
- Batch processing more than 50 images per day
- Real-time or near-real-time API serving
- Flux dev at 20+ steps (Mac times become impractical)
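The 50-images-per-day threshold follows from simple arithmetic on the per-image times quoted above (about 1.2 minutes on an M4 Pro, about 8 seconds on an RTX 4080). A minimal sketch, with the function name chosen for illustration:

```python
def daily_generation_minutes(images_per_day: int, seconds_per_image: float) -> float:
    """Total generation time per day, in minutes, ignoring model load time."""
    return images_per_day * seconds_per_image / 60


M4_PRO_SECONDS = 72.0    # ~1.2 minutes per Flux schnell image (from this guide)
RTX_4080_SECONDS = 8.0   # ~8 seconds per image (from this guide)

if __name__ == "__main__":
    for label, secs in (("M4 Pro", M4_PRO_SECONDS), ("RTX 4080", RTX_4080_SECONDS)):
        minutes = daily_generation_minutes(50, secs)
        print(f"{label}: {minutes:.1f} min for 50 images")
    print(f"Per-image ratio: {M4_PRO_SECONDS / RTX_4080_SECONDS:.0f}x")
```

At 50 images a day the M4 Pro spends about an hour generating, which is tolerable for development. The per-image gap of roughly 9x is what makes larger batches impractical.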
Model Compatibility on Mac
Not every Flux model fits in memory on every Mac. T5-XXL in fp16 requires 9.9 GB of the unified memory pool, leaving limited headroom on 16 GB Macs. For 16 GB Macs, use Flux schnell (4 steps, no T5-XXL required in the minimal configuration) or a GGUF quantized Flux model to reduce memory usage.
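The memory arithmetic behind this advice can be sketched as follows. The sizes come from this guide (T5-XXL in fp16 at 9.9 GB, Flux schnell fp8 at roughly 9 GB, a 12 GB budget on a 16 GB Mac). The helper names are hypothetical, and the check assumes all listed components are resident at once and ignores activations, so treat it as a rough upper bound rather than a description of how ComfyUI manages memory.

```python
# Bytes per element for common weight formats.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def rescale_gb(size_gb: float, from_dtype: str, to_dtype: str) -> float:
    """Estimate the size of the same weights stored in another dtype."""
    return size_gb * BYTES_PER_ELEMENT[to_dtype] / BYTES_PER_ELEMENT[from_dtype]


def fits(components_gb: list, budget_gb: float) -> bool:
    """True if all components fit in the budget at the same time."""
    return sum(components_gb) <= budget_gb


if __name__ == "__main__":
    t5_fp16_gb = 9.9    # T5-XXL in fp16 (from this guide)
    flux_fp8_gb = 9.0   # Flux schnell fp8, approximate (from this guide)
    budget_gb = 12.0    # 12288 MB limit on a 16 GB Mac

    print("Flux fp8 alone fits:", fits([flux_fp8_gb], budget_gb))
    print("Flux fp8 + T5-XXL fp16 fits:", fits([flux_fp8_gb, t5_fp16_gb], budget_gb))
    print("T5-XXL in fp32 (GB):", round(rescale_gb(t5_fp16_gb, "fp16", "fp32"), 1))
```

The same ratio explains why an fp32 fallback is costly: every fp16 component doubles in size.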
# Check available unified memory on Mac
system_profiler SPHardwareDataType | grep "Memory:"
# For 16 GB Mac, recommended Flux configuration:
# - Flux schnell fp8 (approximately 9 GB)
# - Skip T5-XXL text encoder (use CLIP-L only for initial testing)
# - Resolution: 768x768 to stay within memory budget
# - Steps: 4 (schnell only)
Recommended Full Configuration for Mac
The full recommended startup command for Apple Silicon Macs, combining all four fixes:
# For 16 GB Mac (M2 Pro or M3)
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py \
--use-pytorch-cross-attention \
--force-fp16 \
--mac-max-memory 12288 \
--listen
# For 24+ GB Mac (M3 Pro with 36 GB, M4 Pro, M4 Max); 18432 MB suits 24 GB, so raise the limit on larger machines
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py \
--use-pytorch-cross-attention \
--force-fp16 \
--mac-max-memory 18432 \
--listen
# Create an alias for convenience (add to ~/.zshrc)
alias comfyui="PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/ComfyUI/main.py --use-pytorch-cross-attention --force-fp16 --mac-max-memory 12288 --listen"
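If you prefer a script to a shell alias, the same command can be assembled in Python so the memory limit is a single parameter. This is a sketch under the assumptions of this guide: it uses only the flags and environment variable shown above, and the function name build_launch and the default path are hypothetical.

```python
import shlex


def build_launch(memory_mb: int, main_py: str = "main.py"):
    """Return (extra_env, argv) for starting ComfyUI with the four fixes."""
    extra_env = {"PYTORCH_ENABLE_MPS_FALLBACK": "1"}
    argv = [
        "python", main_py,
        "--use-pytorch-cross-attention",
        "--force-fp16",
        "--mac-max-memory", str(memory_mb),
        "--listen",
    ]
    return extra_env, argv


if __name__ == "__main__":
    extra_env, argv = build_launch(12288)
    env_prefix = " ".join(f"{key}={value}" for key, value in extra_env.items())
    # Print the equivalent shell command rather than launching anything.
    print(env_prefix, shlex.join(argv))
```

To launch from the script instead of printing, pass argv to subprocess.run with env set to the current environment merged with extra_env.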