Is ComfyUI faster on M4 than M3?

Yes, meaningfully so. For Flux models, the M4 Pro cuts generation time by approximately 40-50% compared with the M3 Pro (about 1.2 minutes versus about 2.1 minutes for Flux schnell in the benchmark table in this guide), primarily due to its higher GPU core count and the higher memory bandwidth available to the MPS backend. The M4 Max is the fastest Mac chip for image generation as of May 2026, with 40 GPU cores and up to 128 GB unified memory.

Why is ComfyUI slower on my Mac than in YouTube tutorials?

Most YouTube tutorial benchmarks are recorded on CUDA GPUs (RTX cards), which are 5-20x faster than Mac MPS for the same models. If the tutorial features a Mac, check whether they used the --use-pytorch-cross-attention flag and a recent PyTorch version - older tutorials from 2023 show much slower times than current optimized configurations. Also verify that your PyTorch version is 2.3 or later for the best MPS performance.

Can I use CUDA on a Mac?

No. CUDA is NVIDIA-specific and does not run on Apple Silicon, which uses the MPS (Metal Performance Shaders) backend instead. Some older Intel Macs with NVIDIA GPUs (pre-2019) could run CUDA, but those GPUs are outdated for current AI workloads and are not supported by recent NVIDIA drivers on macOS.

What is the minimum Mac spec for running Flux?

Flux schnell at 768x768 requires approximately 10-12 GB unified memory with fp16 precision. This means a 16 GB Mac (M1 Pro or later) is the practical minimum. The base M1 with 8 GB can run smaller models (SD 1.5, SDXL with optimizations) but not Flux. For Flux dev at 1024x1024 with T5-XXL, 24 GB unified memory is recommended.

Does --force-fp16 reduce image quality on Mac?

No perceptible quality difference for standard image generation. FP16 has sufficient precision for diffusion model inference. The output is visually identical to fp32 at the resolutions commonly used (512-1536px). FP16 reduces memory usage by 50% and significantly improves generation speed on MPS compared to fp32.
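The 50% figure follows directly from the storage size of each weight: 4 bytes in fp32 versus 2 bytes in fp16. A minimal sketch of that arithmetic is below; the 1-billion-parameter model is a hypothetical round number chosen for illustration, and the function counts weights only (activations, the VAE, and text encoders add to the real footprint).

```python
# Bytes used to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gb(num_params: int, precision: str) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 1-billion-parameter model, used only for illustration.
params = 1_000_000_000
print("fp32:", weights_gb(params, "fp32"), "GB")
print("fp16:", weights_gb(params, "fp16"), "GB")
```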

My Mac shows 100% GPU utilization but generation is still slow. Is something wrong?

100% GPU utilization is expected during generation - it means the GPU is working at capacity, not that something is wrong. The generation time reflects the hardware capability at that utilization. If you want faster generation with 100% GPU utilization, the only options are a faster chip (M4 vs M3) or model quantization (switching from fp16 to a smaller GGUF quantized model).

The PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable makes my Mac run ComfyUI in CPU mode entirely. Why?

This happens when your PyTorch installation does not include MPS support (built without Metal), so every operation falls back to the CPU. Verify with: python3 -c "import torch; print(torch.backends.mps.is_available())". If this prints False, reinstall PyTorch for Mac: pip install torch torchvision torchaudio. The MPS-enabled build is the default for Apple Silicon on PyTorch.org - do not use the CUDA or CPU-only packages.
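When the one-line check prints False, it helps to know why. The sketch below separates three common causes using torch.backends.mps.is_built(), torch.backends.mps.is_available(), and platform.machine(). The function names and the wording of the messages are my own and are not part of PyTorch or ComfyUI; treat it as a starting point rather than a complete diagnostic.

```python
import importlib.util
import platform


def explain_mps_state(is_built: bool, is_available: bool, machine: str) -> str:
    """Turn the three raw facts into a human-readable diagnosis."""
    if machine != "arm64":
        # An x86_64 Python (for example under Rosetta) cannot use the Apple GPU.
        return f"Python is not running natively on Apple Silicon (machine={machine})"
    if not is_built:
        return "This PyTorch build was compiled without MPS support; reinstall it"
    if not is_available:
        return "MPS is compiled in but unavailable; check that macOS is 12.3 or later"
    return "MPS is ready"


def check_local_install() -> str:
    """Run the diagnosis against whatever PyTorch is installed here."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    return explain_mps_state(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
        platform.machine(),
    )


if __name__ == "__main__":
    print(check_local_install())
```

Run it with the same Python interpreter that launches ComfyUI, since a different virtual environment may have a different PyTorch build.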

Should I use ComfyUI Desktop for Mac or run from source?

For a smooth startup, ComfyUI Desktop is easier and handles the Python environment automatically. For maximum performance control and custom flags, running from source (python main.py ...) is better because you can pass all the optimization flags described in this guide. ComfyUI Desktop as of May 2026 supports custom launch arguments in its settings, so either approach works for applying the fixes in this article.

ComfyUI on Mac: Apple Silicon MPS Speed Issues (Fix)

ComfyUI on Apple Silicon Macs is slower than on comparable CUDA hardware, and it is often slower than it needs to be because of configuration issues rather than hardware limitations. The four fixes in this guide address the most common performance problems: wrong attention mode, missing memory limit flag, precision fallbacks, and CPU offload triggers. Real generation times across M-series chips are included so you know what to expect after applying the fixes.

How MPS Backend Works in ComfyUI

MPS (Metal Performance Shaders) is Apple's GPU compute framework, and PyTorch's MPS backend is built on top of it. ComfyUI uses the MPS device automatically when no CUDA GPU is detected. The backend maps PyTorch operations to Metal compute shaders, which run on the GPU that shares unified memory with the CPU in M-series chips.

The key architectural difference from CUDA: Apple Silicon uses unified memory shared between CPU and GPU. There is no separate VRAM - all memory is shared. This means there is no GPU memory transfer bottleneck, but it also means the GPU competes with the OS and other apps for the same memory pool. The --mac-max-memory flag addresses this directly.

ComfyUI generation times on Apple Silicon - Flux schnell, 1024x1024, 4 steps, May 2026

Chip	GPU cores	Unified memory	Flux schnell time	Config
M1	7	8 GB	Not recommended	Insufficient memory for Flux
M1 Pro	14	16 GB	~4.5 min	fp16, cross-attention
M2 Pro	19	16 GB	~3.2 min	fp16, cross-attention
M3 Pro	18	18 GB	~2.1 min	fp16, cross-attention
M4 Pro	20	24 GB	~1.2 min	fp16, cross-attention
M4 Max	40	48 GB	~35 sec	fp16, cross-attention
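"Percent faster" can mean two different things, so it is worth computing both from the table above. The sketch below converts the approximate times to seconds and reports the reduction in generation time and the throughput ratio between any two chips. The helper names are illustrative, and the inputs are the rounded figures from the table, so the results are approximate too.

```python
# Approximate Flux schnell times from the table above, converted to seconds.
TIMES_S = {
    "M1 Pro": 270,  # ~4.5 min
    "M2 Pro": 192,  # ~3.2 min
    "M3 Pro": 126,  # ~2.1 min
    "M4 Pro": 72,   # ~1.2 min
    "M4 Max": 35,   # ~35 sec
}


def time_reduction(old: str, new: str) -> float:
    """Fraction of generation time saved when moving from old to new."""
    return 1 - TIMES_S[new] / TIMES_S[old]


def speedup(old: str, new: str) -> float:
    """How many times more images per hour new produces than old."""
    return TIMES_S[old] / TIMES_S[new]


print(f"M3 Pro -> M4 Pro: {time_reduction('M3 Pro', 'M4 Pro'):.0%} less time, "
      f"{speedup('M3 Pro', 'M4 Pro'):.2f}x throughput")
```

For the M3 Pro to M4 Pro step this gives roughly a 43% shorter generation time, which is the same change as 1.75x the throughput.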

Fix 1: Use the Correct Attention Mode

The most impactful fix for MPS speed. The default attention mode in ComfyUI (split attention or sdp) does not always select the optimal implementation for MPS. Using --use-pytorch-cross-attention forces PyTorch's native cross-attention implementation, which is better optimized for MPS on Apple Silicon than the split attention fallback.

$start-mps-correct.sh

# Start ComfyUI with the correct attention mode for Apple Silicon
python main.py --use-pytorch-cross-attention --listen

# If you are running from the ComfyUI launcher app, edit:
# ~/Library/Application Support/ComfyUI/user/default/comfy.settings.json
# Or add a start script with the flag

# Verify it is working by checking the startup log:
# "Using pytorch cross attention"

On M3 and M4 chips, this flag alone can cut generation time by 30-50% compared to the default attention mode. On M1 and M2, the improvement is smaller (15-25%) because their MPS implementations are less mature for cross-attention operations.

40%: typical generation time reduction on M3 Pro when switching from the default attention mode to --use-pytorch-cross-attention (workflow/lab, May 2026).

Fix 2: Set the Memory Limit Flag

On Apple Silicon, PyTorch with MPS defaults to a conservative share of the unified memory pool. When a model does not fit in that share, tensors are offloaded unnecessarily, and the extra copies slow generation even though the CPU and GPU share the same physical RAM.

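The --mac-max-memory flag tells PyTorch to use more of the unified memory pool. The value is described as either a fraction of total memory (0.0 to 1.0) or an absolute size; the examples below pass the size in megabytes (12288 = 12 GB), so confirm the flag and its accepted units with python main.py --help for your ComfyUI version before relying on a value. For a Mac with 16 GB unified memory, a 12 GB limit (0.75 of total) leaves enough for the OS while letting PyTorch use most of the available memory.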

$mac-memory-flags.sh

# 16 GB Mac: use 12 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 12288 --listen
# 12288 MB = 12 GB

# 24 GB Mac: use 18 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 18432 --listen

# 36 GB Mac: use 28 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 28672 --listen

# 48 GB Mac: use 40 GB for ComfyUI
python main.py --use-pytorch-cross-attention --mac-max-memory 40960 --listen

Reserve at least 4 GB for the OS and background processes. Going above 90% of total memory causes the OS to swap aggressively, which is worse than the conservative default. 70-80% of total memory is the recommended range.
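The two rules above (stay in the 70-80% range, and always leave at least 4 GB for the OS) can be combined into one small calculation. This is a sketch under those stated rules only; the function name and its default of 0.75 are my own choices, and the result is in megabytes to match the examples in this section.

```python
def recommended_limit_mb(total_gb: int, fraction: float = 0.75,
                         min_reserve_gb: int = 4) -> int:
    """Memory limit in MB: the smaller of fraction*total and total-reserve."""
    by_fraction = total_gb * fraction
    by_reserve = total_gb - min_reserve_gb
    return int(min(by_fraction, by_reserve) * 1024)


for total in (16, 24, 36, 48):
    print(f"{total} GB Mac -> {recommended_limit_mb(total)} MB")
```

With the default fraction this reproduces the 16 GB and 24 GB values used above. The 36 GB and 48 GB examples in this guide sit slightly higher than 0.75, so pass a larger fraction (for example 0.8) if you want to match them.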

Fix 3: Force FP16 Precision

ComfyUI may fall back to fp32 on MPS in some configurations, particularly with custom nodes or older PyTorch builds. FP32 doubles memory usage and is significantly slower on MPS than fp16. Force fp16 explicitly:

$force-fp16-mac.sh

# Force fp16 for both the model and VAE
python main.py --use-pytorch-cross-attention --force-fp16 --listen

# Verify in the startup log:
# "Using fp16 precision" or "Model dtype: torch.float16"

# Check current PyTorch MPS support status
python3 -c "
import torch
print('MPS available:', torch.backends.mps.is_available())
print('MPS built:', torch.backends.mps.is_built())
print('Torch version:', torch.__version__)
"

Note: bf16 is not supported on all M-series chips via MPS. M2 and later have native bf16 support in MPS, but M1 does not. If you are on M1 and ComfyUI reports a bf16 error, use fp16 (--force-fp16) instead. On M2+, bf16 is worth testing as it can provide better numerical stability than fp16 without the speed penalty of fp32.

Fix 4: Avoid CPU Fallback Triggers

Certain operations in custom nodes or non-standard workflow configurations cause PyTorch to fall back from MPS to CPU for those operations. For small tensors, CPU execution on an M-series chip is not much slower than MPS, but each fallback forces the GPU to synchronize and hand data to the CPU and back, and that overhead can add significant latency when it happens on every step.

Operations that commonly trigger CPU fallback on MPS:

torch.nonzero() - used by some masking nodes
Certain scatter operations in ControlNet nodes
F.interpolate with mode="bicubic" and align_corners in older PyTorch versions
Any node that calls .cpu() or .numpy() explicitly on tensors

$mps-fallback.sh

# Enable MPS fallback to CPU (prevents crashes but is slower)
# This environment variable allows CPU fallback instead of erroring
export PYTORCH_ENABLE_MPS_FALLBACK=1
python main.py --use-pytorch-cross-attention --listen

# To identify which operations are falling back:
# With PYTORCH_ENABLE_MPS_FALLBACK=1 set, PyTorch prints a warning naming the operator
# (for example aten::nonzero) the first time it falls back to the CPU during generation

PYTORCH_ENABLE_MPS_FALLBACK=1 prevents crashes but does not prevent the performance penalty. Use it as a diagnostic tool to keep ComfyUI running, then identify and replace the nodes causing fallbacks. If a core ComfyUI operation (not a custom node) requires CPU fallback, update to the latest ComfyUI version - MPS compatibility has improved significantly with each release.
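To find the offending nodes, it can help to save the ComfyUI console output to a file and extract the operator names from it. The sketch below assumes PyTorch's fallback warning contains the phrase "The operator '...' is not currently supported on the MPS backend"; that wording is an assumption based on recent PyTorch releases and may differ in yours, so adjust the pattern if nothing matches. The sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed wording of the PyTorch fallback warning; adjust if your version differs.
FALLBACK_RE = re.compile(
    r"The operator '([^']+)' is not currently supported on the MPS backend"
)


def fallback_ops(log_lines):
    """Count how often each operator name appears in fallback warnings."""
    counts = Counter()
    for line in log_lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative log lines, not captured from a real run.
sample_log = [
    "got prompt",
    "UserWarning: The operator 'aten::nonzero' is not currently supported on "
    "the MPS backend and will fall back to run on the CPU.",
    "Prompt executed in 71.90 seconds",
]

for op, count in fallback_ops(sample_log).items():
    print(op, count)
```

Once you know the operator, search the custom node's source for the matching torch call to decide whether to replace or update that node.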

Comparing MPS vs CUDA for Production Use

Honest assessment: MPS is viable for development and low-volume generation (under 50 images per day) but is not competitive with CUDA for high-throughput production use. An M4 Pro generates Flux schnell images in approximately 1.2 minutes; an RTX 4080 generates the same image in approximately 8 seconds.

Where Mac with MPS makes sense:

Iterative workflow development - test locally, deploy to CUDA in production
Light production use where the hardware cost of a dedicated GPU server is not justified
Scenarios where data privacy requires on-device inference (customer data that cannot leave the machine)

Where to use CUDA instead:

Batch processing more than 50 images per day
Real-time or near-real-time API serving
Flux dev at 20+ steps (Mac times become impractical)
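The 50-images-per-day threshold is easier to judge as wall-clock time. A quick sketch using the two per-image figures quoted in this section (about 72 seconds on an M4 Pro and about 8 seconds on an RTX 4080), assuming images are generated one after another with no idle time between them:

```python
def batch_minutes(seconds_per_image: float, images: int) -> float:
    """Total minutes to generate a batch sequentially."""
    return seconds_per_image * images / 60


for label, seconds in (("M4 Pro (MPS)", 72), ("RTX 4080 (CUDA)", 8)):
    print(f"{label}: {batch_minutes(seconds, 50):.1f} min for 50 images")
```

At these rates, 50 Flux schnell images occupy the Mac for about an hour, compared with under seven minutes on the RTX 4080.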

Model Compatibility on Mac

Not every Flux configuration fits in memory on every Mac. T5-XXL in fp16 requires 9.9 GB of the unified memory pool, leaving limited headroom on 16 GB Macs. For 16 GB Macs, use Flux schnell (4 steps, no T5-XXL required in the minimal configuration) or a GGUF quantized Flux model to reduce memory usage.

$mac-memory-check.sh

# Check available unified memory on Mac
system_profiler SPHardwareDataType | grep "Memory:"

# For 16 GB Mac, recommended Flux configuration:
# - Flux schnell fp8 (approximately 9 GB)
# - Skip T5-XXL text encoder (use CLIP-L only for initial testing)
# - Resolution: 768x768 to stay within memory budget
# - Steps: 4 (schnell only)
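A rough budget check makes the reasoning behind this configuration explicit. The sketch below adds up the approximate component sizes quoted in this section (Flux schnell fp8 at about 9 GB, T5-XXL fp16 at 9.9 GB) and compares them with the 12 GB limit suggested for a 16 GB Mac. It counts weights only, so activations, the VAE, and CLIP-L make the real footprint somewhat larger; the function name is illustrative.

```python
def fits_budget(budget_gb: float, components: dict) -> tuple:
    """Return (fits, total_gb) for the summed component sizes."""
    total = round(sum(components.values()), 1)
    return total <= budget_gb, total


BUDGET_GB = 12.0  # the 12 GB limit suggested for a 16 GB Mac

with_t5 = {"flux_schnell_fp8": 9.0, "t5_xxl_fp16": 9.9}
without_t5 = {"flux_schnell_fp8": 9.0}

print("with T5-XXL:", fits_budget(BUDGET_GB, with_t5))
print("CLIP-L only:", fits_budget(BUDGET_GB, without_t5))
```

This is why skipping T5-XXL is suggested for initial testing on 16 GB machines: the model weights alone fit, while model plus T5-XXL does not.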

Recommended Full Configuration for Mac

The full recommended startup command for Apple Silicon Macs, combining all four fixes:

$mac-full-config.sh

# For 16 GB Mac (M2 Pro or M3)
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py \
  --use-pytorch-cross-attention \
  --force-fp16 \
  --mac-max-memory 12288 \
  --listen

# For 24+ GB Mac (M3 Pro 36 GB, M4 Pro, M4 Max)
# 18432 suits a 24 GB machine; on larger machines raise it to 70-80% of total memory
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py \
  --use-pytorch-cross-attention \
  --force-fp16 \
  --mac-max-memory 18432 \
  --listen

# Create an alias for convenience (add to ~/.zshrc)
alias comfyui="PYTORCH_ENABLE_MPS_FALLBACK=1 python ~/ComfyUI/main.py --use-pytorch-cross-attention --force-fp16 --mac-max-memory 12288 --listen"
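If you prefer a launcher script over a shell alias, the same command can be assembled in Python so the memory limit is a single parameter. This is a sketch that reuses the flags and environment variable exactly as this guide lists them; it has not been checked against any particular ComfyUI release, and build_launch and the --run switch are names invented for this example.

```python
import os
import subprocess
import sys


def build_launch(limit_mb: int, main_py: str = "main.py"):
    """Return (env, argv) for starting ComfyUI with the guide's settings."""
    # Copy the current environment and enable CPU fallback for unsupported ops.
    env = dict(os.environ, PYTORCH_ENABLE_MPS_FALLBACK="1")
    argv = [
        sys.executable, main_py,
        "--use-pytorch-cross-attention",
        "--force-fp16",
        "--mac-max-memory", str(limit_mb),
        "--listen",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch(12288, os.path.expanduser("~/ComfyUI/main.py"))
    print(" ".join(argv))
    # Only start ComfyUI when this script is called with --run.
    if "--run" in sys.argv[1:]:
        subprocess.run(argv, env=env, check=True)
```

Run it without arguments to preview the command, then add --run once the printed flags match what python main.py --help reports for your installation.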