The RTX 3060 has 12 GB of VRAM, which sounds like enough. But Flux.1 dev in fp16 requires approximately 23 GB - nearly double what your card has. Even with fp8 quantization (the most common workaround), Flux still needs 10-12 GB peak, leaving very little headroom. This guide covers five fixes that let you run Flux on a 12 GB card, with real VRAM numbers for each approach.
Understanding Why Flux Exceeds 12 GB
Flux is a transformer-based model that differs from older diffusion models in two ways that affect VRAM: it loads a dual text encoder (CLIP-L and T5-XXL), and its attention layers require large intermediate tensors at full precision. T5-XXL alone is approximately 9.9 GB in fp16 - before any image data is processed.
The good news is that most of this overhead is compressible. The fixes below address different parts of the VRAM budget.
| Variant | Precision | Peak VRAM | RTX 3060 usable? |
|---|---|---|---|
| Flux.1 dev | fp16 full | ~23 GB | No |
| Flux.1 dev | fp8 quantized | ~10-12 GB | Barely (needs tiling) |
| Flux.1 dev | nf4 GGUF | ~7-8 GB | Yes (with offload) |
| Flux.1 schnell | fp8 quantized | ~9-10 GB | Yes (with tiling) |
| Flux.1 schnell | nf4 GGUF | ~6-7 GB | Yes |
| FLUX Lite | fp8 | ~6 GB | Yes |
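The weight sizes in the table follow roughly from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic, assuming about 12 billion parameters for the Flux.1 transformer (a figure not stated above) and an illustrative function name. Real peak VRAM is higher, because activations, the text encoders, and the VAE also need room, and 4-bit formats carry some extra scaling data.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# The 12B parameter count is an assumption for illustration, not a measurement.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gib(params_billion: float, precision: str) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 2**30


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "nf4"):
        size = weight_gib(12, precision)
        print(f"Flux transformer in {precision}: ~{size:.1f} GiB of weights")
```

Halving the bytes per parameter halves the weight footprint, which is why each quantization step in the table frees several gigabytes at once.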
Fix 1: Switch to FP8 Quantization
This is the most impactful change. FP8 quantized checkpoints cut the model weight footprint from ~23 GB to ~12 GB. The quality difference compared to fp16 is minimal for most prompts and is usually hard to spot at 1024x1024 for photorealistic generations.
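Download fp8 checkpoints from Hugging Face: search for "flux1-dev-fp8" or "flux1-schnell-fp8". Place all-in-one checkpoints (which bundle the text encoders and VAE) in ComfyUI/models/checkpoints/ and load them with CheckpointLoaderSimple - no special node is required for fp8. Transformer-only fp8 files go in ComfyUI/models/unet/ and load with UNETLoader instead.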
# Verify your checkpoint file size
ls -lh ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors
# Should be approximately 11-12 GB for the fp8 dev variant
# Check that the file is valid safetensors (not corrupted)
python3 -c "from safetensors import safe_open; safe_open('ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors', framework='pt')"
If you still hit OOM after switching to fp8, the bottleneck is likely T5-XXL in full precision. Combine Fix 2 (VAE tiling) with Fix 3 (CPU offload for T5).
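File size alone does not prove that a checkpoint stores fp8 weights. As a complementary check, the sketch below reads only the JSON header of a safetensors file and counts tensors per dtype, without loading any weights. The function name is illustrative. The fp8 dtype labels you would expect to see (such as F8_E4M3) are an assumption about how the file was exported.

```python
import json
import struct
from collections import Counter


def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length,
        # followed by that many bytes of JSON describing every tensor.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return Counter(entry["dtype"] for entry in header.values())
```

Calling `dtype_counts("ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors")` on a genuine fp8 file should show most tensors under an 8-bit float dtype. A file dominated by F16 or BF16 entries is not quantized, whatever its name says.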
Fix 2: Enable VAE Tiling
VAE tiling processes the image in tiles instead of loading the full image tensor at once. It reduces VAE decode VRAM by up to 70% at 1024x1024. The tradeoff is slightly longer decode time (typically 2-4 seconds extra) and occasional tile boundary artifacts at very high resolutions.
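To make the idea concrete, here is a generic sketch of how an image can be split into overlapping tiles. It is not ComfyUI's actual tiling code, and the names `tile_starts`, `tile_plan`, `tile`, and `overlap` are illustrative.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of overlapping tiles that cover `length` pixels."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile")
    if tile >= length:
        return [0]  # one tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_plan(width: int, height: int, tile: int, overlap: int) -> list[tuple[int, int]]:
    """Top-left corners of every tile, row by row."""
    xs = tile_starts(width, tile, overlap)
    ys = tile_starts(height, tile, overlap)
    return [(x, y) for y in ys for x in xs]


if __name__ == "__main__":
    plan = tile_plan(1024, 1024, tile=512, overlap=64)
    area_ratio = (512 * 512) / (1024 * 1024)
    print(f"{len(plan)} tiles, each {area_ratio:.0%} of the full image area")
```

Each decode pass only holds one tile's worth of activations, which is where the memory saving comes from. The overlap exists so neighbouring tiles can be blended to hide seams.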
In ComfyUI, add a VAEDecodeTiled node instead of VAEDecode and set tile_size to 512. Avoid dropping to 256: smaller tiles save a little more memory but can cause visible seams at standard Flux resolutions.
{
  "vae_decode_tiled": {
    "class_type": "VAEDecodeTiled",
    "inputs": {
      "samples": ["ksampler", 0],
      "vae": ["vae_loader", 0],
      "tile_size": 512,
      "overlap": 64
    }
  }
}
Fix 3: CPU Offload for T5-XXL
T5-XXL is the large text encoder that consumes approximately 9.9 GB in fp16. Offloading T5 to CPU RAM frees nearly 10 GB of VRAM at the cost of slower text encoding (roughly 3-8 seconds longer). For single-image generation this is acceptable. For batch workflows, consider keeping T5 in VRAM and using a smaller checkpoint instead.
In ComfyUI: use the DualCLIPLoader node and set clip_type to "flux". Then add a CLIPLoader with the T5 encoder and use the --lowvram flag at startup, or enable CPU offload in the ComfyUI settings under "Extra options".
# Start ComfyUI with VRAM management flags
# --lowvram: moves models to CPU RAM when not in use
# --normalvram: default, keeps models in VRAM
# --highvram: disables all offloading (not useful for 12 GB cards)
python main.py --lowvram --listen
# For extreme cases (less than 8 GB VRAM), use:
python main.py --novram --listen
The --lowvram flag enables automatic model offloading throughout the pipeline. ComfyUI moves each model component (CLIP, VAE, UNet/transformer) to CPU RAM after use and back to VRAM when needed. This adds latency between pipeline stages but makes otherwise impossible configurations work.
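The effect of swapping components in and out of VRAM can be approximated with simple arithmetic: if everything stays resident, the weights add up, but if only the current stage is resident, the peak is the largest stage. The sketch below uses the T5-XXL and fp8 transformer figures from this guide. The CLIP-L and VAE sizes are rough assumptions, and activations are ignored entirely.

```python
# Component sizes in GB. The T5-XXL and fp8 transformer figures come from
# this guide; the CLIP-L and VAE figures are rough assumptions.
COMPONENTS_GB = {
    "clip_l": 0.25,  # assumption
    "t5_xxl_fp16": 9.9,
    "transformer_fp8": 12.0,
    "vae": 0.3,  # assumption
}


def peak_all_resident(components: dict[str, float]) -> float:
    """Peak weight memory when every component stays in VRAM."""
    return sum(components.values())


def peak_sequential(stages: list[list[str]], components: dict[str, float]) -> float:
    """Peak weight memory when only the current stage's components are in VRAM."""
    return max(sum(components[name] for name in stage) for stage in stages)


if __name__ == "__main__":
    stages = [["clip_l", "t5_xxl_fp16"], ["transformer_fp8"], ["vae"]]
    print(f"all resident:        {peak_all_resident(COMPONENTS_GB):.1f} GB")
    print(f"one stage at a time: {peak_sequential(stages, COMPONENTS_GB):.1f} GB")
```

Under these assumptions the sampling stage still sets the peak, which is why offloading works best in combination with a quantized checkpoint.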
Fix 4: Attention Optimization
PyTorch 2.0+ includes scaled dot-product attention (SDPA) that significantly reduces peak VRAM during attention computation - the most memory-intensive step in Flux. If you are running PyTorch 2.0 or later, SDPA is enabled by default. If not, upgrade PyTorch first.
# Check PyTorch version
python -c "import torch; print(torch.__version__)"
# Upgrade to PyTorch 2.x with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify SDPA is available
python -c "import torch.nn.functional as F; print(hasattr(F, 'scaled_dot_product_attention'))"
In ComfyUI settings, set Attention to "auto" (not "split"). Split attention was needed for older PyTorch versions but increases VRAM usage on modern builds. The "auto" mode selects SDPA when available.
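To get a feel for why a memory-efficient attention kernel matters, the sketch below estimates the size of the score matrix that a naive implementation would materialize in a single layer. The model details are assumptions for illustration and are not stated in this guide: 8x VAE downsampling, 2x2 latent patches, 512 text tokens, 24 attention heads, and 2-byte values.

```python
def image_tokens(width: int, height: int, vae_factor: int = 8, patch: int = 2) -> int:
    """Number of latent patches the transformer attends over (assumed layout)."""
    step = vae_factor * patch
    return (width // step) * (height // step)


def naive_scores_gib(tokens: int, heads: int = 24, bytes_per_value: int = 2) -> float:
    """Size of a fully materialized heads x tokens x tokens score matrix, in GiB."""
    return heads * tokens * tokens * bytes_per_value / 2**30


if __name__ == "__main__":
    tokens = image_tokens(1024, 1024) + 512  # 512 text tokens is an assumption
    size = naive_scores_gib(tokens)
    print(f"{tokens} tokens -> ~{size:.2f} GiB of attention scores per layer")
```

Memory-efficient kernels compute attention in blocks and never hold this full matrix, so the quadratic term largely disappears from peak VRAM.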
Additionally, install xformers if you are on a CUDA build. xformers provides memory-efficient attention that can reduce peak VRAM by 20-30% compared to standard PyTorch SDPA on older GPU architectures.
# Install xformers (CUDA 12.1 build)
pip install xformers --index-url https://download.pytorch.org/whl/cu121
# Verify xformers is loaded by ComfyUI
# Check the startup log for: "xformers version: X.X.X"
Fix 5: Reduce Resolution
Flux was trained at 1024x1024. Running at 768x768 cuts the pixel count, and with it the number of image tokens the attention layers process, by about 44%, and reduces peak VRAM by 25-35%. For thumbnails, social media crops, or iterative testing this is the fastest fix, with zero configuration needed.
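The 44% figure is a ratio of pixel counts, which a few lines of arithmetic can reproduce for any candidate resolution. The helper name is illustrative, and the result says nothing about actual VRAM, which also depends on the weights that stay resident.

```python
def pixel_ratio(width: int, height: int, base: tuple[int, int] = (1024, 1024)) -> float:
    """Fraction of the base resolution's pixels that a width x height image has."""
    return (width * height) / (base[0] * base[1])


if __name__ == "__main__":
    for w, h in [(768, 768), (896, 576), (512, 512)]:
        ratio = pixel_ratio(w, h)
        print(f"{w}x{h}: {ratio:.1%} of the pixels, {1 - ratio:.1%} fewer")
```

Peak VRAM falls by less than the pixel count does because the model weights take the same space at every resolution.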
Flux resolutions that work well on 12 GB without any other fixes (fp8 checkpoint, no tiling, no offload):
- 768x768: approximately 9.5 GB peak VRAM
- 896x576 (landscape, roughly 14:9): approximately 8.8 GB peak VRAM
- 576x896 (portrait): approximately 8.8 GB peak VRAM
- 512x512: approximately 7.2 GB peak VRAM (but Flux quality degrades below 768)
Avoid going below 768 on the shorter dimension with Flux dev for final output - the model was not trained at these sizes and tends to produce distorted or low-quality results. Keep the smaller sizes in the list above for quick tests, or use Flux schnell or a distilled variant if you need small images.
Combining Fixes: The Recommended Stack
For running Flux dev at 1024x1024 on an RTX 3060: use fp8 checkpoint (Fix 1) + VAE tiling at tile_size 512 (Fix 2) + --lowvram flag (Fix 3). This combination reliably runs within 12 GB with approximately 1-2 GB headroom. Expected generation time at this configuration: 45-90 seconds for 20 steps.
For Flux schnell (4-step model) on RTX 3060: fp8 checkpoint + VAE tiling is sufficient for 1024x1024. Expected generation time: 8-15 seconds for 4 steps. Schnell has no guidance_scale - do not connect a CFG node or you will get an error.
Diagnosing Your Specific OOM Error
The CUDA OOM error message tells you which operation ran out of memory. Read it carefully:
- "CUDA out of memory. Tried to allocate X GiB" during the first forward pass: the checkpoint itself is too large. Use fp8 or GGUF quantization.
- "CUDA out of memory" during "Attention": attention tensors are too large for your resolution. Enable SDPA, install xformers, or reduce resolution.
- "CUDA out of memory" during VAE decode: the decoded image tensor is too large. Enable VAE tiling.
- "CUDA out of memory" immediately on startup: another process is holding VRAM. Run "nvidia-smi" to check. Kill any lingering Python processes from previous ComfyUI sessions.
# Check current VRAM usage
nvidia-smi
# Check for lingering Python processes
ps aux | grep python
# Kill all Python processes (use carefully)
# pkill -f python
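If you want the same check from a script, for example before launching a long batch, the sketch below calls nvidia-smi in CSV mode and parses used and total memory per GPU. The function names are illustrative, and it assumes nvidia-smi is on the PATH.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one GPU per line."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        rows = query_gpu_memory()
    except (FileNotFoundError, subprocess.CalledProcessError):
        rows = []  # nvidia-smi is missing or failed on this machine
    for index, (used, total) in enumerate(rows):
        print(f"GPU {index}: {used} / {total} MiB used, {total - used} MiB free")
```

If a large share of memory is already in use before ComfyUI starts, another process is holding it, and the startup OOM case above applies.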