
CUDA Out of Memory: RTX 3060 + Flux Fix

RTX 3060 12GB out of VRAM with Flux? Five fixes: FP8 quantization, VAE tiling, CPU offload, attention optimization, resolution reduction. Real VRAM numbers.

Published 2026-05-18
Tags: comfyui cuda out of memory, rtx 3060 flux vram, flux vram requirements

The RTX 3060 has 12 GB of VRAM, which sounds like enough. But Flux.1 dev in fp16 requires approximately 23 GB - nearly double what your card has. Even with fp8 quantization (the most common workaround), Flux still needs 10-12 GB peak, leaving very little headroom. This guide covers five fixes that let you run Flux on a 12 GB card, with real VRAM numbers for each approach.

Understanding Why Flux Exceeds 12 GB

Flux is a transformer-based model that differs from older diffusion models in two ways that affect VRAM: it loads a dual text encoder (CLIP-L and T5-XXL), and its attention layers require large intermediate tensors at full precision. T5-XXL alone is approximately 9.9 GB in fp16 - before any image data is processed.

The good news is that most of this overhead is compressible. The fixes below address different parts of the VRAM budget.

Flux variants and VRAM requirements - May 2026
Variant         Precision      Peak VRAM   RTX 3060 usable?
Flux.1 dev      fp16 full      ~23 GB      No
Flux.1 dev      fp8 quantized  ~10-12 GB   Barely (needs tiling)
Flux.1 dev      nf4 GGUF       ~7-8 GB     Yes (with offload)
Flux.1 schnell  fp8 quantized  ~9-10 GB    Yes (with tiling)
Flux.1 schnell  nf4 GGUF       ~6-7 GB     Yes
FLUX Lite       fp8            ~6 GB       Yes

Fix 1: Switch to FP8 Quantization

The most impactful change. FP8 quantized checkpoints cut the model weight footprint from ~23 GB to ~12 GB. The quality difference compared to fp16 is minimal for most prompts - indistinguishable at 1024x1024 for photorealistic generations.
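The halving follows from the number of bits stored per weight. A rough sketch of that arithmetic, starting from the ~23 GB fp16 figure quoted above; the helper name is hypothetical, and the estimate covers weights only, so real peak usage is higher once activations and the text encoders are loaded:

```python
# Hypothetical helper: scale a known fp16 weight footprint to another bit width.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "4bit": 4}


def scaled_footprint_gb(fp16_gb: float, precision: str) -> float:
    """Estimate the weight size at `precision` from the fp16 size (weights only)."""
    return fp16_gb * BITS_PER_WEIGHT[precision] / BITS_PER_WEIGHT["fp16"]


for precision in BITS_PER_WEIGHT:
    print(f"{precision}: ~{scaled_footprint_gb(23.0, precision):.2f} GB")
```

In practice 4-bit files land somewhat above a quarter of the fp16 size, because block-wise quantization stores scale factors alongside the weights.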

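Download fp8 checkpoints from Hugging Face: search for "flux1-dev-fp8" or "flux1-schnell-fp8". Two kinds of file share these names. All-in-one checkpoints bundle the text encoders and VAE (and are noticeably larger than the transformer alone); place them in ComfyUI/models/checkpoints/ and load them with CheckpointLoaderSimple, with no special node required for fp8. Transformer-only files, which are the ones around 11-12 GB, normally belong in ComfyUI/models/unet/ (models/diffusion_models/ on newer builds) and load with the Load Diffusion Model (UNETLoader) node, alongside DualCLIPLoader and a VAE loader.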

$verify-checkpoint.sh
# Verify your checkpoint file size
ls -lh ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors
# Should be approximately 11-12 GB for fp8 dev variant

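# Check if file is valid safetensors (not corrupted)
# Use the same path as above; a bare filename only works from inside that folder
python3 -c "from safetensors import safe_open; safe_open('ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors', framework='pt'); print('header OK')"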
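
File size alone does not prove a file is fp8. As a complementary check, the sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table) and counts tensors per dtype. The function name is hypothetical; fp8 weights should appear as F8_E4M3 or F8_E5M2 rather than F16 or BF16:

```python
import json
import struct
from collections import Counter


def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 holding the JSON header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

Calling dtype_counts("ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors") returns in well under a second, because no tensor data is read.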

If you still hit OOM after switching to fp8, the bottleneck is likely T5-XXL in full precision. Move to Fix 2 (VAE tiling) and Fix 3 (CPU offload for T5) in combination.

Peak VRAM usage with Flux.1 dev fp8 at 1024x1024, no tiling, no offload: 11.9 GB (workflow/lab, May 2026)

Fix 2: Enable VAE Tiling

VAE tiling processes the image in tiles instead of loading the full image tensor at once. It reduces VAE decode VRAM by up to 70% at 1024x1024. The tradeoff is slightly longer decode time (typically 2-4 seconds extra) and occasional tile boundary artifacts at very high resolutions.

In ComfyUI, use a VAEDecodeTiled node in place of VAEDecode. Set tile_size to 512, which is also the node's default in current ComfyUI builds. Avoid much smaller tiles such as 256: they can cause visible seams at standard Flux resolutions.

$workflow-snippet.json
{
  "vae_decode_tiled": {
    "class_type": "VAEDecodeTiled",
    "inputs": {
      "samples": ["ksampler", 0],
      "vae": ["vae_loader", 0],
      "tile_size": 512,
      "overlap": 64
    }
  }
}
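
To see what tile_size and overlap mean geometrically, here is a minimal sketch of how an image can be covered with overlapping tiles. It is an illustration in pixel coordinates with hypothetical function names, not ComfyUI's implementation (the real node works on latents and blends the overlapping regions):

```python
def tile_starts(size: int, tile: int, overlap: int) -> list:
    """Start offsets along one axis so that tiles of `tile` cover `size`."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if size <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last tile is aligned to the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64) -> list:
    """Return (x, y, w, h) boxes covering a width x height image."""
    return [
        (x, y, min(tile, width), min(tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]
```

Each tile is decoded on its own, so peak memory depends on the tile size rather than on the full image size.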

Fix 3: CPU Offload for T5-XXL

T5-XXL is the large text encoder that consumes approximately 9.9 GB in fp16. Offloading T5 to CPU RAM frees nearly 10 GB of VRAM at the cost of slower text encoding (roughly 3-8 seconds longer). For single-image generation this is acceptable. For batch workflows, consider keeping T5 in VRAM and using a smaller checkpoint instead.

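In ComfyUI, load both text encoders with a single DualCLIPLoader node: select the CLIP-L and T5-XXL files and set type to "flux". Then start ComfyUI with the --lowvram flag so the encoders are moved out of VRAM once the prompt has been encoded. Recent builds may also expose a device option on the loader node that keeps the text encoders on the CPU.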

$start-comfyui.sh
# Start ComfyUI with VRAM management flags
# --lowvram: moves models to CPU RAM when not in use
# --normalvram: default, keeps models in VRAM
# --highvram: disables all offloading (not useful for 12 GB cards)

python main.py --lowvram --listen

# If --lowvram is still not enough, fall back to --novram (much slower):
python main.py --novram --listen

The --lowvram flag enables automatic model offloading throughout the pipeline. ComfyUI moves each model component (CLIP, VAE, UNet/transformer) to CPU RAM after use and back to VRAM when needed. This adds latency between pipeline stages but makes otherwise impossible configurations work.
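
The effect of offloading on the weight budget can be sketched with simple arithmetic: with everything resident, the sizes add up; with offloading, only the largest stage has to fit at any one time. The T5-XXL and fp8 transformer figures below come from this article, while the CLIP-L and VAE sizes are placeholder assumptions:

```python
# Weight sizes in GB. CLIP-L and VAE values are rough assumptions for illustration.
STAGES = {
    "t5_xxl_fp16": 9.9,
    "clip_l": 0.3,            # assumption
    "transformer_fp8": 12.0,
    "vae": 0.3,               # assumption
}


def peak_all_resident(stages: dict) -> float:
    """Every component stays in VRAM for the whole run."""
    return sum(stages.values())


def peak_with_offload(stages: dict) -> float:
    """Only the stage currently running is in VRAM."""
    return max(stages.values())


print(f"all resident: ~{peak_all_resident(STAGES):.1f} GB")
print(f"with offload: ~{peak_with_offload(STAGES):.1f} GB")
```

This counts weights only and ignores activations, but it shows why an fp8 checkpoint alone can still overflow 12 GB while T5-XXL remains loaded.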

Fix 4: Attention Optimization

PyTorch 2.0+ includes scaled dot-product attention (SDPA) that significantly reduces peak VRAM during attention computation - the most memory-intensive step in Flux. If you are running PyTorch 2.0 or later, SDPA is enabled by default. If not, upgrade PyTorch first.

$check-torch.sh
# Check PyTorch version
python -c "import torch; print(torch.__version__)"

# Upgrade to PyTorch 2.x with CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify SDPA is available (prints True on PyTorch 2.0+)
python -c "import torch.nn.functional as F; print(hasattr(F, 'scaled_dot_product_attention'))"

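ComfyUI picks the attention backend at startup through command-line flags rather than a settings entry. Do not launch with --use-split-cross-attention: split attention was needed for older PyTorch versions and is generally unnecessary on modern builds. With no attention flag set, ComfyUI selects SDPA when available; pass --use-pytorch-cross-attention to force it explicitly.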

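Additionally, install xformers if you are on a CUDA build. xformers provides memory-efficient attention that can reduce peak VRAM by 20-30% compared to standard PyTorch SDPA on older GPU architectures. Install a build that matches your PyTorch version, otherwise pip may replace your existing torch install.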

$install-xformers.sh
# Install xformers (CUDA 12.1 build)
pip install xformers --index-url https://download.pytorch.org/whl/cu121

# Verify xformers is loaded by ComfyUI
# Check the startup log for: "xformers version: X.X.X"
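
Memory-efficient attention saves VRAM by never materializing the full score matrix: keys and values are processed in chunks while a running softmax is maintained. The pure-Python sketch below illustrates that idea for a single query vector. It is a teaching example with hypothetical function names, not the kernel that xformers or PyTorch actually runs:

```python
import math


def _scores(query, keys):
    """Scaled dot products of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]


def naive_attention(query, keys, values):
    """Reference version: computes all scores at once."""
    scores = _scores(query, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def chunked_attention(query, keys, values, chunk=2):
    """Same result, but only `chunk` scores exist at any time."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        scores = _scores(query, keys[start:start + chunk])
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Both functions compute the same weighted average, so any difference between them is limited to floating-point rounding.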

Fix 5: Reduce Resolution

Flux was trained at 1024x1024. Running at 768x768 cuts the pixel count, and with it the number of image tokens the attention layers process, by about 44% (768²/1024² ≈ 0.56), and reduces peak VRAM by 25-35%. For thumbnails, social media crops, or iterative testing this is the fastest fix with zero configuration needed.
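
The 44% figure is the drop in pixel count, which carries over to image tokens. A small sketch of the arithmetic, assuming one token per 16x16 pixel block (8x VAE downsampling followed by 2x2 patches, which is an assumption about Flux's architecture rather than something stated in this guide):

```python
PIXELS_PER_TOKEN_SIDE = 16  # assumption: 8x VAE downsample, then 2x2 patches


def image_tokens(width: int, height: int) -> int:
    """Number of image tokens the transformer sees at this resolution."""
    return (width // PIXELS_PER_TOKEN_SIDE) * (height // PIXELS_PER_TOKEN_SIDE)


def reduction(base: float, reduced: float) -> float:
    """Fractional reduction going from `base` to `reduced`."""
    return 1 - reduced / base


base, small = image_tokens(1024, 1024), image_tokens(768, 768)
print(f"token reduction: {reduction(base, small):.1%}")
# A naively materialized score matrix grows with tokens squared.
print(f"naive score-matrix reduction: {reduction(base ** 2, small ** 2):.1%}")
```

With SDPA or xformers the score matrix is not fully materialized, so the observed saving sits closer to the linear token reduction than to the quadratic one.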

Resolutions that fit on 12 GB without any other fixes (fp8 checkpoint, no tiling, no offload):

  • 768x768: approximately 9.5 GB peak VRAM
  • 896x576 (landscape, 14:9): approximately 8.8 GB peak VRAM
  • 576x896 (portrait, 9:14): approximately 8.8 GB peak VRAM
  • 512x512: approximately 7.2 GB peak VRAM (but Flux quality degrades below 768)

Do not go below 768 on the shorter dimension with Flux dev - the model was not trained at these sizes and will produce distorted or low-quality results. Use Flux schnell or a distilled variant instead if you need smaller resolutions.

To run Flux dev at 1024x1024 on an RTX 3060, combine the fp8 checkpoint (Fix 1), VAE tiling at tile_size 512 (Fix 2), and the --lowvram flag (Fix 3). This combination reliably runs within 12 GB with approximately 1-2 GB headroom. Expected generation time at this configuration: 45-90 seconds for 20 steps.

For Flux schnell (4-step model) on an RTX 3060, the fp8 checkpoint plus VAE tiling is sufficient for 1024x1024. Expected generation time: 8-15 seconds for 4 steps. Schnell does not take a guidance value: leave out the FluxGuidance node used in dev workflows and keep cfg at 1.0 in the sampler.

Diagnosing Your Specific OOM Error

The CUDA OOM error message, together with the traceback printed above it, tells you which operation ran out of memory. Read both carefully:

  • "CUDA out of memory. Tried to allocate X GiB" during the first forward pass: the checkpoint itself is too large. Use fp8 or GGUF quantization.
  • "CUDA out of memory" during "Attention": attention tensors are too large for your resolution. Enable SDPA, install xformers, or reduce resolution.
  • "CUDA out of memory" during VAE decode: the decoded image tensor is too large. Enable VAE tiling.
  • "CUDA out of memory" immediately on startup: another process is holding VRAM. Run "nvidia-smi" to check. Kill any lingering Python processes from previous ComfyUI sessions.

$debug-vram.sh
# Check current VRAM usage
nvidia-smi

# Check for lingering Python processes
ps aux | grep python

# Kill all Python processes (use carefully)
# pkill -f python
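
When reading many logs, it can help to pull the requested allocation size out of the error text. A minimal sketch, assuming the message contains the usual "Tried to allocate N MiB/GiB" phrase; the function name is hypothetical:

```python
import re
from typing import Optional

_ALLOC = re.compile(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)")
_TO_GIB = {"KiB": 1 / 1024 ** 2, "MiB": 1 / 1024, "GiB": 1.0}


def requested_gib(message: str) -> Optional[float]:
    """Return the failed allocation size in GiB, or None if not found."""
    match = _ALLOC.search(message)
    if match is None:
        return None
    amount, unit = match.groups()
    return float(amount) * _TO_GIB[unit]
```

As a rule of thumb, a failure on a small request means VRAM was already almost full (weights too large, or another process holding memory), while a failure on a multi-GiB request points at a single large tensor such as an attention buffer or an untiled VAE decode.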

Frequently Asked Questions

Why does Flux need more VRAM than Stable Diffusion XL?

Flux uses a dual text encoder including T5-XXL (9.9 GB in fp16), whereas SDXL uses CLIP-L plus OpenCLIP-ViT-bigG (roughly 2.5 GB combined). T5-XXL enables better text understanding but comes at a significant VRAM cost. The Flux transformer backbone is also larger than the SDXL UNet.

Does fp8 quantization hurt image quality significantly?

For most generations at 1024x1024 or below, fp8 output is visually indistinguishable from fp16. Subtle differences may appear in fine-grained textures at very high resolutions or with complex prompts. Community benchmarks consistently show less than 2% degradation in standard image quality metrics.

What is the difference between --lowvram and --novram?

--lowvram offloads model components to CPU RAM when not in use, moving them back to VRAM for each processing step. --novram goes further: it keeps as little as possible resident in VRAM and loads model components only when absolutely required. It is extremely slow (10-20x slower than native VRAM) and is a last resort for cards with less than 4 GB VRAM or for cases where --lowvram still runs out of memory. Systems with no GPU at all need the separate --cpu flag instead.

Can I run Flux.1 dev on 8 GB VRAM?

Yes, with the right configuration: a 4-bit GGUF quantized checkpoint (approximately 6-7 GB), the --lowvram flag, VAE tiling, and a lower resolution like 768x768. The Q4_K_S GGUF variant from city96 is a good fit for this use case. Expect generation times of 2-5 minutes per image at 20 steps.

What is the GGUF format and where do I get it?

GGUF is a quantization format originally from the llama.cpp ecosystem, adapted for image models by the community. GGUF quantized Flux models are available on Hugging Face under "city96/FLUX.1-dev-gguf". You need the ComfyUI-GGUF custom node to load them (github.com/city96/ComfyUI-GGUF). Q4_K_S is the recommended balance of size and quality for 8-12 GB cards.
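
To confirm that a download really is a GGUF file before wiring it into a workflow, the fixed-size header can be read with the standard library. This sketch assumes GGUF version 2 or later, where the two counts are 64-bit integers; the function name is hypothetical:

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the magic, version, and counts from the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        # Assumption: version >= 2, which stores both counts as uint64.
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A ValueError here usually means the file is a safetensors checkpoint with the wrong extension or an interrupted download.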

Why does nvidia-smi show available VRAM but ComfyUI still crashes?

nvidia-smi shows total free VRAM, but PyTorch has its own memory allocator that fragments VRAM and cannot always use all nominally free memory. The actual contiguous allocation available to PyTorch is often 1-2 GB less than what nvidia-smi reports. Additionally, nvidia-smi may not reflect GPU driver overhead and display memory (the desktop compositor uses VRAM on Linux too).
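
One way to account for that gap is to subtract a safety margin from the number nvidia-smi reports before deciding whether a configuration will fit. The sketch below parses the CSV query output; the 1.5 GiB default margin is an assumption taken from the middle of the 1-2 GB range above, and the function names are hypothetical:

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list:
    """Parse 'used, total' rows (MiB), one row per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def usable_free_gib(used_mib: int, total_mib: int, margin_gib: float = 1.5) -> float:
    """Free VRAM in GiB after subtracting an allocator/driver margin."""
    return max((total_mib - used_mib) / 1024 - margin_gib, 0.0)


def query_gpus() -> list:
    """Run nvidia-smi and return (used, total) MiB per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

On a machine with an NVIDIA driver installed, usable_free_gib(*query_gpus()[0]) gives a more conservative estimate for the first GPU than the raw free figure.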

Does xformers improve generation quality or just memory?

xformers provides memory-efficient attention that is mathematically equivalent to standard attention: it is purely an optimization, not an approximation, and outputs differ only by floating-point rounding. There is no visible quality change, only reduced VRAM usage and sometimes faster generation (especially on Ampere and older architectures).

What happens if I run out of VRAM mid-generation?

PyTorch raises torch.cuda.OutOfMemoryError, ComfyUI reports it, and the generation fails. The partial result is discarded, because ComfyUI does not checkpoint mid-generation. After the error, PyTorch may leave memory fragmented, so restart ComfyUI to reclaim VRAM rather than retrying immediately. The second attempt from a clean start has a higher chance of succeeding.