NaNsException: Tensor With All NaNs Was Produced (Solved)

NaNsException in ComfyUI Flux? Five causes: wrong VAE, FP16 overflow, CFG scale, corrupt checkpoint, bad sampler. Systematic diagnosis and verified fixes.

Published 2026-05-18 · Tags: nans exception tensor, comfyui nan error, flux nan values

NaNsException: Tensor With All NaNs Was Produced is one of the more confusing ComfyUI errors because it can have five completely different causes, each requiring a different fix. This guide covers systematic diagnosis so you identify the root cause in your specific setup rather than trying fixes at random.

What NaNs Actually Mean

NaN stands for "Not a Number" - a floating-point value that results from invalid operations such as dividing zero by zero, subtracting infinity from infinity, or taking the square root of a negative number. In neural network inference, NaN values propagate: once a single NaN appears in a tensor, nearly every arithmetic operation that reads it also returns NaN. By the time ComfyUI reports the error, the original source may be several layers back.
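A minimal Python sketch of that propagation, using only the standard library. The values and variable names are illustrative; tensor libraries behave the same way element by element.

```python
import math

inf = math.inf

# Invalid operations on infinities are a typical origin of NaN.
overflowed_difference = inf - inf
zero_times_inf = 0.0 * inf

# Once present, NaN survives ordinary arithmetic.
activations = [0.5, overflowed_difference, 2.0]
total = sum(activations)                      # NaN, because one element is NaN
scaled = [a * 0.1 for a in activations]       # only the NaN element stays NaN

# NaN never compares equal to itself, so detect it with math.isnan.
print(math.isnan(total))
print([math.isnan(a) for a in scaled])
print(overflowed_difference == overflowed_difference)
```

A reduction such as a sum or a matrix multiply mixes every element together, which is why one bad value is enough to poison a whole layer output.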

The error is thrown by ComfyUI when it detects that the output tensor from a sampler step contains exclusively NaN values. This is a safety check added in a 2023 update - older versions would silently generate black or noise images instead of crashing.

NaN causes and likelihood by setup - May 2026

| Cause                   | Probability | Affects            | Quick test                  |
|-------------------------|-------------|--------------------|-----------------------------|
| Wrong VAE for model     | High        | Flux + custom VAEs | Use bundled VAE             |
| FP16 overflow           | High        | Flux on older GPUs | Switch to bf16 or fp32      |
| CFG scale too high      | Medium      | Flux specifically  | Set CFG to 1.0              |
| Corrupt checkpoint      | Medium      | Any model          | Re-download and verify hash |
| Wrong sampler/scheduler | Medium      | Flux schnell       | Use euler + simple          |

Cause 1: Wrong VAE

Flux uses a custom 16-channel VAE, different from the standard SD 1.5 4-channel VAE and the SDXL 4-channel VAE. Using any VAE not designed for Flux will produce NaNs immediately during the decode step. This is the most common cause of NaNsException in Flux workflows.

The correct VAE file for Flux is ae.safetensors (the official Flux VAE, available from black-forest-labs on Hugging Face). Do not use vae-ft-mse-840000-ema-pruned.safetensors (SD 1.5 VAE) or sdxl_vae.safetensors (SDXL VAE) with Flux models.

$check-vae.sh
# List your VAE files and their sizes
ls -lh ComfyUI/models/vae/

# Flux ae.safetensors should be approximately 335 MB
# SD 1.5 VAE is approximately 334 MB (similar size - check filename, not size)
# SDXL VAE is approximately 334 MB (same issue)

# Download the correct Flux VAE
# Place in ComfyUI/models/vae/ae.safetensors

In your workflow, the VAELoader node must point to ae.safetensors, not any other VAE. If you use a CheckpointLoaderSimple with an all-in-one Flux checkpoint that includes the VAE, ensure the checkpoint was not compiled with a substituted VAE.
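Because the file sizes are nearly identical, one way to tell the VAEs apart is to read the latent channel count from the safetensors header. The sketch below uses only the standard library and relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header). The tensor name decoder.conv_in.weight is an assumption about how these VAE files name the decoder's first convolution; if the key is absent, the function returns None rather than guessing.

```python
import json
import struct
from pathlib import Path

# Assumed key name for the decoder's first convolution (not confirmed by the
# article); its weight shape is [out_channels, in_channels, kH, kW].
DECODER_INPUT_KEY = "decoder.conv_in.weight"


def read_safetensors_header(path: Path) -> dict:
    """Parse the JSON header: 8-byte little-endian length, then UTF-8 JSON."""
    with open(path, "rb") as handle:
        (header_len,) = struct.unpack("<Q", handle.read(8))
        return json.loads(handle.read(header_len))


def vae_latent_channels(path: Path):
    """Return the decoder's input channel count, or None if the key is absent."""
    entry = read_safetensors_header(path).get(DECODER_INPUT_KEY)
    if entry is None:
        return None
    return entry["shape"][1]
```

Calling vae_latent_channels(Path("ComfyUI/models/vae/ae.safetensors")) should report 16 for the Flux VAE and 4 for an SD 1.5 or SDXL VAE, provided the key assumption holds for your files.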

Cause 2: FP16 Overflow on Older GPUs

FP16 (half-precision float) has a maximum representable value of approximately 65,504. Flux attention layers can produce intermediate values that exceed this limit on certain GPU architectures, causing overflow to infinity and then NaN. This is particularly common on Turing (RTX 20xx) and older Ampere (RTX 30xx on some drivers) GPUs.

The fix is to use bf16 (bfloat16) instead of fp16. BF16 has the same range as fp32 (approximately 3.4 x 10^38) but with reduced precision - it cannot overflow in the same way. Most Flux fp8 checkpoints are internally stored in fp8 but compute in bf16 on supported hardware.
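The difference can be reproduced without a GPU. The sketch below uses Python's struct module, whose "e" format is IEEE half precision, and approximates bfloat16 by truncating a float32 to its top 16 bits (real hardware rounds rather than truncates, so this is a simplification). The value 70000.0 is an arbitrary stand-in for an attention intermediate above the fp16 limit.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; hardware
        # arithmetic would yield infinity instead, so mimic that here.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top_half = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_half + b"\x00\x00")[0]


attention_value = 70000.0                 # illustrative value above 65,504

fp16_value = to_fp16(attention_value)     # overflows to infinity
bf16_value = to_bf16(attention_value)     # stays finite, coarsely rounded

print(fp16_value, bf16_value)
print(fp16_value - fp16_value)            # inf - inf produces NaN
print(bf16_value - bf16_value)            # finite values subtract to 0.0
```

The bf16 result is noticeably less precise than the input, which is the trade described above: range is preserved at the cost of mantissa bits.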

$precision-flags.sh
# Force bf16 for the diffusion model and the VAE (add to ComfyUI startup)
# --bf16-vae alone only changes the VAE precision
python main.py --bf16-unet --bf16-vae --listen

# Alternatively, force fp32 for maximum stability (slower)
python main.py --force-fp32 --listen

# Check which precision ComfyUI is using
# Look in startup logs for: "VAE dtype: torch.bfloat16" or similar

Note: BF16 requires Ampere (RTX 30xx) or newer GPU. On Turing (RTX 20xx), bf16 falls back to fp32 automatically. If you are on a Turing GPU and seeing NaN with fp16, the solution is --force-fp32 at the cost of higher VRAM usage.
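The rule in this note can be written as a small helper keyed on the CUDA compute capability (Turing is 7.5; Ampere starts at 8.0). This is a sketch of the article's recommendation, not something ComfyUI does itself, and the function name is made up for illustration.

```python
from typing import List, Tuple

AMPERE_MAJOR = 8  # compute capability 8.x is Ampere; Turing is 7.5


def recommend_precision_flags(capability: Tuple[int, int]) -> List[str]:
    """Map a (major, minor) compute capability to the startup flag suggested above."""
    major, _minor = capability
    if major >= AMPERE_MAJOR:
        return ["--bf16-vae"]       # native bf16 available
    return ["--force-fp32"]         # Turing and older: avoid fp16 overflow


print(recommend_precision_flags((7, 5)))   # RTX 20xx
print(recommend_precision_flags((8, 6)))   # RTX 30xx
```

With PyTorch installed, torch.cuda.get_device_capability() returns a tuple in this form for the active GPU.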

RTX 20xx: Turing GPUs are most prone to FP16 overflow with Flux - switch to --force-fp32 (workflow/lab, May 2026)

Cause 3: CFG Scale Set for Standard Diffusion

This is a common mistake when migrating a workflow from SDXL or SD 1.5 to Flux. Standard diffusion models use CFG (classifier-free guidance) with typical values of 5-12. Flux uses a different guidance mechanism called guidance_scale with typical values of 1.0-4.0 - and the parameter is in a different node.

If you connect a KSampler CFG value of 7.0 (typical for SDXL) to a Flux model, the extremely high guidance can cause NaN through numerical instability. Flux dev is designed to work with CFG at exactly 1.0 in the KSampler (which effectively disables classifier-free guidance), with guidance controlled instead via the FluxGuidance node.

$flux-cfg-correct.json
{
  "ksampler": {
    "class_type": "KSampler",
    "inputs": {
      "cfg": 1.0,
      "sampler_name": "euler",
      "scheduler": "simple",
      "steps": 20
    }
  },
  "flux_guidance": {
    "class_type": "FluxGuidance",
    "inputs": {
      "conditioning": ["clip_encode", 0],
      "guidance": 3.5
    }
  }
}
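Why a KSampler CFG of 1.0 effectively disables classifier-free guidance follows from the standard CFG combination, sketched here on plain Python lists. The function name and the sample numbers are illustrative, not ComfyUI code.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], cfg: float) -> List[float]:
    """Standard classifier-free guidance: uncond + cfg * (cond - uncond)."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, -1.0]      # model prediction with the prompt
uncond = [0.5, 0.5]     # model prediction with the empty or negative prompt

at_one = apply_cfg(cond, uncond, 1.0)     # identical to cond
at_seven = apply_cfg(cond, uncond, 7.0)   # difference amplified sevenfold

print(at_one)
print(at_seven)
```

At 1.0 the unconditional term cancels out entirely; at 7.0 the gap between the two predictions is multiplied by seven, which is the kind of amplification that pushes values toward overflow.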

For Flux schnell (the 4-step distilled variant), set guidance to 0 or do not use the FluxGuidance node at all. Schnell was distilled without guidance and adding it causes NaN or degraded results.
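These rules are mechanical enough to check before queueing a job. The sketch below assumes a workflow exported in ComfyUI's API format (a dict of node id to class_type and inputs, as in the JSON above); the function name and messages are hypothetical.

```python
from typing import Any, Dict, List


def find_flux_cfg_issues(workflow: Dict[str, Dict[str, Any]], schnell: bool = False) -> List[str]:
    """Flag KSampler and FluxGuidance settings that contradict the rules above."""
    issues = []
    for node_id, node in workflow.items():
        kind = node.get("class_type")
        inputs = node.get("inputs", {})
        if kind == "KSampler" and inputs.get("cfg") != 1.0:
            issues.append(f"{node_id}: KSampler cfg is {inputs.get('cfg')}, expected 1.0 for Flux")
        if kind == "FluxGuidance" and schnell and inputs.get("guidance", 0) != 0:
            issues.append(f"{node_id}: FluxGuidance should be 0 or removed for schnell")
    return issues


migrated_from_sdxl = {
    "3": {"class_type": "KSampler", "inputs": {"cfg": 7.0, "steps": 20}},
    "5": {"class_type": "FluxGuidance", "inputs": {"guidance": 3.5}},
}

for problem in find_flux_cfg_issues(migrated_from_sdxl):
    print(problem)
```

Passing schnell=True additionally reports any FluxGuidance node with a non-zero guidance value.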

Cause 4: Corrupt Checkpoint File

A partially downloaded or corrupt checkpoint will load without error but produce NaN during inference. The safetensors format does not validate checksums on load - it trusts the file contents. Corruption can happen from an interrupted download, a disk write error, or antivirus software modifying the file.

To verify a checkpoint, compare its SHA-256 hash against the value published by the source. Hugging Face shows the SHA-256 of each large (LFS) file on that file's page; some repositories also list hashes in the model card or a .sha256 sidecar file.
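Where sha256sum is not available (for example on Windows), the same comparison can be sketched in Python with hashlib. The file is read in chunks so a multi-gigabyte checkpoint never has to fit in memory; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare against a published hash, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call matches_expected(Path("ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors"), published_hash) with the hash copied from the download source; a False result means the file should be deleted and downloaded again.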

$verify-checkpoint.sh
# Compute SHA-256 of your checkpoint
sha256sum ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors

# Compare to expected hash from Hugging Face model card
# Example expected hash (always verify against the actual source):
# 31a2e4c74ac13c5dd87de49f6b03e02b77e7b97e1ca4d2e5c1c0fb2f02e4a8f3

# If hashes don't match, re-download:
# wget https://huggingface.co/.../flux1-dev-fp8.safetensors

# Quick structural check (no reference hash needed)
# Confirms the safetensors header parses; silent bit flips in the weights
# still need the hash comparison above
python3 -c "
import sys
from safetensors import safe_open
try:
    with safe_open(sys.argv[1], framework='pt') as f:
        keys = list(f.keys())
    print(f'OK: header lists {len(keys)} tensors')
except Exception as e:
    print(f'CORRUPT: {e}')
" ComfyUI/models/checkpoints/flux1-dev-fp8.safetensors

Cause 5: Incompatible Sampler and Scheduler Combination

Not all sampler and scheduler combinations work with Flux. Certain combinations produce unstable sampling trajectories that generate NaN, especially on the first or last denoising step. This is more common with Flux than with older diffusion models because of the flow-matching training objective.

Verified working combinations for Flux dev and schnell:

  • euler + simple (recommended for Flux dev, 20 steps, guidance 3.5)
  • euler + sgm_uniform (alternative for Flux dev)
  • dpmpp_2m + sgm_uniform (longer generations, Flux dev)
  • euler + simple (Flux schnell, 4 steps, guidance 0)

Known problematic combinations that can produce NaN:

  • dpm_adaptive (adaptive step count - unstable with Flux at default settings)
  • uni_pc + karras (karras schedule incompatible with flow-matching models)
  • dpmpp_sde + karras (can overflow at high guidance values)
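The two lists above can be turned into a lookup for checking a workflow's settings. This is only a transcription of the combinations named in this article, with anything unlisted reported as untested; the function name is illustrative.

```python
# Combinations taken directly from the lists above.
VERIFIED_PAIRS = {
    ("euler", "simple"),
    ("euler", "sgm_uniform"),
    ("dpmpp_2m", "sgm_uniform"),
}
PROBLEMATIC_PAIRS = {
    ("uni_pc", "karras"),
    ("dpmpp_sde", "karras"),
}
PROBLEMATIC_SAMPLERS = {"dpm_adaptive"}  # flagged regardless of scheduler


def classify_combo(sampler: str, scheduler: str) -> str:
    """Return 'verified', 'problematic', or 'untested' for a Flux sampler setup."""
    if sampler in PROBLEMATIC_SAMPLERS or (sampler, scheduler) in PROBLEMATIC_PAIRS:
        return "problematic"
    if (sampler, scheduler) in VERIFIED_PAIRS:
        return "verified"
    return "untested"


print(classify_combo("euler", "simple"))
print(classify_combo("uni_pc", "karras"))
print(classify_combo("heun", "beta"))
```

An "untested" result is not a verdict either way; it only means the combination does not appear in either list.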

Systematic Diagnosis Workflow

If you are unsure which cause applies, follow this sequence in order. Step 1 sets up a clean baseline; each later step tests one cause (Step 3 covers both the CFG and the sampler settings):

  • Step 1: Load a simple test workflow - just CheckpointLoaderSimple + VAELoader + CLIPTextEncode x2 + KSampler + VAEDecode + SaveImage. No custom nodes, no LoRA, no ControlNet.
  • Step 2: Use ae.safetensors as your VAE. If the simple workflow succeeds, your original VAE was the problem.
  • Step 3: Set KSampler CFG to 1.0 and add a FluxGuidance node set to 3.5. Use euler + simple. If this fixes it, your CFG settings were the problem.
  • Step 4: Add --force-fp32 to your startup flags. If this fixes it, you had a precision overflow issue.
  • Step 5: Verify your checkpoint SHA-256. If the hash does not match, re-download the checkpoint.

$debug-nan.sh
# Step 4: Launch ComfyUI with fp32 forced
python main.py --force-fp32 --listen

# Monitor for the error during generation
# If it succeeds, you had an fp16 overflow - on Ampere or newer, switch to bf16 for better performance
python main.py --bf16-unet --bf16-vae --listen
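The order of the steps matters: the first change that makes the minimal workflow succeed names the cause. A small sketch of that logic follows; the result keys and verdict strings are hypothetical labels for Steps 2 to 5, not anything ComfyUI produces.

```python
from typing import Dict, Optional

# One entry per diagnostic step, in the order the steps are run.
DIAGNOSIS_ORDER = [
    ("vae_swap_fixed", "Wrong VAE: keep ae.safetensors"),
    ("cfg_sampler_fixed", "CFG or sampler settings: keep CFG 1.0, FluxGuidance 3.5, euler + simple"),
    ("fp32_fixed", "Precision overflow: keep --force-fp32 or move to bf16"),
    ("hash_mismatch", "Corrupt checkpoint: re-download and verify the hash"),
]


def diagnose(results: Dict[str, bool]) -> Optional[str]:
    """Return the verdict for the first step that resolved (or exposed) the problem."""
    for key, verdict in DIAGNOSIS_ORDER:
        if results.get(key, False):
            return verdict
    return None  # none of the five causes matched; suspect LoRAs or custom nodes


print(diagnose({"vae_swap_fixed": False, "cfg_sampler_fixed": True}))
```

Recording each step's outcome as you go also gives you something concrete to paste into a bug report if none of the steps helps.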

Preventing Future NaNs

Once you identify and fix your NaN source, these habits prevent recurrence: always use the bundled VAE when available (all-in-one Flux checkpoints include the correct VAE), keep CFG at 1.0 for Flux and use FluxGuidance for guidance control, and verify checkpoint hashes after downloading. When adding custom nodes, add them one at a time so you can identify which node introduces instability.

Frequently Asked Questions

Can NaN errors damage the GPU or the model file?

No. NaN values are a software-level computation error. They do not damage GPU hardware or corrupt model weights on disk. The checkpoint file is read-only during inference - ComfyUI never writes to it. The error crashes the current generation and discards the partial result, but both the GPU and the model file are unaffected.

Why does the NaN error appear at step 15 of 20, not at step 1?

NaN values may not trigger the safety check immediately. If the NaN appears in a small region of a large tensor and the rest of the values are valid, early denoising steps may complete successfully. As NaNs propagate through subsequent operations, they eventually contaminate the entire tensor. The step number where ComfyUI reports the error is when NaN coverage reaches 100%, not when the first NaN was produced.
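A toy model of this delay: a 1D signal with a single NaN, passed repeatedly through a three-tap average that stands in for any operation mixing neighbouring values. The signal length and the averaging filter are arbitrary choices for illustration; real layers mix values much more aggressively.

```python
import math
from typing import List, Optional


def smooth(values: List[float]) -> List[float]:
    """Three-tap moving average with clamped edges (a stand-in for a mixing layer)."""
    last = len(values) - 1
    return [
        (values[max(i - 1, 0)] + values[i] + values[min(i + 1, last)]) / 3.0
        for i in range(len(values))
    ]


def any_nan(values: List[float]) -> bool:
    return any(math.isnan(v) for v in values)


def all_nan(values: List[float]) -> bool:
    return all(math.isnan(v) for v in values)


def steps_until_all_nan(values: List[float], max_steps: int = 100) -> Optional[int]:
    """Count how many smoothing passes it takes for NaN to cover every element."""
    for step in range(1, max_steps + 1):
        values = smooth(values)
        if all_nan(values):
            return step
    return None


signal = [1.0] * 9
signal[4] = math.nan                      # a single bad value in the middle

after_one_pass = smooth(signal)
print(any_nan(after_one_pass), all_nan(after_one_pass))
print(steps_until_all_nan(signal))
```

After one pass the signal contains NaN but is not all NaN, so an all-NaN check stays silent; only after several passes does the check fire.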

I am using the correct VAE but still getting NaN - what else could it be?

Check whether you are running Flux through a ComfyUI workflow that was designed for SDXL - these often have incorrect CFG values (7+ instead of 1.0) and may use the wrong sampler. Also check if any LoRA or custom node is processing tensors in a way that introduces NaN - remove all LoRAs and custom nodes, test with the minimal workflow, then add them back one at a time.

Does Flux schnell require a completely different workflow than Flux dev?

The node structure is the same, but the parameters differ significantly: schnell uses 4 steps (not 20), no guidance scale (FluxGuidance node should be omitted or set to 0), and the euler + simple sampler combination. Using 20 steps with Flux schnell wastes compute and can produce artifacts. Using guidance with schnell can cause NaN because schnell was distilled without guidance conditioning.

What is the difference between bf16 and fp16 for Flux?

Both are 16-bit floating point formats but with different bit distributions. FP16 allocates 5 bits to the exponent and 10 bits to the mantissa, giving high precision but a limited range (max value approximately 65,504). BF16 allocates 8 bits to the exponent (same as fp32) and 7 bits to the mantissa, giving fp32-level range with reduced precision. For neural network inference, range matters more than mantissa precision, making bf16 more numerically stable for large-value operations like Flux attention.
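The quoted limits follow directly from the bit counts. A short sketch of the arithmetic for an IEEE-style binary format, where the top exponent code is reserved for infinity and NaN; the function names are illustrative.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value: (2 - 2^-m) * 2^(2^(e-1) - 1)."""
    max_exponent = 2 ** (exponent_bits - 1) - 1
    largest_significand = 2.0 - 2.0 ** -mantissa_bits
    return largest_significand * 2.0 ** max_exponent


def relative_step(mantissa_bits: int) -> float:
    """Spacing between 1.0 and the next representable value (machine epsilon)."""
    return 2.0 ** -mantissa_bits


fp16_max = max_finite(5, 10)     # 65504.0
bf16_max = max_finite(8, 7)      # roughly 3.39e38

print(fp16_max, relative_step(10))
print(bf16_max, relative_step(7))
```

bf16 steps are eight times coarser than fp16 steps (2^-7 versus 2^-10), but its largest finite value is more than 33 orders of magnitude higher.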

Can I use NaN-safe operations to prevent the error rather than fixing the root cause?

ComfyUI does not expose NaN-safe operation modes directly. You can add torch.nan_to_num() wrappers in custom nodes during development, but this masks the underlying problem rather than fixing it. The correct approach is always to find and fix the root cause. Masking NaN values produces garbage outputs that look like generated images but contain corrupted data.

How do I know if my GPU supports bf16 natively?

NVIDIA GPUs with native bf16 support: Ampere (RTX 30xx, A series) and newer. Turing (RTX 20xx) and older do not support bf16 natively - PyTorch will fall back to fp32 when bf16 is requested, which is slower but correct. You can verify with: python -c "import torch; print(torch.cuda.is_bf16_supported())"

I re-downloaded the checkpoint but still get NaN. Now what?

Make sure you deleted the old file before downloading again. PyTorch and ComfyUI do not cache checkpoints, but some download managers (and wget -c) resume from a partial file, so a damaged beginning survives the re-download. Delete the file completely and download fresh, then verify the SHA-256 hash before loading. If you downloaded via browser, try the Hugging Face command-line tool (installed with the huggingface_hub Python library) instead: huggingface-cli download black-forest-labs/FLUX.1-dev (the repository is gated, so accept the license on the model page and run huggingface-cli login first).