ComfyUI Serverless: Options, Trade-offs, and Real Benchmarks

RunPod, Modal, fal.ai, Replicate, and Comfy Cloud compared for ComfyUI serverless deployment. Verified pricing, cold start benchmarks, and custom node support — May 2026.

Published 2026-05-08 · comfyui serverless · comfyui cloud · comfyui modal
30–90s
Typical cold start on serverless ComfyUI without memory snapshotting - the time to load a Flux checkpoint from disk before the first image can generate.
workflow/lab benchmarks, May 2026
Serverless ComfyUI platforms - May 2026
Platform               | Cold start (no snapshot) | Memory snapshot          | Custom nodes             | Pricing model
-----------------------|--------------------------|--------------------------|--------------------------|------------------------------
RunPod Serverless      | 30–90s                   | Yes (pod templates)      | Full                     | Per second of GPU time
Modal                  | 15–60s                   | Via @app.cls warmup      | Full (custom containers) | Per second, ~$0.00022/s A10G
fal.ai                 | <5s (pre-warmed)         | Yes - pre-loaded workers | Limited                  | $0.003/MP Flux Schnell
Replicate              | 5–30s                    | Model-dependent          | Limited (curated models) | $0.003/image Flux Schnell
Comfy Cloud (official) | 8–10s (snapshot)         | Yes                      | Full                     | Contact for pricing

Running ComfyUI serverless means you pay only when a workflow is executing. No idle GPU. No 3am restarts. The trade-off is cold start latency: the time from request to first pixel depends entirely on which platform you choose and whether it has a model snapshot ready.

This guide covers the five main options with verified pricing and cold start data as of May 2026. Numbers are from platform documentation or independently reproduced benchmarks. Where data is unavailable, that is stated explicitly.

What 'Serverless' Means for ComfyUI

A serverless GPU worker:

  • Starts when a request arrives (or warms from a snapshot)
  • Runs the workflow
  • Shuts down after a configurable idle timeout
  • Bills only for GPU-seconds from worker start to shutdown

The key constraint: ComfyUI loads model weights into VRAM at startup. A Flux.1 fp8 checkpoint is ~12 GB. That load time is part of your cold start - and on most platforms, it is billed.
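
To put that billed load time in dollars, a rough sketch using the RunPod A100 Flex rate quoted in the comparison below (30s load is the unoptimized figure from this article):

$python
# Cost of a billed cold start: ~30s of disk-to-VRAM load for a Flux-size
# model at the A100 Flex rate. Latency, not cost, is the real pain.
A100_PER_SEC = 0.00076  # $/GPU-second (RunPod Flex, from the table below)
LOAD_SECONDS = 30       # unoptimized checkpoint load

print(f"cost per cold start: ${A100_PER_SEC * LOAD_SECONDS:.4f}")  # $0.0228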

Memory snapshotting solves this. A snapshot captures the worker's memory state after models are loaded, so the next cold start restores from the snapshot (seconds) rather than re-loading weights from disk (tens of seconds).

Platform Comparison

All prices are per GPU-second and verified from platform documentation as of May 2026. Cold start figures are for ComfyUI workflows specifically unless noted.

$comparison-table.txt
Platform     | A100 $/sec   | Cold start (no opt)  | Cold start (optimized)   | Custom nodes
-------------|--------------|---------------------|---------------------------|--------------
RunPod       | $0.000760    | 6–12s               | Under 2s (FlashBoot)      | Full support
Modal        | $0.000583    | 12–15s              | Under 3s (memory snap)    | Manual deps
fal.ai       | $0.000275    | 5–10 min (1st boot) | keep_alive=300 (warm)     | Not confirmed
Replicate    | $0.001400    | 10–180s             | None built-in             | Popular pre-installed
Comfy Cloud  | 0.266 cr/s   | Not published       | Not published             | Approved nodes only
ComfyDeploy  | Modal infra  | ~12–15s (est.)      | ~3s snap (Modal baseline) | Full, auto-detected

Note: ComfyDeploy pricing requires login; not publicly listed. Comfy Cloud credit-to-dollar conversion is not published in their pricing page as of May 2026.

RunPod Serverless

Best for: developers who want full Docker control and the widest GPU selection.

RunPod Serverless bills per-second from worker start through full stop - cold start time is included in the bill. Two worker modes:

  • Flex: workers scale to zero. Lower availability SLA. RTX 4090: $0.00031/sec. A100 80GB: $0.00076/sec.
  • Active: always-on workers. Lower rate. RTX 4090: $0.00021/sec. A100 80GB: $0.00060/sec.

The runpod-workers/worker-comfyui GitHub template handles workflow submission and custom node installation out of the box. Send a workflow JSON to RunPod's serverless endpoint; the worker submits it to the local ComfyUI instance, waits for completion, and returns the output.

$bash
# Submit a workflow to RunPod Serverless
curl -X POST https://api.runpod.ai/v2/{endpoint_id}/run \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "workflow": { /* ComfyUI API JSON */ },
      "images": []
    }
  }'

Cold start without optimization: 6–12 seconds for large model containers. RunPod FlashBoot reduces this to under 2 seconds - it is a built-in feature, not a separate add-on.

Custom nodes: fully supported. Include nodes in your Docker image or list them for installation at worker startup via the template's COMFYUI_INSTALL_CUSTOM_NODES environment variable.
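
For the Docker-image route, a minimal sketch - the base image tag, the /comfyui path, and the node repo are illustrative, so match them to the template version you build from:

$Dockerfile
# Sketch: bake a custom node into a worker image (tag and paths illustrative)
FROM runpod/worker-comfyui:5.1.0-base

# Clone the node into ComfyUI's custom_nodes directory and install its deps
RUN git clone https://github.com/Fannovel16/comfyui_controlnet_aux \
      /comfyui/custom_nodes/comfyui_controlnet_aux \
 && pip install -r /comfyui/custom_nodes/comfyui_controlnet_aux/requirements.txt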

Modal

Best for: developers comfortable with Python-first infrastructure who want the lowest cold start with documented memory snapshotting.

Modal bills per GPU-second with no idle charge and no base fee. The Starter plan includes $30/month in free credits. Per-second rates: L4 $0.000222, A100 40GB $0.000583, H100 $0.001097.

Modal's memory snapshotting for ComfyUI achieves under 3 seconds cold start. The technique: force CPU mode during the snapshot phase (override torch.cuda.is_available() to return False), run ComfyUI through model load, snapshot the process memory with the weights resident in CPU RAM, then restore GPU access at runtime. Independently reproduced - not just marketing numbers.

$modal_comfyui.py
import modal, torch

# Patch CUDA detection so snapshot captures CPU-loaded weights
original_is_available = torch.cuda.is_available
torch.cuda.is_available = lambda: False

app = modal.App("comfyui-worker")
comfyui_image = (
    modal.Image.debian_slim()
    .pip_install(["torch", "torchvision", "torchaudio"])
    .run_commands("git clone https://github.com/comfyanonymous/ComfyUI /app")
    .run_commands("cd /app && pip install -r requirements.txt")
)

@app.cls(
    image=comfyui_image,
    gpu="A100",
    enable_memory_snapshot=True,  # key flag
    container_idle_timeout=300,
)
class ComfyUIWorker:
    @modal.enter(snap=True)  # runs during snapshot phase (CPU only)
    def load_models(self):
        # Start ComfyUI in CPU mode so weights load into CPU RAM for the snapshot
        import subprocess
        subprocess.Popen(
            ["python3", "main.py", "--cpu", "--listen"],
            cwd="/app",  # ComfyUI was cloned to /app in the image build
        )

    @modal.enter()  # runs after snapshot restore (GPU available)
    def restore_gpu(self):
        torch.cuda.is_available = original_is_available

    @modal.method()
    def run_workflow(self, workflow: dict) -> bytes:
        # Submit to local ComfyUI, poll, return output bytes
        ...

Custom nodes: supported, but you manage dependencies manually in the Modal image definition. Version conflicts between nodes must be resolved in the Dockerfile - no auto-detection.
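
As a sketch of what manual dependency management looks like, extending comfyui_image from the snippet above (the node repo and its pip dependencies are illustrative assumptions):

$python
# Illustrative: add a custom node to the Modal image and resolve its
# dependencies by hand - Modal does not auto-detect node requirements.
comfyui_image = comfyui_image.run_commands(
    "git clone https://github.com/cubiq/ComfyUI_IPAdapter_plus"
    " /app/custom_nodes/ComfyUI_IPAdapter_plus",
    "pip install insightface onnxruntime",  # deps this node needs (assumed)
)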

fal.ai

Best for: teams already using fal.ai's model marketplace who want to add custom ComfyUI workflows.

fal.ai offers four GPU tiers for custom deployments (pricing verified from fal.ai/pricing, May 2026):

  • A100 40 GB: $0.99/hr - $0.000275/sec
  • H100 80 GB: $1.89/hr - $0.000525/sec
  • H200 141 GB: $2.10/hr - $0.000583/sec
  • B200 184 GB: contact sales

No T4, L4, A10G, or A100 80GB tier is offered for custom deployments. If your workflow requires less than 40 GB VRAM, fal.ai is not the most cost-effective option - RunPod and Modal both offer smaller GPU tiers.

fal.ai supports ComfyUI workflows via POST https://fal.run/fal-ai/comfy-server. Send your ComfyUI API JSON as the request body. Model weights are stored in persistent /data storage (not baked into the Docker image), which reduces image pull latency on cold start.

$python
import fal_client

# Submit ComfyUI workflow to fal.ai
result = fal_client.submit(
    "fal-ai/comfy-server",
    arguments={
        "workflow": { /* ComfyUI API JSON */ },
        "keep_alive": 300,  # keep worker warm for 5 minutes after completion
    }
)

The keep_alive parameter (default 0) is the main cold start mitigation. After a job completes, the worker stays live for the configured number of seconds, so subsequent requests hit a warm worker. Without it, every request may cold start.

Cold start (official fal.ai docs): the first cold start for a ComfyUI deployment on fal.ai takes 5–10 minutes - the runner downloads model weights and installs custom nodes at container boot. This is not a typo. fal.ai's ComfyUI runner does not use pre-warmed memory snapshots. From their documentation: "Runner stuck in SETUP for several minutes - Normal on first cold start while weights download (5–10 min)." Subsequent runs on a warm instance execute in seconds.

For low-traffic batch deployments, the economics of 5–10 minute cold starts may be acceptable. For user-facing applications, keep_alive=300 or higher is essential - requests within the warm window skip the cold start entirely.

Custom nodes: support for arbitrary custom nodes when deploying your own ComfyUI to fal is not confirmed in official documentation as of May 2026. Verify before building a workflow that depends on non-standard nodes.

Replicate

Best for: teams that need popular pre-installed ComfyUI nodes without Docker image management.

Replicate bills per-second from initialization through completion. L40S: $0.000975/sec. A100 80GB: $0.001400/sec. H100: $0.001525/sec. Note: Replicate was acquired by Cloudflare in early 2026 and operates as an independent brand with unchanged APIs.

The fastest path to running a ComfyUI workflow on Replicate is fofr/any-comfyui-workflow - send your workflow JSON directly. Popular nodes are pre-installed: IPAdapter Plus, ControlNet Aux, AnimateDiff Evolved, VideoHelperSuite, Efficiency Nodes, and others.

For custom nodes not in the pre-installed list, provide a custom_nodes.json file specifying the repo URL and commit hash. Replicate rebuilds the container with those nodes added. This is a one-time cost per node version.
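
Based on that description, a hypothetical custom_nodes.json entry might look like the following - the field names are assumptions, so check fofr/any-comfyui-workflow's README for the exact schema:

$custom_nodes.json
[
  {
    "repo": "https://github.com/kijai/ComfyUI-KJNodes",
    "commit": "a1b2c3d"
  }
]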

Cold start: 10–180 seconds depending on model size. Replicate has no built-in snapshotting mechanism. For latency-sensitive applications, keep an instance warm via periodic pings or configure a deployment with a minimum instance count.
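
A low-tech warm-keeper sketch, assuming the fofr/any-comfyui-workflow input schema (the workflow_json field name and the 240s interval are assumptions, not documented values):

$python
# Keep-warm loop: a trivial run every few minutes keeps an instance hot.
# Each ping bills a few GPU-seconds; tune the interval to the idle timeout.
import time
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN

with open("tiny_workflow_api.json") as f:  # e.g. a 64x64, 1-step workflow
    tiny_workflow = f.read()

while True:
    replicate.run(
        "fofr/any-comfyui-workflow",             # version pin omitted here
        input={"workflow_json": tiny_workflow},  # input name assumed
    )
    time.sleep(240)  # re-ping before the platform evicts the idle worker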

Pricing note: at $0.0014/sec, an A100 on Replicate costs $5.04/hr - significantly more than RunPod ($2.74/hr Flex) or Modal ($2.10/hr) for the same hardware. The premium is for the managed workflow and pre-installed node ecosystem.

Comfy Cloud (Official)

Comfy Cloud (comfy.org/cloud) is the official ComfyUI cloud, operated by the same team behind the ComfyUI open-source project.

  • GPU: Blackwell RTX Pro 6000, 96 GB VRAM - roughly 2× faster than A100 for most workloads
  • Billing: 0.266 credits/GPU-second (reduced 30% from 0.39 credits/second in January 2026)
  • Free tier: 400 credits/month, no credit card required. Standard plan ~$20/month.
  • Custom nodes: limited to nodes approved/available on the platform - no arbitrary git repos

The credit-to-dollar conversion rate is not published on their pricing page as of May 2026. Check the official blog (blog.comfy.org) for current rates.

Cold Start in Practice: Memory Snapshotting

The platforms that offer memory snapshotting - ViewComfy (one-click), Modal (code-level), RunPod (FlashBoot, built-in), ComfyDeploy (one-click) - all achieve ComfyUI cold starts in the 3–10 second range. Without snapshotting, cold starts for production ComfyUI workflows with large models range from 12 seconds (RunPod, Modal) to over 2 minutes (Replicate worst case).

If cold start latency matters for your product - anything where users wait for a response - choose a platform with snapshotting and measure your specific workflow. The delta between snapshot and no-snapshot is 5–15× depending on model size.
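
Measuring it takes little more than a timer - a minimal probe, with a placeholder endpoint URL and payload shape to adapt per platform:

$python
# Cold-start probe: fire one request at a scaled-to-zero endpoint and time
# request-to-first-result. Run it twice more to see the warm-path delta.
import json
import time
import requests

ENDPOINT_URL = "https://example.com/run"  # placeholder: your endpoint
with open("workflow_api.json") as f:
    payload = {"workflow": json.load(f)}

t0 = time.monotonic()
resp = requests.post(ENDPOINT_URL, json=payload, timeout=600)
resp.raise_for_status()
print(f"request-to-first-result: {time.monotonic() - t0:.1f}s")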

When to Use Serverless vs a Persistent Instance

Use serverless when:

  • Traffic is bursty - jobs come in batches with idle periods between
  • You don't want to manage GPU scaling or restart crashed workers
  • Total GPU time is under ~600 hours per month (serverless per-second pricing becomes expensive above this)

Use a persistent instance (VPS or dedicated GPU) when:

  • Queue is always non-empty - the GPU is busy essentially 24/7
  • You need VRAM to persist between jobs (model stays loaded, no reload penalty)
  • Cold start is completely unacceptable - always-on workers have no cold start

The crossover point for RunPod: an RTX 4090 Active worker at $0.00021/sec costs $0.756/hr - close to the Flex on-demand GPU rate for the same card. At full utilization, a persistent RunPod GPU pod ($0.44–$0.74/hr depending on contract) is cheaper than serverless. Serverless is cost-effective when utilization is below ~50%.
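
The same crossover as code - a back-of-envelope sketch using the RunPod 4090 rates above (the $0.60/hr pod price is the midpoint of the quoted range):

$python
# Break-even between RunPod Flex serverless and a persistent 4090 pod.
FLEX_PER_SEC = 0.00031   # RTX 4090 Flex, $/GPU-second (from this article)
POD_PER_HOUR = 0.60      # persistent 4090 pod, $/hr (mid of $0.44–$0.74)
HOURS_PER_MONTH = 730

pod_monthly = POD_PER_HOUR * HOURS_PER_MONTH
break_even_hours = pod_monthly / (FLEX_PER_SEC * 3600)
print(f"break-even: {break_even_hours:.0f} busy GPU-hours/month "
      f"(~{break_even_hours / HOURS_PER_MONTH:.0%} utilization)")
# ~392 hours, ~54% - consistent with the ~50% rule of thumb above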

Frequently Asked Questions

Does cold start time get billed on serverless platforms?

Yes, on most platforms. RunPod, Modal, and Replicate all bill from worker initialization (before ComfyUI is ready to accept requests) through job completion. ViewComfy bills per active GPU second from when your workflow starts executing, not from container boot. Always check the specific billing model of your platform - the difference matters for workflows with slow model loading.

Can I use custom ComfyUI nodes on serverless platforms?

It depends on the platform. RunPod and Modal fully support arbitrary custom nodes - you include them in your Docker image or install them at startup. Replicate supports popular pre-installed nodes and allows additional nodes via a rebuild process. fal.ai's support for arbitrary custom nodes when deploying ComfyUI is not confirmed in official docs. Comfy Cloud limits nodes to their approved list.

What is memory snapshotting and why does it matter?

Memory snapshotting captures the VRAM state of a GPU worker after models are loaded. When a new request arrives, instead of reloading model weights from disk (12–15+ seconds for Flux-size models), the worker restores from the snapshot (2–10 seconds). For user-facing applications, this is the difference between an acceptable wait and a broken UX. RunPod (FlashBoot), Modal, ViewComfy, and ComfyDeploy all offer this. Replicate does not.

Which platform is cheapest for high-volume production?

At high utilization (GPU busy >50% of the time), a persistent GPU instance (RunPod pod, Lambda Labs, Vast.ai) is cheaper than any serverless option. For batch workloads with predictable throughput, a RunPod Active worker with reserved pricing comes closest to persistent GPU economics while keeping the serverless management model. For burst or irregular traffic, Modal or RunPod Flex typically wins on price.

Is Comfy Cloud from the official ComfyUI team?

Yes. Comfy Cloud (comfy.org/cloud) is operated by the same team that maintains the ComfyUI open-source project (comfyanonymous and contributors). It uses Blackwell RTX Pro 6000 GPUs with 96 GB VRAM. Custom node support is limited to platform-approved nodes, unlike self-hosted deployments.

What is the difference between RunPod Serverless and a RunPod pod?

A RunPod pod is a persistent container - it runs continuously and you pay for all uptime, including idle time. RunPod Serverless is a scale-to-zero execution environment: workers spin up when a job arrives and shut down after inactivity. Serverless is cheaper for sporadic or bursty workloads (you pay only for actual GPU compute), while persistent pods are better for high-throughput continuous inference where startup overhead and per-second billing add up.

Which serverless platform has the lowest cold start for ComfyUI?

fal.ai has the lowest cold start for their supported models - pre-warmed worker pools mean sub-5-second first response for Flux Schnell. For custom ComfyUI workflows with your own nodes and models, RunPod Serverless with memory snapshotting typically achieves 8–15 seconds. Modal and RunPod Serverless without snapshots see 30–90 second cold starts for large models like Flux, which require loading 12GB+ of weights from disk.

Can I run a ComfyUI workflow with ControlNet on a serverless platform?

Yes, on platforms that support custom containers: RunPod Serverless and Modal let you bring a custom Docker image with any ComfyUI nodes installed, including ControlNet preprocessors. fal.ai and Replicate have curated model support and may not support arbitrary ControlNet configurations, and Comfy Cloud is limited to approved nodes. For full custom node workflows, RunPod Serverless or Modal are the go-to serverless options.