GPU Cold Starts: The 120-Second UX Killer

Cold start times documented across Replicate, fal.ai, Modal, and RunPod. What causes serverless GPU cold starts, which providers handle them best, and how to fix them in production.

Published 2026-05-12 · Tags: gpu cold start, replicate cold start, serverless gpu cold start

Your API averages 3 seconds per image. Your p99 is 127 seconds. That gap is not a bug in your code - it is a GPU cold start, and it is silently breaking the experience for a meaningful percentage of your users.

Cold starts in serverless GPU infrastructure are a specific and well-understood problem. This article explains exactly what causes them, how each major provider handles them, and what strategies actually reduce them. All timing data below is from provider documentation and publicly reported figures as of May 2026.

What Is a GPU Cold Start

When a serverless GPU platform scales to zero to save cost, your container (or worker process) is stopped completely. The next request must boot the container from scratch, initialize CUDA, and load the model weights from storage into GPU VRAM before generating a single output.

These three phases stack:

1. Container boot: pulling the Docker image layers, starting the process - typically 1–30 seconds depending on image size and platform caching.

2. CUDA initialization: setting up the CUDA context, allocating GPU memory, loading driver state - typically 1–5 seconds.

3. Model load: reading model weights from disk or object storage into VRAM. SDXL is ~6.5GB, Flux Dev is ~23GB, Flux Schnell is ~16GB. On fast NVMe storage, this is 5–20 seconds. On network storage (S3, GCS), it can be 30–120 seconds depending on bandwidth.

The sum is what your user waits for. For small models on fast storage on a well-optimized platform, a cold start can be 5–10 seconds. For large models with network storage, it can exceed two minutes.
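The phase arithmetic above can be sketched directly. The boot, CUDA-init, and bandwidth figures below are illustrative assumptions drawn from the ranges in this article, not measurements of any specific platform:

```python
# Rough cold start estimator: boot + CUDA init + model load.
# All defaults are illustrative assumptions; substitute your own measurements.

def cold_start_seconds(model_gb: float, storage_gbps: float,
                       boot_s: float = 10.0, cuda_init_s: float = 3.0) -> float:
    """Estimate total cold start time in seconds.

    model_gb:      model weight size in gigabytes
    storage_gbps:  effective read bandwidth in GB/s (local NVMe ~2.0, S3 ~0.2)
    """
    load_s = model_gb / storage_gbps
    return boot_s + cuda_init_s + load_s

# SDXL (~6.5GB) from local NVMe vs. Flux Dev (~23GB) from object storage
print(cold_start_seconds(6.5, 2.0))    # load phase is a few seconds
print(cold_start_seconds(23.0, 0.2))   # load phase dominates everything else
```

With these assumptions, the same boot and CUDA phases turn into a ~16-second cold start on NVMe and a ~128-second one over network storage: the load phase is the variable that matters.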

30–120 sec: cold start range for custom Replicate deployments (Replicate documentation, 2026)

Cold Start Times by Provider

The following figures are drawn from provider documentation, developer blogs, and provider-reported benchmark data. No independent third-party benchmark with controlled methodology was found for 2025–2026. Provider-reported numbers reflect best-case conditions and should be treated as minimums.

Replicate

Replicate has two distinct operating modes that behave very differently:

Public models (SDXL, Flux Schnell, Flux Dev, etc. from the public model library): these models are kept warm by Replicate because of their high usage volume. Cold start is effectively zero for the overwhelming majority of requests.

Custom and private model deployments: your own Docker container, or a fine-tuned model deployed privately. These scale to zero when idle. Cold start times of 30–120 seconds are commonly reported, with the Replicate documentation describing them as 'several minutes' for large models. Replicate does not publish a specific SLA for cold start time.

Fine-tuned models via Replicate Trainings (their fine-tuning product): improved significantly in 2024 to sub-1-second cold starts, according to Replicate's own blog post on the subject.

Important: Replicate does not charge for cold start wait time. You are billed only for active GPU compute seconds. The cost is paid by your users in wait time, not by you on your invoice.

fal.ai

fal.ai advertises 5–10 second cold starts for standard inference endpoints, and sub-100ms for their real-time inference endpoints (which keep models pre-warmed via persistent connections). These are provider-claimed figures; no independent benchmark was found to verify them.

The real-time endpoint model is architecturally different from scale-to-zero: it keeps a model warm at the cost of idle compute time, similar to setting minimum instances on other platforms. This trades cold start latency for baseline cost.

Modal

Modal's container boot time is approximately 1 second. Full cold start (boot + CUDA init + model load) is 2–45 seconds depending on model size. Modal launched GPU memory snapshots in January 2025, which preserve the CUDA context and model state between invocations. With snapshots enabled, previously slow models cold start in 2–5 seconds.

RunPod Serverless

RunPod reports that 48% of cold starts on their serverless platform complete in under 200ms. This likely reflects cases where workers are pre-warmed or the model is already cached. The remaining 52% includes longer starts, but RunPod does not publish a distribution for those cases.

Cold start times by provider - May 2026
| Provider | Public/warm models | Custom deployments | Best-case reduction | Source |
|---|---|---|---|---|
| Replicate | ~0 sec (always warm) | 30–120 sec (up to 'several minutes') | Set min-instances ≥ 1 in Deployments | Replicate docs |
| fal.ai | Sub-100ms (real-time) | 5–10 sec (claimed) | Real-time endpoints (pre-warmed) | fal.ai docs |
| Modal | ~1 sec (boot) | 2–45 sec full cold | GPU memory snapshots → 2–5 sec | Modal docs, Jan 2025 |
| RunPod serverless | ~200ms (48% of reqs) | Not published | Active workers (always-on tier) | RunPod benchmark report |

The UX Math

Cold starts matter differently depending on your product. Here is a concrete way to think about it:

If 5% of your requests hit a cold start and that cold start is 120 seconds, your p95 response time is 120 seconds - regardless of how fast your generation runs after warmup. For a user-facing product where people are waiting for a result, this is a hard failure. Users will not wait two minutes for an image.
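This tail-latency effect is easy to verify numerically. The warm time, cold time, and cold-start rate below are the hypothetical figures from the paragraph above:

```python
import random

# Simulate a request mix where a fraction of requests hit a cold start.
# warm_s, cold_s, and cold_rate are illustrative, not provider measurements.

def percentile(samples, p):
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

random.seed(0)
warm_s, cold_s, cold_rate = 3.0, 120.0, 0.05
latencies = [cold_s if random.random() < cold_rate else warm_s
             for _ in range(100_000)]

print("p50:", percentile(latencies, 50))   # the warm path: 3.0
print("p99:", percentile(latencies, 99))   # dominated by cold starts: 120.0
```

The median never moves, which is why averages and p50 dashboards hide this problem entirely; only tail percentiles expose it.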

For batch processing or async jobs where users are not waiting in real time, cold starts are nearly irrelevant. A 120-second cold start on a job that was going to complete in 10 minutes anyway is not meaningful.

The decision about how aggressively to eliminate cold starts should track with whether your product is interactive or asynchronous.

Strategies to Reduce Cold Starts

1. Minimum Instances (Keep Models Warm)

Every major platform (Replicate Deployments, fal.ai, Modal, RunPod) allows you to configure a minimum number of always-running workers. One warm worker eliminates cold starts for the first request at any given moment. Cost: you pay for idle GPU time 24/7, even when no requests are coming in.

On Replicate, keeping one A100 warm costs $0.001400/sec × 3600 = $5.04 per hour, or roughly $121/day. For a T4, it is $0.81/hour. Choose the cheapest hardware tier that still meets your generation-time target.

On RunPod serverless, the 'Active' pricing tier ($0.00013/sec for RTX 3090) is designed for always-on workers - about 30% cheaper than the Flex (scale-to-zero) tier.
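The Flex-vs-Active trade-off reduces to a break-even utilization. The Active rate below is from this article; the Flex rate is derived from the "about 30% cheaper" figure and is an approximation, so check RunPod's pricing page for current numbers:

```python
# Break-even between RunPod's Flex (scale-to-zero) and Active (always-on)
# tiers for an RTX 3090. Flex rate is approximated as Active / 0.7 based on
# the ~30% discount mentioned above - an assumption, not a quoted price.

ACTIVE_PER_SEC = 0.00013              # billed 24/7, busy or not
FLEX_PER_SEC = ACTIVE_PER_SEC / 0.7   # billed only while processing (approx.)

def daily_cost(busy_seconds_per_day: float) -> tuple[float, float]:
    """Return (flex, active) daily cost in dollars."""
    flex = busy_seconds_per_day * FLEX_PER_SEC
    active = 86_400 * ACTIVE_PER_SEC  # flat, regardless of load
    return round(flex, 2), round(active, 2)

breakeven_hours = 86_400 * ACTIVE_PER_SEC / FLEX_PER_SEC / 3600
print(daily_cost(2 * 3600))                         # 2 busy hours/day
print(f"break-even ≈ {breakeven_hours:.1f} busy hours/day")
```

Under these assumptions, Flex stays cheaper until the worker is busy roughly 70% of the day (about 16.8 hours); past that, the always-on tier wins even before counting the cold starts it avoids.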

2. Use Pre-Warmed Public Models

If your use case fits a public model (SDXL, Flux Schnell, Flux Dev on Replicate; Flux Kontext Pro or similar on fal.ai), you get near-zero cold starts without paying for minimum instances. The trade-off is that you cannot customize the environment, model weights, or custom nodes.

3. GPU Memory Snapshots (Modal)

Modal's snapshot feature captures the full CUDA context and loaded model state, so subsequent cold starts skip the CUDA init and model load phases entirely. This requires your application to be compatible with Modal's snapshot API but can reduce cold starts from 30+ seconds to 2–5 seconds for large models.

4. Reduce Model Load Time

Storing model weights on fast local NVMe storage (rather than fetching from S3 on every cold start) is the single biggest lever for custom deployments. Platforms that cache model weights locally between invocations (rather than downloading on every cold start) perform significantly better.
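A minimal sketch of the download-once pattern: check local storage first, and only hit object storage on a true first start. The fetch_from_object_storage helper here is a hypothetical stand-in; a real deployment would use boto3, gcsfs, or hf_hub_download:

```python
from pathlib import Path

CACHE_DIR = Path("/tmp/model-cache")  # stands in for local NVMe

def fetch_from_object_storage(name: str, dest: Path) -> None:
    # Hypothetical stand-in for the slow S3/GCS download path.
    dest.write_bytes(b"weights-for-" + name.encode())

def load_weights(name: str) -> bytes:
    """Return model weights, downloading only on the first cold start."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / name
    if not local.exists():            # slow path: network download, once
        fetch_from_object_storage(name, local)
    return local.read_bytes()         # fast path: local disk read

first = load_weights("sdxl.safetensors")   # downloads and caches
second = load_weights("sdxl.safetensors")  # served from local cache
assert first == second
```

The same idea applies one level earlier in the stack: baking weights into the Docker image moves the download from cold-start time to image-build time.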

5. Multi-Cloud Routing

If you need to serve multiple models or tolerate occasional cold starts without them affecting users, routing requests across providers - sending traffic to whichever instance is already warm - can reduce effective cold start exposure. This requires maintaining active connections to multiple providers and routing logic in your application layer. Some managed platforms handle this transparently; Runflow is one example in the ComfyUI-as-API space.
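A warm-first router can be sketched in a few lines: track when each provider last served a request and prefer whichever worker is most likely still warm. The provider names and the 5-minute idle timeout are illustrative assumptions:

```python
import time

IDLE_TIMEOUT_S = 300.0  # assume workers scale to zero after ~5 idle minutes

class WarmRouter:
    """Route each request to the provider most likely to still be warm."""

    def __init__(self, providers):
        self.last_used = {p: 0.0 for p in providers}

    def pick(self, now=None):
        now = time.monotonic() if now is None else now
        # Providers used within the idle timeout are presumed still warm.
        warm = [p for p, t in self.last_used.items()
                if now - t < IDLE_TIMEOUT_S]
        pool = warm or list(self.last_used)       # all cold: pick any
        choice = max(pool, key=self.last_used.__getitem__)
        self.last_used[choice] = now
        return choice

router = WarmRouter(["replicate", "fal", "runpod"])
router.last_used["fal"] = 100.0      # fal served a request at t=100
print(router.pick(now=150.0))        # picks "fal" - most recently warm
```

Note the sticky behavior: the most recently used provider keeps winning while it stays warm, which is exactly what concentrates traffic onto warm workers.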

When Cold Starts Don't Matter

Not every workload needs sub-second inference. Cold starts are irrelevant for:

Batch processing: generating thousands of images overnight for a catalog refresh. A 2-minute cold start on a 4-hour job is noise.

Async user-initiated jobs with email or webhook delivery: if users expect to wait 10–30 minutes for a result anyway, a cold start does not change their experience.

Development and testing: cold starts in dev environments are not worth optimizing.

If your product falls into one of these categories, optimize for cost (scale-to-zero, Flex tier) rather than latency.

When Cold Starts Kill the Product

Cold starts are a critical problem for:

Real-time or near-real-time image generation (photo editor, live preview, instant results promised to users).

Products where the generation is part of an interactive flow and users are watching a spinner.

APIs where downstream clients have hard timeouts below 30 seconds.

If you are building in this category, budget for minimum instances or choose a platform with pre-warmed infrastructure as a first-class feature.

How Cold Starts Compare Across Use Cases

Not all serverless GPU workloads experience cold starts the same way. Public model inference on platforms like Replicate eliminates cold starts entirely by keeping popular models warm. Custom containers and fine-tuned models are most affected. The impact also varies by generation time: a 30-second cold start on a 2-second generation request is catastrophic; the same cold start on a 5-minute video generation job is barely noticeable.

Batch jobs that submit hundreds of requests at once typically hit one cold start (the first) then run warm for the remaining requests. Single-request workloads are the worst case: every request is potentially the first after an idle period.
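For single-request workloads, the chance that any given request is cold can be estimated from traffic rate and the platform's idle timeout. Assuming Poisson arrivals (an idealization of real traffic), the probability that the gap since the previous request exceeds the idle timeout is exp(-rate × timeout):

```python
import math

def cold_start_probability(requests_per_hour: float,
                           idle_timeout_min: float) -> float:
    """P(request arrives after the worker scaled to zero), assuming
    Poisson arrivals - an idealization, real traffic is burstier."""
    rate_per_min = requests_per_hour / 60.0
    return math.exp(-rate_per_min * idle_timeout_min)

# 10 requests/hour against a 5-minute idle timeout:
print(round(cold_start_probability(10, 5), 3))
```

At 10 requests/hour with a 5-minute timeout, roughly 43% of requests land cold under this model; bursty real-world traffic makes warm streaks longer but cold gaps likelier, so treat this as a rough lower-bound sanity check.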

Choosing the Right Platform for Your Latency Requirements

Match your platform choice to your latency tolerance. If your product requires consistent sub-5-second responses, you need either a pre-warmed model (Replicate public models, fal.ai real-time endpoints) or a platform where you pay for always-on workers. If you can tolerate occasional slow responses (async jobs, background processing), scale-to-zero platforms with no minimum instances are the most cost-efficient.

RunPod's Active worker tier ($0.00013/sec for RTX 3090) is the most cost-efficient always-warm option for custom workloads. Modal's GPU memory snapshots are the best technical solution for reducing cold start time without paying for permanent warm workers. fal.ai's real-time endpoints suit interactive products best. Replicate is optimal for teams using public models who do not need custom containers.

Want to know which models run on your GPU? Try our GPU Matcher to instantly see all compatible models with optimal quantization and memory requirements.

Frequently Asked Questions

What is a GPU cold start?

A GPU cold start is the delay that occurs when a serverless GPU worker has scaled to zero and must restart before processing a request. It includes container boot time, CUDA initialization, and model loading into VRAM - typically 5–120 seconds depending on the platform and model size.

How long are Replicate cold starts?

Public models on Replicate (SDXL, Flux, etc.) have near-zero cold starts because they are kept warm by Replicate's infrastructure. Custom or private model deployments cold start in 30–120 seconds, or up to 'several minutes' for large models, according to Replicate's documentation.

How do I eliminate GPU cold starts?

The most reliable method is to configure minimum instances (at least 1 always-warm worker) in your deployment settings. This eliminates cold starts for the first request at any moment but adds idle GPU cost. Alternatively, use pre-warmed public models or platforms with built-in warm pools.

Does fal.ai have cold starts?

fal.ai advertises 5–10 second cold starts for standard endpoints and sub-100ms for their real-time inference endpoints (which keep models pre-warmed). These are provider-claimed figures; independent benchmarks for 2026 are not publicly available.

What is the difference between Modal GPU memory snapshots and regular cold starts?

Modal's GPU memory snapshots (launched January 2025) capture the CUDA context and model weights in memory, so subsequent cold starts skip the model load phase entirely. This reduces cold starts from 30+ seconds to 2–5 seconds for large models.

How much does it cost to keep a GPU model warm on RunPod?

RunPod's Active worker tier (always-on) costs $0.00013/sec for an RTX 3090 = $0.47/hr = $11.23/day = $337/month. For an A100, Active pricing is $0.00060/sec = $2.16/hr = $51.84/day = $1,555/month. This is significantly cheaper than Replicate's equivalent for sustained always-warm deployments.

Which serverless GPU platform has the fastest cold starts?

Based on available data: fal.ai real-time endpoints claim sub-100ms (pre-warmed), Modal with GPU memory snapshots achieves 2–5 seconds for large models, and RunPod reports 48% of cold starts under 200ms. Replicate public models have near-zero cold starts; custom Replicate deployments are slowest at 30–120 seconds. These figures are provider-reported or community-reported; no controlled independent benchmark exists for 2026.

Can I reduce cold starts without paying for always-warm instances?

Yes. Modal's GPU memory snapshots reduce cold starts from 30+ seconds to 2–5 seconds without requiring persistent warm workers. Reducing model size (smaller checkpoints, quantized models) shortens load time. Baking model weights directly into the Docker image eliminates network download time on cold start. These techniques improve cold start time but do not eliminate it entirely.