Your API averages 3 seconds per image. Your p99 is 127 seconds. That gap is not a bug in your code - it is a GPU cold start, and it is silently breaking the experience for a meaningful percentage of your users.
Cold starts in serverless GPU infrastructure are a specific and well-understood problem. This article explains exactly what causes them, how each major provider handles them, and what strategies actually reduce them. All timing data below is from provider documentation and publicly reported figures as of May 2026.
What Is a GPU Cold Start
When a serverless GPU platform scales to zero to save cost, your container (or worker process) is stopped completely. The next request must boot the container from scratch, initialize CUDA, and load the model weights from storage into GPU VRAM before generating a single output.
These three phases stack:
1. Container boot: pulling the Docker image layers, starting the process - typically 1–30 seconds depending on image size and platform caching.
2. CUDA initialization: setting up the CUDA context, allocating GPU memory, loading driver state - typically 1–5 seconds.
3. Model load: reading model weights from disk or object storage into VRAM. SDXL is ~6.5GB, Flux Dev is ~23GB, Flux Schnell is ~16GB. On fast NVMe storage, this is 5–20 seconds. On network storage (S3, GCS), it can be 30–120 seconds depending on bandwidth.
The sum is what your user waits for. For small models on fast storage on a well-optimized platform, a cold start can be 5–10 seconds. For large models with network storage, it can exceed two minutes.
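To make the phases concrete, here is a minimal sketch of how you might time the CUDA init and model load steps inside a worker. The PyTorch/diffusers stack and the SDXL model ID are illustrative assumptions; phase 1 (container boot) happens before any of this code can run.

```python
# Hypothetical timing harness for phases 2 and 3, assuming a PyTorch + diffusers
# worker loading SDXL (~6.5GB of weights). Numbers vary with storage and GPU.
import time

import torch
from diffusers import StableDiffusionXLPipeline

timings = {}

t0 = time.perf_counter()
torch.cuda.init()                      # phase 2: create the CUDA context
torch.zeros(1, device="cuda")          # force the context to materialize
timings["cuda_init_s"] = time.perf_counter() - t0

t0 = time.perf_counter()
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")                           # phase 3: weights from disk into VRAM
timings["model_load_s"] = time.perf_counter() - t0

print(timings)                         # phase 1 (container boot) is invisible from here
```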
Cold Start Times by Provider
The following figures are drawn from provider documentation, developer blogs, and provider-reported benchmark data. No independent third-party benchmark with controlled methodology was found for 2025–2026. Provider-reported numbers reflect best-case conditions and should be treated as minimums.
Replicate
Replicate has two distinct operating modes that behave very differently:
Public models (SDXL, Flux Schnell, Flux Dev, and other models in the public library): Replicate keeps these warm because of their high usage volume, so cold start is effectively zero for the overwhelming majority of requests.
Custom and private model deployments: your own Docker container, or a fine-tuned model deployed privately. These scale to zero when idle. Cold start times of 30–120 seconds are commonly reported, with the Replicate documentation describing them as 'several minutes' for large models. Replicate does not publish a specific SLA for cold start time.
Fine-tuned models via Replicate Trainings (their fine-tuning product): cold starts improved significantly in 2024, dropping to under one second according to Replicate's own blog post on the subject.
Important: Replicate does not charge for cold start wait time. You are billed only for active GPU compute seconds. The cost is paid by your users in wait time, not by you on your invoice.
fal.ai
fal.ai advertises 5–10 second cold starts for standard inference endpoints, and sub-100ms for their real-time inference endpoints (which keep models pre-warmed via persistent connections). These are provider-claimed figures; no independent benchmark was found to verify them.
The real-time endpoint model is architecturally different from scale-to-zero: it keeps a model warm at the cost of idle compute time, similar to setting minimum instances on other platforms. This trades cold start latency for baseline cost.
Modal
Modal's container boot time is approximately 1 second. Full cold start (boot + CUDA init + model load) is 2–45 seconds depending on model size. Modal launched GPU memory snapshots in January 2025, which preserve the CUDA context and model state between invocations. With snapshots enabled, previously slow models cold start in 2–5 seconds.
RunPod Serverless
RunPod reports that 48% of cold starts on their serverless platform complete in under 200ms. This likely reflects cases where workers are pre-warmed or the model is already cached. The remaining 52% includes longer starts, but RunPod does not publish a distribution for those cases.
| Provider | Public/warm models | Custom deployments | Best-case mitigation | Source |
|---|---|---|---|---|
| Replicate | ~0 sec (always warm) | 30–120 sec (up to 'several minutes') | Set min-instances ≥ 1 in Deployments | Replicate docs |
| fal.ai | Sub-100ms (real-time) | 5–10 sec (claimed) | Real-time endpoints (pre-warmed) | fal.ai docs |
| Modal | ~1 sec (boot) | 2–45 sec full cold | GPU memory snapshots → 2–5 sec | Modal docs, Jan 2025 |
| RunPod serverless | ~200ms (48% of reqs) | Not published | Active workers (always-on tier) | RunPod benchmark report |
The UX Math
Cold starts matter differently depending on your product. Here is a concrete way to think about it:
If 5% of your requests hit a cold start and that cold start is 120 seconds, your p95 response time is roughly 120 seconds - regardless of how fast your generation runs after warmup. For a user-facing product where people are waiting for a result, this is a hard failure. Users will not wait two minutes for an image.
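A quick back-of-the-envelope calculation shows why averages hide this. The daily request volume below is an assumption for illustration; the 3-second and 120-second figures are the ones from the example above.

```python
# Pure arithmetic - no provider APIs. 5% of requests hit a 120 s cold start,
# the rest finish in ~3 s; requests_per_day is an illustrative assumption.
requests_per_day = 10_000
cold_rate = 0.05
warm_s, cold_s = 3.0, 120.0

mean_latency = (1 - cold_rate) * warm_s + cold_rate * cold_s
slow_requests = requests_per_day * cold_rate

print(f"mean latency: {mean_latency:.2f} s")           # ~8.85 s - the dashboard looks healthy
print(f"two-minute waits per day: {slow_requests:.0f}")  # 500 requests stuck behind a cold start
```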
For batch processing or async jobs where users are not waiting in real time, cold starts are nearly irrelevant. A 120-second cold start on a job that was going to complete in 10 minutes anyway is not meaningful.
The decision about how aggressively to eliminate cold starts should track with whether your product is interactive or asynchronous.
Strategies to Reduce Cold Starts
1. Minimum Instances (Keep Models Warm)
Every major platform (Replicate Deployments, fal.ai, Modal, RunPod) allows you to configure a minimum number of always-running workers. One warm worker absorbs the first request at any given moment; traffic bursts beyond your minimum can still hit cold starts. Cost: you pay for idle GPU time 24/7, even when no requests are coming in.
On Replicate, keeping one A100 warm costs $0.001400/sec × 3600 = $5.04 per hour, or roughly $121/day. For a T4, it is $0.81/hour. Choose the smallest GPU that still meets your generation-time requirements.
On RunPod serverless, the 'Active' pricing tier ($0.00013/sec for RTX 3090) is designed for always-on workers - about 30% cheaper than the Flex (scale-to-zero) tier.
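As a concrete example of the minimum-instances lever, here is a sketch of raising min_instances on a Replicate deployment via their HTTP API. The deployment name is hypothetical, and the endpoint and field names reflect my reading of Replicate's deployments docs - verify them against the current API reference before relying on this.

```python
# Sketch: keep one worker always warm on a Replicate deployment.
# Deployment name is a placeholder; endpoint/fields are assumptions to verify.
import os

import requests

resp = requests.patch(
    "https://api.replicate.com/v1/deployments/your-org/your-deployment",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"min_instances": 1, "max_instances": 3},
)
resp.raise_for_status()
print(resp.json())
```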
2. Use Pre-Warmed Public Models
If your use case fits a public model (SDXL, Flux Schnell, Flux Dev on Replicate; Flux Kontext Pro or similar on fal.ai), you get near-zero cold starts without paying for minimum instances. The trade-off is that you cannot customize the environment, model weights, or custom nodes.
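For example, a single call to an always-warm public model on Replicate looks like this. It assumes the `replicate` Python client with `REPLICATE_API_TOKEN` set and uses the public Flux Schnell slug; substitute whichever public model fits your product.

```python
# Minimal call to a pre-warmed public model on Replicate - no deployment,
# no min-instances bill, effectively no cold start.
import replicate  # requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "a lighthouse at dusk, watercolor"},
)
print(output)  # URLs (or file outputs) for the generated images
```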
3. GPU Memory Snapshots (Modal)
Modal's snapshot feature captures the full CUDA context and loaded model state, so subsequent cold starts skip the CUDA init and model load phases entirely. This requires your application to be compatible with Modal's snapshot API but can reduce cold starts from 30+ seconds to 2–5 seconds for large models.
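A sketch of the snapshot pattern on Modal is below: the model is loaded in an enter hook flagged for snapshotting, so later cold starts restore that state instead of repeating CUDA init and the weight load. The GPU type, image contents, and model are illustrative, and the exact option that extends snapshots to GPU/CUDA state has evolved, so check Modal's current docs before relying on the flags shown.

```python
# Sketch of Modal's snapshot pattern. enable_memory_snapshot covers the base
# case; treat the details of GPU-state snapshotting as assumptions and confirm
# against Modal's current documentation.
import modal

app = modal.App("image-gen")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")


@app.cls(gpu="A100", image=image, enable_memory_snapshot=True)
class Generator:
    @modal.enter(snap=True)
    def load(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

        # Runs once before the snapshot is taken; later cold starts restore
        # this state instead of re-running CUDA init and the weight load.
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str):
        return self.pipe(prompt).images[0]
```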
4. Reduce Model Load Time
Storing model weights on fast local NVMe storage, rather than fetching them from S3 on every cold start, is the single biggest lever for custom deployments. Platforms that cache weights locally between invocations perform significantly better than those that re-download them each time.
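If you control the worker code, the pattern is simple: check a local cache path before reaching for object storage. The bucket, key, and cache directory below are hypothetical placeholders.

```python
# Generic "cache weights on local NVMe" pattern for a custom worker: download
# from object storage only on the first cold start on a node, then reuse the
# local copy. Bucket, key, and cache path are hypothetical.
from pathlib import Path

import boto3

CACHE_DIR = Path("/nvme/models")                 # assumed fast local disk on the worker
BUCKET, KEY = "my-model-bucket", "flux-dev/flux1-dev.safetensors"


def ensure_weights() -> Path:
    local_path = CACHE_DIR / KEY
    if not local_path.exists():                  # pay the network cost once per node
        local_path.parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(BUCKET, KEY, str(local_path))
    return local_path
```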
5. Multi-Cloud Routing
If you serve multiple models, or need to absorb occasional cold starts without exposing them to users, routing requests across providers - sending traffic to whichever instance is already warm - can reduce effective cold start exposure. This requires maintaining active connections to multiple providers and routing logic in your application layer. Some managed platforms handle this transparently; Runflow is one example in the ComfyUI-as-API space.
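A toy version of warm-first routing might look like the following. The provider names, the warm-window heuristic, and the submit function are all hypothetical application-level code, not any platform's API.

```python
# Toy warm-first router: prefer whichever provider served a request recently,
# on the assumption its workers are still warm. Everything here is illustrative.
import time

WARM_WINDOW_S = 300   # assume workers stay warm ~5 minutes after a request
last_served: dict[str, float] = {}


def pick_provider(providers: list[str]) -> str:
    now = time.monotonic()
    warm = [p for p in providers if now - last_served.get(p, 0) < WARM_WINDOW_S]
    return (warm or providers)[0]


def submit(prompt: str) -> str:
    provider = pick_provider(["replicate", "fal", "runpod"])
    last_served[provider] = time.monotonic()
    # ... call that provider's API here; each one needs its own client code.
    return provider
```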
When Cold Starts Don't Matter
Not every workload needs sub-second inference. Cold starts are irrelevant for:
Batch processing: generating thousands of images overnight for a catalog refresh. A 2-minute cold start on a 4-hour job is noise.
Async user-initiated jobs with email or webhook delivery: if users expect to wait 10–30 minutes for a result anyway, a cold start does not change their experience.
Development and testing: cold starts in dev environments are not worth optimizing.
If your product falls into one of these categories, optimize for cost (scale-to-zero, Flex tier) rather than latency.
When Cold Starts Kill the Product
Cold starts are a critical problem for:
Real-time or near-real-time image generation (photo editor, live preview, instant results promised to users).
Products where the generation is part of an interactive flow and users are watching a spinner.
APIs where downstream clients have hard timeouts below 30 seconds.
If you are building in this category, budget for minimum instances or choose a platform with pre-warmed infrastructure as a first-class feature.
How Cold Starts Compare Across Use Cases
Not all serverless GPU workloads experience cold starts the same way. Public model inference on platforms like Replicate eliminates cold starts entirely by keeping popular models warm. Custom containers and fine-tuned models are most affected. The impact also varies by generation time: a 30-second cold start on a 2-second generation request is catastrophic; the same cold start on a 5-minute video generation job is barely noticeable.
Batch jobs that submit hundreds of requests at once typically hit one cold start (the first) then run warm for the remaining requests. Single-request workloads are the worst case: every request is potentially the first after an idle period.
Choosing the Right Platform for Your Latency Requirements
Match your platform choice to your latency tolerance. If your product requires consistent sub-5-second responses, you need either a pre-warmed model (Replicate public models, fal.ai real-time endpoints) or a platform where you pay for always-on workers. If you can tolerate occasional slow responses (async jobs, background processing), scale-to-zero platforms with no minimum instances are the most cost-efficient.
RunPod's Active worker tier ($0.00013/sec for RTX 3090) is the most cost-efficient always-warm option for custom workloads. Modal's GPU memory snapshots are the best technical solution for reducing cold start time without paying for permanent warm workers. fal.ai's real-time endpoints suit interactive products best. Replicate is optimal for teams using public models who do not need custom containers.
Want to know which models run on your GPU? Try our GPU Matcher to instantly see all compatible models with optimal quantization and memory requirements.