If you are using Replicate and your users are hitting long wait times on the first request, this article explains exactly what is happening and what you can do about it. The diagnosis and fixes differ depending on how you are using Replicate's platform.
How Replicate's Billing Model Works
Replicate charges per second of GPU compute time. Critically: you are not charged for queue time or cold start wait time. When your container is booting, loading a model, or waiting for a GPU to become available, the clock is not running.
This means cold starts are free for you on Replicate's invoice. They are not free for your users, who still sit waiting.
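You can see this split on any completed prediction by comparing its timestamps with the billed `metrics.predict_time` field. A minimal sketch, assuming a completed prediction ID you own and a `REPLICATE_API_TOKEN` environment variable:

```python
# A rough look at one prediction: time spent queued / cold starting (not
# billed) versus the compute time Replicate actually charges for.
# PREDICTION_ID is a placeholder for a completed prediction you own.
import os
from datetime import datetime

import httpx

PREDICTION_ID = "your-prediction-id"  # placeholder


def parse(ts: str) -> datetime:
    # Replicate timestamps are ISO 8601 with a trailing "Z"; second-level
    # precision is enough for this comparison.
    return datetime.fromisoformat(ts.split(".")[0].rstrip("Z") + "+00:00")


pred = httpx.get(
    f"https://api.replicate.com/v1/predictions/{PREDICTION_ID}",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
).json()

queue_wait = (parse(pred["started_at"]) - parse(pred["created_at"])).total_seconds()
wall_clock = (parse(pred["completed_at"]) - parse(pred["created_at"])).total_seconds()

print(f"queue + cold start (not billed): {queue_wait:.0f}s")
print(f"total wall-clock time:           {wall_clock:.0f}s")
print(f"billed compute (predict_time):   {pred['metrics']['predict_time']}s")
```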
GPU hardware rates on Replicate as of May 2026 (from replicate.com/pricing):
| Hardware | $/sec | $/hr equivalent |
|---|---|---|
| Nvidia T4 | $0.000225 | $0.81 |
| Nvidia L40S | $0.000975 | $3.51 |
| Nvidia A100 80GB | $0.001400 | $5.04 |
| Nvidia H100 | $0.001525 | $5.49 |
| 2× A100 80GB | $0.002800 | $10.08 |
Two Very Different Experiences on Replicate
Whether you hit cold starts depends almost entirely on which type of Replicate resource you are using.
Public Model Predictions - Near-Zero Cold Start
Replicate's public model library includes thousands of models (SDXL, Flux Schnell, Flux Dev, Flux 1.1 Pro, and many others). Because these models are used by many customers simultaneously, Replicate keeps them warm. Cold start for a popular public model is effectively zero in the vast majority of cases.
If your workflow fits a standard public model without customization, this is the easiest path to a production-grade latency profile on Replicate.
Custom and Private Model Deployments - 30–120 Second Cold Start
If you package your own Docker container - whether it is a custom ComfyUI setup, a fine-tuned model, a multi-LoRA workflow, or a custom inference server - Replicate scales it to zero when idle. The next request after an idle period triggers a cold start.
Replicate's documentation describes this as 'several minutes' for large models. Community reports and developer forum posts consistently put custom deployment cold starts in the 30–120 second range for typical SDXL or Flux-based setups.
There is no published SLA for cold start time on custom deployments.
Fine-Tuned Models via Replicate Trainings - Sub-1 Second
Replicate significantly improved cold start times for models created through their fine-tuning product (Replicate Trainings) in 2024. Their blog post on the subject reported sub-1-second cold starts for fine-tuned models, achieved through better model state caching. If your use case is fine-tuning rather than custom inference containers, this path avoids the 30–120 second problem.
Fix 1: Use Deployments with Minimum Instances
Replicate's Deployments product (distinct from one-off Predictions) allows you to configure minimum and maximum instance counts. Setting min-instances to 1 keeps at least one warm worker ready at all times, so the first request after an idle period never waits on a cold start.
The cost: you pay for that instance continuously, even with zero traffic. At Replicate's per-second rates:
| Hardware | $/sec | $/hr | $/day (always-on) |
|---|---|---|---|
| T4 | $0.000225 | $0.81 | $19.44 |
| L40S | $0.000975 | $3.51 | $84.24 |
| A100 80GB | $0.001400 | $5.04 | $120.96 |
| H100 | $0.001525 | $5.49 | $131.76 |
For most image generation workloads, the T4 is enough to serve interactive requests. At $19.44/day, this is only worth it if your product generates enough revenue to justify the fixed infrastructure cost - roughly $600/month for one warm T4 on Replicate.
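The daily and monthly figures come straight from the per-second rates; a small sketch reproducing the arithmetic:

```python
# Always-on cost per warm worker, derived from Replicate's per-second rates.
RATES_PER_SEC = {
    "T4": 0.000225,
    "L40S": 0.000975,
    "A100 80GB": 0.001400,
    "H100": 0.001525,
}

for gpu, per_sec in RATES_PER_SEC.items():
    per_hour = per_sec * 3600
    per_day = per_hour * 24
    per_month = per_day * 30  # approximate 30-day month
    print(f"{gpu:10s} ${per_hour:.2f}/hr  ${per_day:.2f}/day  ~${per_month:,.0f}/mo")
```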
You can set min-instances from the deployment's settings in the dashboard or via the REST API, as shown below. You can also configure a scale-down delay so workers stay warm for a period after the last request before scaling to zero.
```python
# Configure minimum instances on an existing deployment.
# (Replicate Deployments API - the deployment must already exist.)
# Dashboard: Deployment settings → Minimum instances → 1
# Or via the REST API:
import os

import httpx

OWNER, NAME = "your-username", "your-deployment"  # placeholders

response = httpx.patch(
    f"https://api.replicate.com/v1/deployments/{OWNER}/{NAME}",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={
        "min_instances": 1,
        "max_instances": 5,
    },
)
response.raise_for_status()
```
Fix 2: Choose Public Models Where Possible
If your use case can be served by a public Replicate model without customization, you get warm infrastructure for free. Replicate's public SDXL model generates 1024×1024 images in roughly 3–5 seconds on L40S hardware with no cold start on the majority of requests.
Flux Schnell (public, $0.003/image), Flux Dev (public, $0.025/image), and SDXL (public, ~$0.0043/run) all benefit from this.
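Calling one of these public models is a single request against Replicate's shared, already-warm pool. A minimal sketch using the official replicate Python client, assuming `REPLICATE_API_TOKEN` is set in the environment:

```python
# Generate an image with the public Flux Schnell model - no deployment,
# no warm instances to manage. Requires: pip install replicate.
import replicate

output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "a watercolor painting of a lighthouse at dusk"},
)
print(output)  # typically one or more output file URLs/objects
```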
If you need custom ControlNet configurations, multi-LoRA blending, or specific ComfyUI node setups, public models will not cover your requirements. In that case, proceed to Fix 1 or evaluate alternatives.
Fix 3: Reduce Model Load Time
The largest component of cold start time for large models is loading weights from storage into VRAM. If your Docker container fetches model weights from an external URL or S3 bucket on every cold start, you are adding network transfer time on top of disk I/O.
Baking model weights directly into the Docker image - rather than downloading at startup - reduces this step to local disk I/O. Replicate caches Docker image layers between deployments, so a baked-weight image cold-starts faster than a download-on-boot image.
This requires the Docker image to be larger (and slower to push initially), but meaningfully reduces runtime cold start times for large models.
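With Cog (the packaging tool Replicate uses for custom models), the pattern looks roughly like this: copy the weights into the image at build time and point `setup()` at the local path instead of a download URL. A sketch, assuming an SDXL pipeline and a `/src/weights/sdxl` directory populated during the image build:

```python
# predict.py - load weights from a path baked into the Docker image at build
# time, so cold start pays only local disk I/O, not a network download.
import torch
from cog import BasePredictor, Input, Path
from diffusers import StableDiffusionXLPipeline

WEIGHTS_DIR = "/src/weights/sdxl"  # assumed: populated during the image build


class Predictor(BasePredictor):
    def setup(self):
        # Runs once per cold start; reading from local disk avoids the
        # external download that would otherwise dominate boot time.
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            WEIGHTS_DIR, torch_dtype=torch.float16
        ).to("cuda")

    def predict(self, prompt: str = Input(description="Text prompt")) -> Path:
        image = self.pipe(prompt).images[0]
        out_path = "/tmp/output.png"
        image.save(out_path)
        return Path(out_path)
```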
Fix 4: Periodic Warm Pings
A simple approach: send a lightweight prediction (or health check request) to your deployment on a schedule - every 5–10 minutes - to prevent scale-down. This keeps workers warm without paying for permanent minimum instances.
The downside: if your scheduler misses a cycle, or if traffic is uneven enough to exhaust your warm worker pool, you still hit cold starts. This works for low-traffic products with predictable usage patterns; it is not reliable for production APIs with variable traffic.
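A minimal version of this is a script run from cron every few minutes that sends a trivial prediction to the deployment's predictions endpoint. A sketch, where the deployment name and the cheap warm-up input are placeholders you would adapt to your model:

```python
# warm_ping.py - run every 5-10 minutes from cron to keep the deployment's
# worker from scaling to zero. OWNER/NAME and the input below are placeholders;
# use the cheapest input your model accepts.
import os

import httpx

OWNER, NAME = "your-username", "your-deployment"  # placeholders

resp = httpx.post(
    f"https://api.replicate.com/v1/deployments/{OWNER}/{NAME}/predictions",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"input": {"prompt": "warm-up", "num_inference_steps": 1}},
)
resp.raise_for_status()
print("ping queued:", resp.json()["id"])
```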
When to Consider Alternatives
Fixing cold starts on Replicate with minimum instances works, but at a cost that is not trivial. If you are hitting cold starts with a custom deployment and the always-warm cost is significant relative to your revenue, it is worth comparing with alternatives.
| Option | Cold start (custom model) | Monthly cost (1 warm worker) | Trade-off |
|---|---|---|---|
| Replicate min-instances (T4) | ~0 sec | ~$583/mo | Simplest fix; you pay for the idle worker around the clock |
| Replicate min-instances (A100) | ~0 sec | ~$3,629/mo | High cost for large model needs |
| fal.ai real-time endpoint | Sub-100ms (claimed) | Idle compute at fal.ai rates ($0.99–$1.89/hr GPU) | Good for interactive; provider-claimed numbers only |
| Modal (with memory snapshots) | 2–5 sec | Per-second billing; no idle on scale-to-zero | Best cold start/cost tradeoff for custom containers |
| RunPod Active workers (RTX 3090) | Sub-200ms (48% of reqs) | ~$0.00013/sec ≈ $340/mo always-on | More control, steeper setup curve |
| Runflow (ComfyUI workflows) | Managed | Contact for pricing | Specific to ComfyUI workflow-as-API use cases |
The right choice depends on your model size, traffic pattern, and how much you value operational simplicity versus cost control. Replicate's warm instances are the simplest fix; Modal's snapshots are the best technical solution for custom containers; RunPod's Active tier is the most cost-efficient for sustained traffic.
Summary: Decision Framework
If you are using public Replicate models: cold starts are unlikely to be your problem. Focus on optimizing generation time instead.
If you are using custom Replicate deployments with low traffic and occasional cold starts: Fix 4 (periodic pings) is the lowest-effort solution. Accept occasional misses.
If you are using custom deployments with a user-facing product and zero tolerance for cold starts: Fix 1 (minimum instances). Budget $600–$3,600/month per warm worker depending on hardware.
If your cold start fix cost exceeds your current Replicate invoice: evaluate alternatives. At that point, platforms with better cold start economics for custom containers (Modal, fal.ai real-time) may be the better call.