
Replicate Cold Starts: What's Happening and How to Fix It

Why Replicate cold starts happen, when they affect custom vs public models, and the exact fixes. Cost breakdown for keeping models warm on Replicate versus switching providers.

Published 2026-05-12 · Tags: replicate cold start, replicate pricing, replicate alternative

If you are using Replicate and your users are hitting long wait times on the first request, this article explains exactly what is happening and what you can do about it. The diagnosis and fixes differ depending on how you are using Replicate's platform.

How Replicate's Billing Model Works

Replicate charges per second of GPU compute time. Critically: you are not charged for queue time or cold start wait time. When your container is booting, loading a model, or waiting for a GPU to become available, the clock is not running.

This means cold starts are free for you on Replicate's invoice. They are not free for your users, who still sit waiting.
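You can see this gap in the API itself: a finished prediction reports its billed compute in metrics.predict_time, while the wall-clock latency your user experienced also includes queue and boot time. A minimal sketch (the helper name is ours, not part of the Replicate client):

```python
def unbilled_seconds(wall_clock_s: float, predict_time_s: float) -> float:
    """Seconds the user waited that Replicate did not bill.

    wall_clock_s: time from request sent to output received.
    predict_time_s: `metrics.predict_time` from the finished prediction,
                    the only part Replicate charges for.
    """
    return max(0.0, wall_clock_s - predict_time_s)

# A 95 s first request where only 12.5 s was billed compute: the other
# 82.5 s was queue + cold start, free on your invoice but not for the user.
print(unbilled_seconds(95.0, 12.5))  # -> 82.5
```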

GPU hardware rates on Replicate as of May 2026 (from replicate.com/pricing):

Replicate GPU hardware rates - verified May 2026
| Hardware | $/sec | $/hr equivalent |
|---|---|---|
| Nvidia T4 | $0.000225 | $0.81 |
| Nvidia L40S | $0.000975 | $3.51 |
| Nvidia A100 80GB | $0.001400 | $5.04 |
| Nvidia H100 | $0.001525 | $5.49 |
| 2× A100 80GB | $0.002800 | $10.08 |

Two Very Different Experiences on Replicate

Whether you hit cold starts depends almost entirely on which type of Replicate resource you are using.

Public Model Predictions - Near-Zero Cold Start

Replicate's public model library includes thousands of models (SDXL, Flux Schnell, Flux Dev, Flux 1.1 Pro, and many others). Because these models are used by many customers simultaneously, Replicate keeps them warm. Cold start for a popular public model is effectively zero in the vast majority of cases.

If your workflow fits a standard public model without customization, this is the easiest path to a production-grade latency profile on Replicate.

Custom and Private Model Deployments - 30–120 Second Cold Start

If you package your own Docker container - whether it is a custom ComfyUI setup, a fine-tuned model, a multi-LoRA workflow, or a custom inference server - Replicate scales it to zero when idle. The next request after an idle period triggers a cold start.

Replicate's documentation describes this as 'several minutes' for large models. Community reports and developer forum posts consistently put custom deployment cold starts in the 30–120 second range for typical SDXL or Flux-based setups.

There is no published SLA for cold start time on custom deployments.

Fine-Tuned Models via Replicate Trainings - Sub-1 Second

Replicate significantly improved cold start times for models created through their fine-tuning product (Replicate Trainings) in 2024. Their blog post on the subject reported sub-1-second cold starts for fine-tuned models, achieved through better model state caching. If your use case is fine-tuning rather than custom inference containers, this path avoids the 30–120 second problem.

30–120 seconds: typical cold start for a custom Replicate deployment (Replicate documentation and community reports).

Fix 1: Use Deployments with Minimum Instances

Replicate's Deployments product (distinct from one-off Predictions) allows you to configure minimum and maximum instance counts. Setting min-instances to 1 means there is always at least one warm worker ready to serve requests, so the first request after any idle period hits no cold start.

The cost: you pay for that instance continuously, even with zero traffic. At Replicate's per-second rates:

Always-warm instance cost on Replicate - per day
| Hardware | $/sec | $/hr | $/day (always-on) |
|---|---|---|---|
| T4 | $0.000225 | $0.81 | $19.44 |
| L40S | $0.000975 | $3.51 | $84.24 |
| A100 80GB | $0.001400 | $5.04 | $120.96 |
| H100 | $0.001525 | $5.49 | $131.76 |
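The per-day figures above follow directly from the per-second rates. A quick sketch to reproduce them, or to price other hardware:

```python
def always_on_cost(rate_per_sec: float) -> dict:
    """Daily and ~monthly (30-day) cost of one always-warm instance."""
    per_day = rate_per_sec * 86_400  # seconds in a day
    return {"per_day": round(per_day, 2), "per_month": round(per_day * 30, 2)}

print(always_on_cost(0.000225))  # T4 -> {'per_day': 19.44, 'per_month': 583.2}
```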

For most image generation workloads, a T4 is enough to serve interactive requests. At $19.44/day, this is only worth it if your product generates enough revenue to justify the fixed infrastructure cost - roughly $583/month for one warm T4 on Replicate.

The Deployments API supports setting min-instances via the REST API or dashboard. You can also configure scale-down delay to keep workers warm for a period after the last request before scaling to zero.

warm_deployment.py

import os

import httpx

REPLICATE_API_TOKEN = os.environ["REPLICATE_API_TOKEN"]

# Set minimum instances via the dashboard
# (Deployment settings → Minimum instances → 1) or via the REST API.
# Replace {owner}/{name} with your deployment's owner and name.
response = httpx.patch(
    "https://api.replicate.com/v1/deployments/{owner}/{name}",
    headers={"Authorization": f"Bearer {REPLICATE_API_TOKEN}"},
    json={
        "min_instances": 1,
        "max_instances": 5,
    },
)
response.raise_for_status()

Fix 2: Choose Public Models Where Possible

If your use case can be served by a public Replicate model without customization, you get warm infrastructure for free. Replicate's public SDXL model generates 1024×1024 images in roughly 3–5 seconds on L40S hardware with no cold start on the majority of requests.

Flux Schnell (public, $0.003/image), Flux Dev (public, $0.025/image), and SDXL (public, ~$0.0043/run) all benefit from this.
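A call against one of these public models needs no deployment setup at all. A minimal sketch of the request shape - the num_outputs parameter is illustrative, so check the model page for the exact input schema:

```python
def flux_schnell_request(prompt: str) -> tuple:
    """Build the (model, input) pair for a public Flux Schnell prediction."""
    return (
        "black-forest-labs/flux-schnell",
        {"prompt": prompt, "num_outputs": 1},
    )

# Usage with the replicate client (reads REPLICATE_API_TOKEN from the env):
#   import replicate
#   model, inputs = flux_schnell_request("a watercolor fox, soft light")
#   output = replicate.run(model, input=inputs)
```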

If you need custom ControlNet configurations, multi-LoRA blending, or specific ComfyUI node setups, public models will not cover your requirements. In that case, proceed to Fix 1 or evaluate alternatives.

Fix 3: Reduce Model Load Time

The largest component of cold start time for large models is loading weights from storage into VRAM. If your Docker container fetches model weights from an external URL or S3 bucket on every cold start, you are adding network transfer time on top of disk I/O.

Baking model weights directly into the Docker image - rather than downloading at startup - reduces this step to local disk I/O. Replicate caches Docker image layers between deployments, so a baked-weight image cold-starts faster than a download-on-boot image.

This requires the Docker image to be larger (and slower to push initially), but meaningfully reduces runtime cold start times for large models.
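One way to structure the load step so the container prefers baked-in weights and only falls back to a download - the path and URL below are placeholders for your own layout, not Replicate conventions:

```python
import os

BAKED_WEIGHTS = "/src/weights/model.safetensors"        # copied in at docker build time
FALLBACK_URL = "https://example.com/model.safetensors"  # download-on-boot fallback

def resolve_weights(local_path: str = BAKED_WEIGHTS) -> str:
    """Return the local weights path when the image has them baked in.

    Local disk I/O is the fast path on cold start; the URL fallback adds
    network transfer on every boot.
    """
    if os.path.exists(local_path):
        return local_path
    return FALLBACK_URL
```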

Fix 4: Periodic Warm Pings

A simple approach: send a lightweight prediction (or health check request) to your deployment on a schedule - every 5–10 minutes - to prevent scale-down. This keeps workers warm without paying for permanent minimum instances.
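A sketch of the ping itself, driven by any scheduler such as cron. It posts to the deployment predictions endpoint; the deployment name and the minimal input are placeholders, so use the cheapest valid input your own model accepts:

```python
import os

def ping_request(deployment: str) -> dict:
    """Build a minimal keep-warm prediction request for a deployment."""
    return {
        "url": f"https://api.replicate.com/v1/deployments/{deployment}/predictions",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}"
        },
        # Smallest input your model accepts; 1 step keeps billed compute tiny.
        "json": {"input": {"prompt": "warm-up", "num_inference_steps": 1}},
    }

def warm_ping(deployment: str = "your-org/your-deployment") -> int:
    """Fire the keep-warm request; call this every 5-10 minutes."""
    import httpx  # same client as the deployment config example
    req = ping_request(deployment)
    resp = httpx.post(req["url"], headers=req["headers"], json=req["json"], timeout=30)
    return resp.status_code
```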

The downside: if your scheduler misses a cycle, or if traffic is uneven enough to exhaust your warm worker pool, you still hit cold starts. This works for low-traffic products with predictable usage patterns; it is not reliable for production APIs with variable traffic.

When to Consider Alternatives

Fixing cold starts on Replicate with minimum instances works, but at a cost that is not trivial. If you are hitting cold starts with a custom deployment and the always-warm cost is significant relative to your revenue, it is worth comparing with alternatives.

Cold start solutions comparison - May 2026
| Option | Cold start (custom model) | Monthly cost (1 warm worker) | Trade-off |
|---|---|---|---|
| Replicate min-instances (T4) | ~0 sec | ~$583/mo | Per-second billing; you pay for idle unless you scale down manually |
| Replicate min-instances (A100) | ~0 sec | ~$3,629/mo | High cost for large model needs |
| fal.ai real-time endpoint | Sub-100 ms (claimed) | Idle compute at fal.ai rates ($0.99–$1.89/hr GPU) | Good for interactive use; provider-claimed numbers only |
| Modal (with memory snapshots) | 2–5 sec | Per-second billing; no idle cost on scale-to-zero | Best cold start/cost trade-off for custom containers |
| RunPod Active workers (RTX 3090) | Sub-200 ms (48% of requests) | ~$0.00013/sec ≈ $340/mo | More control, steeper setup curve |
| Runflow (ComfyUI workflows) | Managed | Contact for pricing | Specific to ComfyUI workflow-as-API use cases |

The right choice depends on your model size, traffic pattern, and how much you value operational simplicity versus cost control. Replicate's warm instances are the simplest fix; Modal's snapshots are the best technical solution for custom containers; RunPod's Active tier is the most cost-efficient for sustained traffic.

Summary: Decision Framework

If you are using public Replicate models: cold starts are unlikely to be your problem. Focus on optimizing generation time instead.

If you are using custom Replicate deployments with low traffic and occasional cold starts: Fix 4 (periodic pings) is the lowest-effort solution. Accept occasional misses.

If you are using custom deployments with a user-facing product and zero tolerance for cold starts: Fix 1 (minimum instances). Budget roughly $583–$3,629/month per warm worker depending on hardware.

If your cold start fix cost exceeds your current Replicate invoice: evaluate alternatives. At that point, platforms with better cold start economics for custom containers (Modal, fal.ai real-time) may be the better call.

Frequently Asked Questions

Why does Replicate have long cold starts?

Replicate scales custom model deployments to zero when idle to reduce cost. When a new request arrives, the container must boot, initialize CUDA, and load model weights into GPU VRAM - a process that takes 30–120 seconds for large custom models. Public models on Replicate do not have this problem because they are kept warm by Replicate's infrastructure.

Does Replicate charge for cold start time?

No. Replicate only charges per second of active GPU compute. Queue time and cold start wait time are not billed. However, users still experience the wait time even if you are not charged for it.

How do I keep a Replicate model warm?

Use Replicate's Deployments product and set minimum instances to at least 1 in the deployment settings. This keeps one worker always running and ready. A warm T4 instance costs approximately $0.81/hour ($19.44/day) on Replicate's per-second billing.

What is the cheapest way to eliminate cold starts on Replicate?

Use public models (SDXL, Flux Schnell, Flux Dev) instead of custom deployments - they are kept warm by Replicate with no additional cost. If you need custom containers, setting min-instances to 1 with a T4 GPU ($19.44/day) is the cheapest always-warm option.

Are Replicate fine-tuned models affected by cold starts?

Models created through Replicate's Trainings (fine-tuning) product were significantly improved in 2024 to have sub-1-second cold starts, according to Replicate's blog. Custom Docker containers are a separate case and still experience 30–120 second cold starts.

How much does it cost to keep a Replicate model warm 24/7?

At Replicate's per-second billing: a T4 costs $0.000225/sec × 86,400 = $19.44/day ($583/month). An A100 80GB costs $0.001400/sec × 86,400 = $120.96/day ($3,629/month). A warm L40S runs $84.24/day. Choose the minimum hardware that handles your generation latency requirements.

What is the difference between Replicate Predictions and Deployments?

Predictions are one-off inference calls against public or private models. Deployments are persistent, configurable model endpoints with min/max instance controls, autoscaling settings, and dedicated URLs. Deployments are the right tool for production APIs where you need warm instances and consistent latency. The Predictions endpoint does not support minimum instance configuration.
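The split is visible in the REST API itself: one-off predictions post to a shared endpoint, while each deployment gets its own URL backed by your instance settings. A sketch (owner and name below are placeholders):

```python
API = "https://api.replicate.com/v1"

def prediction_url() -> str:
    """Shared endpoint for one-off predictions; no instance controls."""
    return f"{API}/predictions"

def deployment_prediction_url(owner: str, name: str) -> str:
    """Per-deployment endpoint, served by your min/max instance pool."""
    return f"{API}/deployments/{owner}/{name}/predictions"
```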

When should I consider switching from Replicate to a different provider?

Consider switching when: (1) your cold start fix cost (min-instances) exceeds your generation cost - this happens when always-warm A100 at $3,629/month is more than you spend on predictions; (2) you need custom ComfyUI workflow orchestration not supported by Replicate's container model; (3) you need lower per-second rates for high-volume sustained workloads. RunPod serverless, fal.ai, and Modal are the most common alternatives.