AI inference pricing seems simple until you try to model your costs before shipping. Per-call pricing, per-second billing, per-hour GPU rental, and token-based pricing are four different models that produce wildly different numbers at the same throughput. A team that ships without understanding which model applies to them regularly discovers they are spending 3–10x what they expected.
This guide explains how each pricing model works, what it means for image generation specifically, how to calculate total cost of ownership (including the costs that do not appear on the invoice), and how to pick the right model for your workload.
The Four AI Inference Pricing Models
1. Per-call (per-image) pricing
You pay a fixed price per API call, regardless of how long the inference takes. This is the simplest pricing model to reason about. "Background removal: $0.05 per image. 10,000 images = $500." Done.
Per-call pricing is common for managed pipeline APIs where the provider has optimized the inference path enough to offer a predictable margin. The trade-off is that the fixed price does not track actual compute: a fast run costs you the same as a slow one, and the provider's margin is built into every call. At high volume, per-second billing often becomes cheaper.
For example, Runflow prices per pipeline execution: a single call that runs background removal, then upscaling, then format conversion counts as one billable unit, not three separate API calls. This makes multi-step pipelines predictable to price in a way that per-second billing across multiple separate APIs is not.
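To make the predictability point concrete, here is a minimal sketch comparing a flat per-pipeline price with the same three steps billed per second. Every number in it (the pipeline price, the per-second rate, and the step durations) is a hypothetical placeholder, not Runflow's or any other provider's actual pricing.

```python
# Hypothetical figures for illustration only; substitute your provider's numbers.
PIPELINE_PRICE = 0.05        # flat $ per pipeline execution (assumed)
RATE_PER_SECOND = 0.00115    # $ per GPU-second (assumed)

# (fastest, slowest) duration in seconds for each step (assumed).
STEP_DURATIONS = {
    "background_removal": (2.0, 6.0),
    "upscaling": (4.0, 15.0),
    "format_conversion": (0.5, 1.5),
}


def per_call_cost(executions: int, price: float = PIPELINE_PRICE) -> float:
    """One billable unit per pipeline execution, however many steps it runs."""
    return executions * price


def per_second_cost_range(
    executions: int, rate: float = RATE_PER_SECOND
) -> tuple[float, float]:
    """Best-case and worst-case cost when every step is billed by duration."""
    fastest = sum(low for low, _ in STEP_DURATIONS.values())
    slowest = sum(high for _, high in STEP_DURATIONS.values())
    return executions * fastest * rate, executions * slowest * rate


low, high = per_second_cost_range(10_000)
print(f"per-call:   ${per_call_cost(10_000):,.2f}")
print(f"per-second: ${low:,.2f} to ${high:,.2f}")
```

The per-call figure is a single number you can budget against. The per-second figure is a range whose width depends on how much each step's duration varies.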
2. Per-second billing
You pay for the actual GPU time consumed by your inference call, measured in seconds. Replicate is the canonical example: if your Flux Dev generation takes 12 seconds on an A40, you pay 12 × (A40 per-second rate). This aligns your cost with actual compute usage but makes cost estimation harder - generation time varies by prompt complexity, image dimensions, and inference steps.
Per-second billing also means cold starts are billed. If your first request triggers a 30-second model load before inference begins, you pay for those 30 seconds at the same rate. On Replicate, cold start time is part of your compute bill.
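The arithmetic is simple enough to sketch. The example below assumes an A40 rate of about $0.00115 per second, which is an approximation used for illustration; check your provider's current pricing before relying on it.

```python
A40_RATE_PER_SECOND = 0.00115  # approximate $/s; an assumption, not a quoted price


def request_cost(
    inference_seconds: float,
    rate_per_second: float,
    cold_start_seconds: float = 0.0,
) -> float:
    """Cost of one request when every billed second, including model load, is charged."""
    return (inference_seconds + cold_start_seconds) * rate_per_second


warm = request_cost(12, A40_RATE_PER_SECOND)
cold = request_cost(12, A40_RATE_PER_SECOND, cold_start_seconds=30)
print(f"warm request: ${warm:.4f}")
print(f"cold request: ${cold:.4f}")
```

Under these assumptions the cold request costs about 3.5 times the warm one, because 30 of its 42 billed seconds are model loading.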
3. Per-hour GPU rental
You rent a GPU for a period of time and pay by the hour, regardless of whether you are running inference. This is the model for GPU cloud providers like RunPod, Vast.ai, Lambda, and CoreWeave. You get maximum flexibility - any model, any configuration, full root access - but pay for the GPU whether it is processing requests or idle.
Per-hour billing is cost-efficient when utilization is high and consistent. If your image pipeline runs at 80%+ GPU utilization continuously, per-hour pricing is typically cheaper than per-call or per-second APIs. If you have bursty, unpredictable workloads, you pay for idle time, which quickly negates the unit economics advantage.
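A rough way to see the effect of idle time is to compute the effective cost per image at different utilization levels. The hourly rate below is a hypothetical placeholder, and the sketch covers GPU cost only, not the engineering time needed to run the instance.

```python
def rental_cost_per_image(
    hourly_rate: float, seconds_per_image: float, utilization: float
) -> float:
    """Effective $ per image on a rented GPU that is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    images_per_hour = 3600 * utilization / seconds_per_image
    return hourly_rate / images_per_hour


HOURLY_RATE = 0.50       # hypothetical $/hour; use your provider's quote
SECONDS_PER_IMAGE = 12   # assumed average generation time

for utilization in (0.10, 0.30, 0.60, 0.80):
    cost = rental_cost_per_image(HOURLY_RATE, SECONDS_PER_IMAGE, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.4f} per image")
```

Cost per image is inversely proportional to utilization: dropping from 80% to 10% utilization multiplies the effective price of every image by eight.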
4. Token-based pricing (text models - less relevant for images)
Token pricing is the standard for language model APIs (OpenAI, Anthropic, etc.) and applies to some multimodal models. For pure image generation, token pricing is rare - most image APIs use per-call or per-second models. DALL-E 3 from OpenAI is priced per image at fixed resolution tiers, which is effectively per-call pricing under a different name.
Pricing Model Comparison at Common Volumes
| Pricing model | Mechanism | Cost at 10K images/month | Cost at 100K images/month | Best workload fit |
|---|---|---|---|---|
| Per-call (managed API) | Fixed $/image | ~$200–$500 | ~$1,500–$4,000 | Predictable, bursty workloads |
| Per-second (Replicate A40) | ~$0.00115/sec × ~12 sec ≈ $0.014/image | ~$140 | ~$1,400 | Consistent volume, known generation time |
| Per-hour GPU rental (A40, RunPod) | Hourly GPU rate, billed whether busy or idle | ~$100–$400, depending on utilization | ~$500–$1,500 at high utilization | High, consistent volume at 60%+ GPU utilization |
| Dedicated endpoint (HuggingFace) | GPU hour (running or not) | Single A40: ~$480/month flat | ~$480–$1,500 (scale endpoint) | Always-on, predictable latency |
The numbers above are approximations. Real costs depend on your specific model, image resolution, inference steps, and provider. Use the GPU Cost Calculator at /tools/gpu-cost-calculator to model your actual scenario.
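As a sanity check on the table, the sketch below reproduces the per-second row and estimates the monthly volume at which a flat-priced dedicated endpoint would cost the same. It uses the table's approximate figures and assumes a constant 12 seconds per image with no batching, cold starts, or queueing, so treat the result as an order-of-magnitude guide.

```python
PER_SECOND_RATE = 0.00115          # $/s, approximate figure from the per-second row
SECONDS_PER_IMAGE = 12             # assumed average generation time
FLAT_MONTHLY = 480.0               # $/month, approximate single dedicated A40 endpoint
SECONDS_PER_MONTH = 30 * 24 * 3600


def per_second_monthly(images: int) -> float:
    """Monthly cost when every image is billed by GPU seconds."""
    return images * SECONDS_PER_IMAGE * PER_SECOND_RATE


def break_even_images() -> float:
    """Monthly volume at which the flat endpoint costs the same as per-second billing."""
    return FLAT_MONTHLY / (SECONDS_PER_IMAGE * PER_SECOND_RATE)


def endpoint_capacity() -> int:
    """Upper bound on images per month for one GPU running back to back."""
    return SECONDS_PER_MONTH // SECONDS_PER_IMAGE


for volume in (10_000, 100_000):
    print(f"{volume:>7,} images: ${per_second_monthly(volume):,.0f} per-second")
print(f"break-even volume: {break_even_images():,.0f} images/month")
print(f"single-endpoint capacity: {endpoint_capacity():,} images/month")
```

With these inputs the crossover lands at roughly 35,000 images per month, well inside what one always-on GPU can serve. Below that volume the flat endpoint is the more expensive option.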
The Hidden Costs That Do Not Appear on the Invoice
The invoice cost is only part of what inference actually costs your business. Teams that optimize only for per-image API cost often undercount the full picture.
| Cost component | Per-call API | GPU rental + self-hosted | Managed pipeline platform |
|---|---|---|---|
| GPU compute | Included in per-call price | GPU rental cost | Included in per-call price |
| Model maintenance | None - provider handles | Weight downloads, updates, compatibility testing | None |
| Scaling / autoscaling | None - provider handles | Engineer time to build + maintain | None |
| Cold start management | Varies by provider | Your responsibility | Platform handles |
| Output quality monitoring | None (you build or skip) | You build or skip | Included (e.g., Sentinel on Runflow) |
| On-call engineering | None | Someone is responsible at 3am | None |
| GPU instance management | None | Instance selection, spot interruptions, restarts | None |
The "on-call engineering" line is the one that surprises teams most. Self-hosted GPU inference means someone is responsible when the server goes down at 3am. For a 5-person startup, that cost is real even if it never appears on a spreadsheet: it is the senior engineer who cannot take vacation without putting someone on GPU babysitting duty.
For a detailed breakdown of self-hosted vs managed total cost of ownership, see /cost/self-hosted-stable-diffusion-total-cost-of-ownership.
Cold Starts: The Invisible Cost in Per-Second Billing
Cold start cost is the most commonly underestimated line item in AI inference budgets. A cold start occurs when a model has been unloaded from GPU memory and must be reloaded before the next request. On serverless inference platforms, this happens whenever a model has been idle long enough for the provider to reclaim the GPU.
On per-second billing (Replicate), cold start time is directly billed. A 30-second cold start at an A40 rate of roughly $0.00115 per second adds ~$0.035 to the cost of that request, which is more than the inference itself for fast models.
On per-call APIs, cold starts affect latency but not your invoice. The provider absorbs the cost but passes it back as response time variance. If your product requires consistent latency, a provider with warm model pools is worth the potential per-call premium.
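What matters for a budget is not the cost of one cold request but how often cold starts happen. The sketch below computes the average cost per request given the fraction of requests that hit a cold model. The 3-second inference time and the 10% cold fraction are hypothetical; the rate is the approximate A40 figure used in this guide.

```python
def blended_cost_per_request(
    inference_seconds: float,
    cold_start_seconds: float,
    cold_fraction: float,
    rate_per_second: float,
) -> float:
    """Average cost per request when `cold_fraction` of requests pay for a model load."""
    if not 0 <= cold_fraction <= 1:
        raise ValueError("cold_fraction must be between 0 and 1")
    expected_seconds = inference_seconds + cold_fraction * cold_start_seconds
    return expected_seconds * rate_per_second


RATE = 0.00115  # approximate A40 $/s (assumption)

always_warm = blended_cost_per_request(3, 30, 0.0, RATE)
one_in_ten_cold = blended_cost_per_request(3, 30, 0.10, RATE)
print(f"always warm:     ${always_warm:.5f} per request")
print(f"10% cold starts: ${one_in_ten_cold:.5f} per request")
```

For a fast model, one cold start in every ten requests is enough to double the average cost per request under these assumptions.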
| Provider type | Cold start latency | Cold start billing | Mitigation |
|---|---|---|---|
| HuggingFace free tier | 30–120 seconds | Not billed (shared queue) | Use PRO or Dedicated Endpoints |
| Replicate | 10–45 seconds | Billed as compute time | Replicate Deployments (keep warm) |
| fal.ai | 2–10 seconds typical | Billed as compute time | Cold starts are shorter than Replicate's by design |
| GPU rental (RunPod, Vast.ai) | First load only (keep warm yourself) | Included in hourly rate | Keep instance running |
| Runflow | Minimal - warm pool | Not separately billed | Architecture handles it |
For measured cold start data across providers, see the benchmark at /deploy/gpu-cold-start-benchmarks. For a detailed explanation of what causes cold starts and how different architectures handle them, see /deploy/replicate-cold-start-fix.
Estimating Your Monthly Inference Cost
A simple framework for cost estimation before you commit to a provider:
1. Measure your expected volume: images per day, peak vs average, burst patterns.
2. Estimate your generation time: most providers publish benchmark latencies for common models.
3. Choose a billing model: per-call if volume is bursty; per-second if volume is steady; per-hour rental only if you can guarantee 60%+ GPU utilization.
4. Add the hidden costs: is anyone managing infrastructure? Is quality monitoring needed?
5. Run the number at 3× your expected volume: most products grow faster than planned.
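The steps above can be sketched as a small estimator. This is one possible reading of the rule of thumb, not a definitive model: the `Workload` fields, the function names, and the prices in `PRICES` are all illustrative assumptions, and `hidden_monthly` is whatever figure you assign to the non-invoice costs.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    images_per_month: int          # step 1: expected volume
    seconds_per_image: float       # step 2: estimated generation time
    bursty: bool                   # step 1: burst pattern
    gpu_utilization: float = 0.0   # 0..1, only used for per-hour rental


def choose_billing_model(w: Workload) -> str:
    """Step 3 as a rule of thumb, using the thresholds from the list above."""
    if w.bursty:
        return "per-call"
    if w.gpu_utilization >= 0.60:
        return "per-hour"
    return "per-second"


def monthly_cost(w: Workload, model: str, *, price_per_call: float,
                 rate_per_second: float, hourly_rate: float,
                 hidden_monthly: float = 0.0,
                 volume_multiplier: float = 1.0) -> float:
    """Steps 4 and 5: add non-invoice costs and scale the volume."""
    images = w.images_per_month * volume_multiplier
    if model == "per-call":
        compute = images * price_per_call
    elif model == "per-second":
        compute = images * w.seconds_per_image * rate_per_second
    elif model == "per-hour":
        if w.gpu_utilization <= 0:
            raise ValueError("per-hour rental needs a positive utilization")
        busy_hours = images * w.seconds_per_image / 3600
        compute = hourly_rate * busy_hours / w.gpu_utilization
    else:
        raise ValueError(f"unknown billing model: {model}")
    return compute + hidden_monthly


# Placeholder prices; replace with quotes from the providers you are comparing.
PRICES = dict(price_per_call=0.05, rate_per_second=0.00115, hourly_rate=0.50)

workload = Workload(images_per_month=10_000, seconds_per_image=12,
                    bursty=False, gpu_utilization=0.30)
model = choose_billing_model(workload)
expected = monthly_cost(workload, model, **PRICES)
stressed = monthly_cost(workload, model, volume_multiplier=3, **PRICES)
print(f"{model}: ${expected:,.2f} expected, ${stressed:,.2f} at 3x volume")
```

Running the same workload through each billing model, rather than only the recommended one, is a quick way to see how far apart the options are before you commit.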
The GPU Cost Calculator at /tools/gpu-cost-calculator automates steps 1–3 and lets you compare per-call, per-second, and per-hour billing for your specific model and volume.
When Each Pricing Model Makes Sense
Per-call pricing - right for:
- Teams that want cost predictability above all else.
- Bursty workloads where you cannot predict volume day to day.
- Products in early growth where volume is low and invoice simplicity matters more than unit economics.
- Multi-step pipelines where a single per-call price covers the full chain.
Per-second billing - right for:
- Teams comfortable modeling cost as (generation_time × rate).
- Stable, predictable volume where you know your average generation time.
- Use cases where model choice matters enough to pay for per-second flexibility.
Per-hour GPU rental - right for:
- High-volume teams (50,000+ images per month) with consistent throughput.
- Teams with engineering capacity to manage GPU infrastructure.
- Custom model requirements not available on managed platforms.

Rental is never the right choice for teams that cannot guarantee sustained GPU utilization.
Spot Instances: Lower Cost, Higher Risk
GPU cloud providers (RunPod, Vast.ai, Lambda) offer spot instances: unused GPU capacity at discounts of 30-70% versus on-demand pricing. The trade-off is interruption risk - the provider can reclaim the GPU on short notice (typically 30-60 seconds) if the hardware is needed for a higher-priority job. For batch processing (generating a catalog of 10,000 product images overnight), spot instances are a strong cost optimization. For real-time user-facing generation, an interruption means failed requests: retry logic recovers them, but users may still see latency spikes. Most teams use spot for batch and on-demand for real-time.
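Batch jobs tolerate interruptions well only if they can resume where they stopped. The sketch below shows one way to structure that: record each finished item as soon as it completes, and skip recorded items after a restart. `SpotInterruption`, `generate`, and the in-memory `completed` set are hypothetical stand-ins; a real job would persist progress to durable storage and react to whatever interruption signal its provider actually sends.

```python
class SpotInterruption(Exception):
    """Hypothetical signal that the provider is reclaiming the instance."""


def run_batch(items, generate, completed):
    """Process items not yet in `completed`, recording each success immediately."""
    for item in items:
        if item in completed:
            continue
        generate(item)        # may raise SpotInterruption mid-batch
        completed.add(item)   # in a real job, write this to durable storage


def run_until_done(items, generate, max_restarts=5):
    """Restart the batch after each interruption, skipping finished items."""
    completed = set()
    restarts = 0
    while True:
        try:
            run_batch(items, generate, completed)
            return completed, restarts
        except SpotInterruption:
            restarts += 1
            if restarts > max_restarts:
                raise


def make_flaky_generator(fail_on_calls):
    """Test double: raises SpotInterruption on the given call numbers."""
    calls = {"n": 0}

    def generate(item):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise SpotInterruption(item)

    return generate


catalog = [f"sku-{i}" for i in range(10)]
done, restarts = run_until_done(catalog, make_flaky_generator({3, 7}))
print(f"generated {len(done)} images with {restarts} restarts")
```

Because progress is recorded per item, each interruption costs at most the one image that was in flight, not the whole batch.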
Batching: The Most Underused Cost Reduction
Batching means grouping multiple image generation requests into a single GPU inference call. Instead of calling the model 10 times with 1 image each, you call it once with a batch of 10 images. The GPU processes the batch in parallel, so total time is often well below ten times a single call, and fixed per-call overhead is paid once instead of ten times. On per-second billing, the saving approaches the batch size factor only when the batch takes little longer than a single call, so measure batch time on your own model before budgeting around it. On per-call billing, batching may not be available; where it is, check whether the provider counts a batch as one call or bills each image in it.
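How much batching saves on per-second billing depends on how batch time grows with batch size. The sketch below assumes a simple linear timing model, with a fixed overhead shared by the batch plus a marginal time per image. Both timing constants are made-up values chosen so that a single image takes 12 seconds; measure your own model to replace them.

```python
RATE_PER_SECOND = 0.00115          # approximate A40 $/s (assumption)
FIXED_SECONDS = 8.0                # assumed overhead shared by the whole batch
SECONDS_PER_IMAGE_IN_BATCH = 4.0   # assumed marginal time per image


def batch_seconds(batch_size: int) -> float:
    """Assumed linear timing model: shared overhead plus marginal time per image."""
    return FIXED_SECONDS + SECONDS_PER_IMAGE_IN_BATCH * batch_size


def cost_per_image(batch_size: int, rate: float = RATE_PER_SECOND) -> float:
    """Per-second cost of one batch, split evenly across its images."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_seconds(batch_size) * rate / batch_size


for size in (1, 2, 4, 8):
    print(f"batch of {size}: ${cost_per_image(size):.5f} per image")
```

Under this model the cost per image falls toward the marginal cost as the batch grows, but it never drops below it, so the saving is smaller than the batch size factor whenever each extra image adds meaningful time.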
The practical constraint on batching is VRAM. Each additional image in a batch increases GPU memory requirements. On a 24GB GPU (RTX 3090), you can typically batch 4-8 images at 1024x1024 with SDXL. Exceeding available VRAM causes OOM (out of memory) errors. The optimal batch size requires experimentation against your specific GPU and resolution. For a guide to VRAM requirements by model and resolution, see the VRAM Calculator at /tools/vram-calculator.
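One common way to run that experiment is to start with an optimistic batch size and halve it each time the GPU runs out of memory. The sketch below is framework-agnostic: `run_batch` is a hypothetical callable that performs one inference at the given batch size, and `oom_errors` should be set to whatever out-of-memory exception your framework raises (Python's built-in `MemoryError` is used here only so the example runs without a GPU).

```python
def find_max_batch_size(run_batch, start=16, minimum=1, oom_errors=(MemoryError,)):
    """Halve the batch size until `run_batch(size)` stops raising an OOM error."""
    size = start
    while size >= minimum:
        try:
            run_batch(size)
            return size
        except oom_errors:
            size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("even the minimum batch size does not fit in memory")


def fake_run_batch(size, vram_limit_images=6):
    """Test double: pretends the GPU can hold at most `vram_limit_images` images."""
    if size > vram_limit_images:
        raise MemoryError(f"batch of {size} exceeds simulated VRAM")


best = find_max_batch_size(fake_run_batch, start=16)
print(f"largest batch size that fit: {best}")
```

Halving finds a safe size quickly but can undershoot: in the simulated run it settles on 4 even though 6 would fit. If the gap matters, follow up with a linear search between the last success and the last failure.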