About to deploy a model? Know exactly how many GPUs you need before you go live.
Enter your model, GPU type, target throughput, average output tokens, and headroom. Get an estimated GPU count, cost per request, and a break-even comparison against serverless APIs, so you deploy with confidence, not guesswork.
Model & Hardware
⚠ This GPU cannot fit the selected model — choose a larger GPU.
Traffic Profile
Target throughput (RPS)
Average output tokens (tok)
GPUs Required at Peak
Based on a throughput baseline of ~1,200 output tok/s on a single A100 80GB running a 7B model with vLLM. The baseline is scaled by GPU compute factor, model efficiency, engine multiplier, and headroom.
Multi-GPU scaling is sublinear (1.75× for 2 GPUs, 3.2× for 4). Actual throughput will vary with your workload shape, so benchmark before committing.
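The sizing logic described above can be sketched roughly as follows. This is a minimal illustration, not the calculator's actual code: the function name, the parameter names, the way the factors multiply together, and the treatment of headroom as a fractional increase in required throughput are all assumptions. Only the 1,200 tok/s baseline and the 1.75× / 3.2× scaling figures come from the text.

```python
import math

# From the text: ~1,200 output tok/s on one A100 80GB, 7B model, vLLM.
BASELINE_TOK_S = 1200.0
# From the text: sublinear multi-GPU scaling (1 GPU is the 1.0x reference).
MULTI_GPU_SCALING = {1: 1.0, 2: 1.75, 4: 3.2}


def estimate_gpus(rps, avg_output_tokens, headroom=0.2,
                  gpu_factor=1.0, model_efficiency=1.0,
                  engine_multiplier=1.0, gpus_per_replica=1):
    """Hypothetical sizing sketch; the factor values are placeholders."""
    # Assumption: headroom is a fraction added on top of peak demand.
    required_tok_s = rps * avg_output_tokens * (1.0 + headroom)
    # Assumption: the scaling factors combine multiplicatively.
    replica_tok_s = (BASELINE_TOK_S * gpu_factor * model_efficiency
                     * engine_multiplier * MULTI_GPU_SCALING[gpus_per_replica])
    replicas = max(1, math.ceil(required_tok_s / replica_tok_s))
    return {
        "replicas": replicas,
        "total_gpus": replicas * gpus_per_replica,
        "capacity_tok_s": replicas * replica_tok_s,
    }


if __name__ == "__main__":
    print(estimate_gpus(rps=10, avg_output_tokens=250, headroom=0.2))
```

Because scaling is sublinear, a 2-GPU replica in this sketch delivers less than twice the throughput of a single GPU, so multi-GPU replicas only make sense when the model does not fit on one device.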
Assumes a flat 24/7 traffic profile. Real workloads vary by time of day. Pattern-based predictive autoscaling (on our roadmap) will learn your traffic patterns from scanner history and pre-warm replicas before spikes hit, reducing off-peak costs without degrading service during ramp-up.
Replicas needed
Output tokens / sec capacity
Est. annual GPU cost
Based on on-demand hourly rates for the selected provider. Reserved or spot pricing can reduce this by 30–70%.
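As a rough illustration of the annual cost estimate, the sketch below multiplies GPU count by an hourly rate over a full year of 24/7 operation. The function name and the `discount` parameter are hypothetical, and the hourly rate in the example is a made-up figure, not a quoted price.

```python
HOURS_PER_YEAR = 24 * 365  # flat 24/7 operation, as the calculator assumes


def annual_gpu_cost(total_gpus, hourly_rate_usd, discount=0.0):
    """Hypothetical helper: annual cost in USD for a fixed GPU fleet.

    discount models reserved/spot savings as a fraction of the on-demand
    rate (the text cites a 30-70% range, i.e. 0.3 to 0.7).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return total_gpus * hourly_rate_usd * HOURS_PER_YEAR * (1.0 - discount)


if __name__ == "__main__":
    # Illustrative only: 3 GPUs at an assumed $2.00/hour on-demand rate.
    print(annual_gpu_cost(3, 2.00))
    print(annual_gpu_cost(3, 2.00, discount=0.5))
```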
Cost per 1M output tokens
Annual GPU cost divided by annual output token volume at your stated RPS. Assumes 24/7 operation.
This figure falls as utilization rises, because idle GPUs raise the effective cost per token.
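The cost-per-token arithmetic can be sketched as below. The function name and argument names are assumptions for illustration; the formula itself (annual GPU cost divided by annual output token volume at the stated RPS) follows the description above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 24/7 operation


def cost_per_million_tokens(annual_cost_usd, rps, avg_output_tokens):
    """Hypothetical helper: USD per 1M output tokens for a fixed fleet."""
    annual_tokens = rps * avg_output_tokens * SECONDS_PER_YEAR
    return annual_cost_usd / annual_tokens * 1_000_000


if __name__ == "__main__":
    # Same fleet cost, two traffic levels: lower traffic means idle
    # capacity, so each token carries a larger share of the fixed cost.
    print(cost_per_million_tokens(52_560, rps=10, avg_output_tokens=250))
    print(cost_per_million_tokens(52_560, rps=5, avg_output_tokens=250))
```

With the fleet cost held fixed, halving the traffic doubles the cost per token in this sketch, which is the utilization effect noted above.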
Scale-Out Projections
Traffic Scenario | RPS | Replicas | Total GPUs | Annual Cost | Cost / 1M tokens
Self-Hosted vs. Serverless API
API prices are approximate public rates ($/1M output tokens) sourced from provider pricing pages. Prices change frequently, so verify them before making infrastructure decisions.
Self-hosted cost uses on-demand GPU pricing. At reserved pricing, self-hosting breaks even at lower traffic volumes.
Option | Monthly Cost | Annual Cost | Cost / 1M tokens | vs. Self-Hosted
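One way to reason about the break-even point is sketched below, under simplifying assumptions that are not taken from the calculator: the self-hosted fleet is treated as a fixed annual cost, the serverless API is billed purely per output token, and the fleet is assumed able to serve the break-even traffic. The function name and the example price are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 24/7 operation


def break_even_rps(annual_gpu_cost_usd, api_price_per_1m_usd,
                   avg_output_tokens):
    """Hypothetical helper: RPS at which API spend equals fleet cost.

    Below this rate the serverless API is cheaper; above it, the fixed
    self-hosted fleet is cheaper (as long as it has enough capacity).
    """
    api_cost_per_request = api_price_per_1m_usd / 1_000_000 * avg_output_tokens
    annual_api_cost_per_rps = api_cost_per_request * SECONDS_PER_YEAR
    return annual_gpu_cost_usd / annual_api_cost_per_rps


if __name__ == "__main__":
    # Illustrative numbers only: $52,560/year fleet vs an assumed
    # $2.00 per 1M output tokens API price.
    print(break_even_rps(52_560, api_price_per_1m_usd=2.00,
                         avg_output_tokens=250))
```

A lower fleet cost (for example, from reserved pricing) moves the break-even point to a lower RPS, consistent with the note above.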
Want an exact capacity plan for your workload? Enter your email and we'll send a detailed plan with benchmarked throughput numbers for your model and GPU combo.
✓ Got it — we'll be in touch.
Something went wrong — email us at info@paralleliq.ai
Paralleliq Scanner (piqc) scans your Kubernetes cluster in seconds. No agents, no instrumentation, nothing changes in your cluster.
* Throughput estimates assume vLLM-style continuous batching at ~70% GPU utilization. Actual numbers vary by batch size, sequence length, and quantization. Benchmark your specific workload before committing to hardware. GPU prices are on-demand estimates and may differ from reserved/spot pricing.