Inference Capacity Planner

Model & Hardware

Model Size

GPU Type

⚠ This GPU cannot fit the selected model — choose a larger GPU.

Inference Engine

Throughput multiplier vs. vLLM baseline:
· vLLM — 1.0× (continuous batching)
· TGI — 0.85× (~15% slower)
· TensorRT-LLM — 1.35× (~35% faster)

Choose the engine you plan to deploy. TensorRT-LLM is the fastest of the three but requires a separate compilation step for each model.

Cloud Provider

Traffic Profile

Requests / Second (peak)

Use your expected peak traffic, not the average. Inference clusters must absorb bursts, and sizing for average load will cause failures during spikes.

If you don't know your peak yet, use 2–3× your expected average as a starting point.
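As a rough illustration of the 2–3× rule of thumb above, here is a minimal sketch that turns an expected average into a starting range for peak RPS. The function name `estimate_peak_rps` and its parameters are hypothetical and are not part of the planner itself.

```python
def estimate_peak_rps(avg_rps: float,
                      low_factor: float = 2.0,
                      high_factor: float = 3.0) -> tuple[float, float]:
    """Return a (low, high) starting range for peak RPS.

    Applies the 2-3x rule of thumb to an expected average request rate.
    This is a placeholder until real traffic measurements are available.
    """
    if avg_rps < 0:
        raise ValueError("avg_rps must be non-negative")
    return avg_rps * low_factor, avg_rps * high_factor


low, high = estimate_peak_rps(10.0)
print(f"Plan for roughly {low:.0f} to {high:.0f} RPS at peak")
```

Once you have measured traffic, replace this estimate with the observed peak.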

RPS

Avg Output Tokens (per request)

Output (generated) tokens drive throughput demand, because each generated token requires a GPU forward pass.

Input (prompt) tokens affect VRAM through the KV cache but are processed faster. Typical ranges:
· Chat: 100–300 tokens
· Summarization: 200–500 tokens
· Code generation: 300–1000 tokens
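Since output tokens drive throughput demand, peak demand can be sketched as peak RPS multiplied by average output tokens per request. The function name and the example inputs below are illustrative assumptions, not values taken from the planner.

```python
def output_token_demand(peak_rps: float, avg_output_tokens: float) -> float:
    """Peak output-token demand in tokens per second.

    Every request generates avg_output_tokens on average, and each
    generated token costs one forward pass, so demand scales with both.
    """
    if peak_rps < 0 or avg_output_tokens < 0:
        raise ValueError("inputs must be non-negative")
    return peak_rps * avg_output_tokens


# Hypothetical chat workload: 20 requests/s at 250 output tokens each.
demand = output_token_demand(peak_rps=20, avg_output_tokens=250)
print(f"{demand:,.0f} output tok/s")  # 5,000 output tok/s
```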

tok

Spare Capacity (headroom)

Buffer above your peak load so traffic spikes don't saturate the cluster:
· 10% — lean, suitable for stable/predictable traffic
· 25% — standard production recommendation
· 40% — conservative, for unpredictable workloads
· 50% — high availability / mission-critical

At 25% headroom, your cluster runs at a maximum of 75% utilization at peak.
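The relationship between headroom and utilization can be sketched as follows. This assumes headroom is the fraction of total capacity held in reserve, which is the reading consistent with "25% headroom means 75% utilization at peak". The planner's exact formula is not published here, so treat the function names and the formula as assumptions.

```python
def peak_utilization(headroom: float) -> float:
    """Fraction of cluster capacity in use at peak load."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return 1.0 - headroom


def required_capacity(peak_demand_tok_s: float, headroom: float) -> float:
    """Capacity (tok/s) to provision so peak demand fills only
    (1 - headroom) of the cluster. Assumed formula, see lead-in."""
    return peak_demand_tok_s / peak_utilization(headroom)


print(peak_utilization(0.25))                 # 0.75
print(round(required_capacity(5000, 0.25)))   # 6667
```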

GPUs Required at Peak

Based on a throughput baseline of ~1,200 output tok/s on a single A100-80 running a 7B model with vLLM. This baseline is scaled by GPU compute factor, model efficiency, engine multiplier, and headroom.

Multi-GPU scaling is sublinear (1.75× for 2 GPUs, 3.2× for 4). Actual throughput will vary with your workload shape — benchmark before committing.
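The sizing logic described above can be sketched roughly as below. The 1,200 tok/s baseline, the engine multipliers, and the multi-GPU scaling factors come from this page. The `gpu_factor` and `model_efficiency` inputs are hypothetical placeholders, because their actual values are not listed here. The way replicas and GPUs are combined, and the headroom formula, are also assumptions rather than the planner's confirmed method.

```python
import math

BASELINE_TOK_S = 1200.0  # single A100-80, 7B model, vLLM (from this page)
ENGINE_MULTIPLIER = {"vllm": 1.0, "tgi": 0.85, "tensorrt-llm": 1.35}
MULTI_GPU_SCALING = {1: 1.0, 2: 1.75, 4: 3.2}  # sublinear scaling


def replica_throughput(gpu_factor: float, model_efficiency: float,
                       engine: str, gpus_per_replica: int = 1) -> float:
    """Estimated output tok/s for one replica (assumed composition)."""
    per_gpu = (BASELINE_TOK_S * gpu_factor * model_efficiency
               * ENGINE_MULTIPLIER[engine])
    return per_gpu * MULTI_GPU_SCALING[gpus_per_replica]


def gpus_required(peak_rps: float, avg_output_tokens: float, headroom: float,
                  gpu_factor: float = 1.0, model_efficiency: float = 1.0,
                  engine: str = "vllm",
                  gpus_per_replica: int = 1) -> tuple[int, int]:
    """Return (replicas, total_gpus) needed at peak."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    demand = peak_rps * avg_output_tokens
    capacity_needed = demand / (1.0 - headroom)  # assumed headroom formula
    per_replica = replica_throughput(gpu_factor, model_efficiency,
                                     engine, gpus_per_replica)
    replicas = math.ceil(capacity_needed / per_replica)
    return replicas, replicas * gpus_per_replica


# 20 RPS at 250 output tokens with 25% headroom, baseline hardware and model.
print(gpus_required(20, 250, 0.25))                         # (6, 6)
print(gpus_required(20, 250, 0.25, engine="tensorrt-llm"))  # (5, 5)
print(gpus_required(20, 250, 0.25, gpus_per_replica=2))     # (4, 8)
```

The last call shows the cost of sublinear scaling: two-GPU replicas need 8 GPUs in total where single-GPU replicas need 6. That only pays off when the model does not fit on one GPU.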


Assumes a flat 24/7 traffic profile. Real workloads vary by time of day. Pattern-based predictive autoscaling (on our roadmap) will learn your traffic patterns from scanner history and pre-warm replicas before spikes hit, reducing off-peak costs without degrading performance during ramp-up.


Replicas needed


Output tokens / sec capacity


Est. annual GPU cost

Based on on-demand hourly rates for the selected provider. Reserved or spot pricing can reduce this by 30–70%.


Cost per 1M output tokens

Annual GPU cost divided by annual output token volume at your stated RPS. Assumes 24/7 operation.

This improves as your utilization increases — idle GPUs raise the effective cost per token.
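The cost-per-token arithmetic can be sketched as follows, assuming 24/7 operation over a 365-day year. The $2.00/hour GPU rate is a made-up example, not a quoted provider price, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365                # 8,760
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600  # 31,536,000


def annual_gpu_cost(total_gpus: int, hourly_rate_usd: float) -> float:
    """Annual on-demand cost for GPUs running 24/7."""
    return total_gpus * hourly_rate_usd * HOURS_PER_YEAR


def cost_per_million_tokens(annual_cost_usd: float, rps: float,
                            avg_output_tokens: float) -> float:
    """Annual GPU cost divided by annual output tokens, in millions."""
    annual_tokens = rps * avg_output_tokens * SECONDS_PER_YEAR
    if annual_tokens <= 0:
        raise ValueError("token volume must be positive")
    return annual_cost_usd / (annual_tokens / 1_000_000)


# Hypothetical: 6 GPUs at $2.00/hour serving 20 RPS x 250 output tokens.
cost = annual_gpu_cost(6, 2.00)
print(f"${cost:,.0f} per year")                                       # $105,120 per year
print(f"${cost_per_million_tokens(cost, 20, 250):.2f} per 1M tokens")  # $0.67 per 1M tokens
```

Halving the request rate on the same 6 GPUs doubles the cost per 1M tokens. This is the idle-GPU effect described above.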

Scale-Out Projections

Traffic Scenario	RPS	Replicas	Total GPUs	Annual Cost	Cost / 1M tokens

Self-Hosted vs. Serverless API

API prices are approximate public rates ($/1M output tokens) sourced from provider pricing pages. Prices change frequently, so verify them before making infrastructure decisions.

Self-hosted cost uses on-demand GPU pricing. With reserved pricing, self-hosting becomes cheaper than the API at lower traffic volumes.
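To make the break-even idea concrete, here is a minimal sketch that compares a fixed self-hosted cluster against per-token API pricing. The $2.00/hour GPU rate, the $1.00 per 1M token API price, and the 40% reserved discount are all hypothetical illustration values. The function names are likewise assumptions.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # 730, assuming 24/7 operation


def monthly_self_hosted_cost(total_gpus: int, hourly_rate_usd: float,
                             discount: float = 0.0) -> float:
    """Monthly GPU cost. discount models reserved or spot pricing
    as a fraction off the on-demand rate (0.0 = on-demand)."""
    return total_gpus * hourly_rate_usd * (1.0 - discount) * HOURS_PER_MONTH


def monthly_api_cost(monthly_output_tokens: float,
                     api_price_per_1m_usd: float) -> float:
    """Serverless API cost for a given monthly output-token volume."""
    return monthly_output_tokens / 1_000_000 * api_price_per_1m_usd


def break_even_tokens_per_month(self_hosted_monthly_usd: float,
                                api_price_per_1m_usd: float) -> float:
    """Monthly token volume above which the fixed cluster is cheaper."""
    return self_hosted_monthly_usd / api_price_per_1m_usd * 1_000_000


on_demand = monthly_self_hosted_cost(6, 2.00)
reserved = monthly_self_hosted_cost(6, 2.00, discount=0.40)
for label, cost in (("on-demand", on_demand), ("reserved", reserved)):
    tokens = break_even_tokens_per_month(cost, api_price_per_1m_usd=1.00)
    print(f"{label}: ${cost:,.0f}/month, break-even at "
          f"{tokens / 1e9:.2f}B tokens/month")
```

With these example numbers, the reserved discount lowers the break-even volume from about 8.76B to about 5.26B output tokens per month.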

Option	Monthly Cost	Annual Cost	Cost / 1M tokens	vs. Self-Hosted


Paralleliq Scanner (piqc) scans your Kubernetes cluster in seconds.
No agents, no instrumentation, nothing changes in your cluster.

* Throughput estimates assume vLLM-style continuous batching at ~70% GPU utilization. Actual numbers vary by batch size, sequence length, and quantization. Benchmark your specific workload before committing to hardware. GPU prices are on-demand estimates and may differ from reserved/spot pricing.