Self Hosted AI Hardware Calculator

VRAM, GPU, power, and cloud breakeven months for local LLMs

Pick a model size, quantization, context length, and target. Get a real hardware recommendation in a single screen. No login, no sales calls, no fake numbers - the math runs in your browser using the same formulas you would write by hand from the llama.cpp and vLLM source notes.

Updated May 26, 2026 - By Jim Liu (Jim Liu has built homelab rigs for Llama, Mistral, and Qwen across consumer GPUs, Apple Silicon, and dual-3090 setups)

TL;DR - what the calculator gives you

  • A real VRAM number - model weights plus KV cache plus activation overhead, computed live in the browser using model-specific layer and head dimensions, not a rule of thumb.
  • A named GPU pick - RTX 3090, RTX 4090, RTX 6000 Ada, H100 80GB, or a multi-GPU stack of 3090s. Apple Silicon SKUs (M3 Pro, M3 Max, M3 Ultra) when you pick unified memory.
  • Power and PSU sizing - total system draw multiplied by a 1.3x headroom factor, then rounded up to a real PSU size you can buy.
  • A cloud breakeven number - DIY hardware cost divided by monthly savings vs a rented GPU instance of similar VRAM. Useful for the "buy or rent" decision.
  • A warning when your config does not fit, rather than a silent downgrade. CPU-only above 30B, single-GPU above 48 GB, or 6x card stacks all surface a clear warning instead of a misleading number.

Self-hosted AI hardware calculator

Pick a model size, quantization, context length, and target. Get VRAM, system RAM, GPU pick, power draw, and cloud breakeven months. Live recompute, no Calculate button.

For Llama 70B Q4 you need an RTX 6000 Ada. DIY breaks even in 16 months vs renting cloud GPU.

Minimum VRAM

44.5 GB

weights 36.5 + KV 2.5 + overhead

Recommended hardware

RTX 6000 Ada

48 GB VRAM per card

System RAM

67 GB

1.5x VRAM for model load + OS

Power draw

450 W

Recommended PSU: 650 W (80+ Gold)

DIY hardware cost

$8,600

GPU + host build (CPU, board, RAM, PSU, case)

Cloud breakeven

16 months

Equiv cloud GPU: $600/mo

Curated configurations

Presets are real-world configurations from r/LocalLLaMA homelab posts (May 2026 reference). GPU prices, Apple Silicon SKUs, and cloud rates are reviewed quarterly.

Next step

Have a target hardware in mind? Compare the inference quality of the models you can fit on it, then decide whether self-hosting beats a hosted API for your workload.

Why use a self hosted AI hardware calculator

The popular rule of thumb "you need 2 bytes per parameter at FP16" is correct as far as it goes - a 7B FP16 model weighs about 14 GB. What that rule misses is everything that also lives in VRAM: the KV cache that grows linearly with context length and concurrent users, the activation memory during prompt processing, and the CUDA workspace the runtime carves out. By the time you load a 70B Q4 model with a 32K context and serve five concurrent sessions, a weights-only estimate is off by roughly 50 GB (about 10 GB of KV cache per 32K session, times five). People who size on weights alone order a 4090, find that the model OOMs the first time anyone sends a long prompt, and end up reselling the card.

The calculator above does the full math. Model weights use the real bytes-per-parameter for the quantization scheme you picked (Q4_K_M includes its per-block metadata, FP16 does not). KV cache uses the actual num_layers and num_kv_heads for that model family, not a generic average. Concurrent users multiply the KV term. Overhead scales with parameter count. The number it returns is the working set you actually need to fit on the card, not the model file size.

The same logic applies in reverse to cloud breakeven. A common mistake is comparing the sticker price of an RTX 4090 to the hourly cost of an A100 instance and concluding cloud is always cheaper. In practice the 4090 lasts three years, draws power at most eight hours a day for a hobbyist, and the cloud A100 you would actually rent is closer to 600 dollars per month always-on. Run the numbers in the tool and the breakeven for a 4090 is typically 4-8 months at moderate use. That is the answer that lets you make a real buy-or-rent decision.

How VRAM is sized: the three terms

Self-hosted inference VRAM is the sum of three terms. The largest is model weights: number of parameters times bytes per parameter. For a Q4_K_M GGUF that is roughly 0.56 bytes per param (4 bits plus per-block scale and zero-point metadata), so 70 billion params lands near 39 GB. For FP16 that is exactly 2 bytes per param, so the same 70B model is 140 GB - which is why nobody runs 70B at FP16 outside a datacenter.

The second term is the KV cache. Every forward pass writes a key and a value vector for each token, in every attention layer, for every KV head. The memory needed is 2 (K and V) x num_layers x num_kv_heads x head_dim x context_length x 2 bytes (FP16 cache entries). Many modern models use grouped-query attention, commonly with 8 KV heads regardless of total head count, which keeps the cache surprisingly small - Llama 70B at 8K context is only about 2.5 GB of KV per session. Multi-session deployments multiply this term by concurrent users, which is where multi-user servers blow past the "just buy a 4090" answer.
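The KV formula above can be sketched in a few lines of Python. This is an illustration of the stated formula, not the calculator's actual source; the function name and the `bytes_per_value` parameter are assumptions introduced here, and the Llama 70B dimensions (80 layers, 8 KV heads, head_dim 128) are the ones quoted in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   concurrent_users=1, bytes_per_value=2):
    """KV cache size in bytes for the formula quoted in the text.

    The leading 2 accounts for storing both K and V; bytes_per_value=2
    assumes FP16 cache entries.
    """
    per_session = (2 * num_layers * num_kv_heads * head_dim
                   * context_length * bytes_per_value)
    return per_session * concurrent_users


# Llama 70B at 8K context, one session.
one_user = kv_cache_bytes(80, 8, 128, 8192)
print(one_user / 2**30)  # 2.5 (GiB)

# Five concurrent sessions scale the term linearly.
five_users = kv_cache_bytes(80, 8, 128, 8192, concurrent_users=5)
print(five_users / 2**30)  # 12.5 (GiB)
```

Note that the 2.5 figure comes out exactly when bytes are divided by 2^30 (GiB); in decimal gigabytes the same cache is about 2.68 GB.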

The third term is activation and CUDA workspace overhead. During prompt processing, the model materializes intermediate tensors that vanish at the end of the forward pass. The calculator models this as a flat 2 GB plus 0.05 GB per billion parameters (5 percent of the parameter count in billions) - small for a 7B, meaningful for a 70B. This is the term most rule-of-thumb estimates skip, and it is the term that turns a "close to fit" configuration into an OOM at runtime.

Worked example: Llama 70B Q4_K_M, 8K context, 1 user

  • Weights: 70 x 1e9 x 0.56 bytes = ~39.2 GB
  • KV cache: 2 x 80 layers x 8 KV heads x 128 head_dim x 8192 ctx x 2 bytes = ~2.5 GB
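  • Overhead: 2 GB + 0.05 GB x 70 (billions of params) = 5.5 GB
  • Total: ~47.2 GB - just under the 48 GB of an RTX 6000 Ada, and comfortably below an 80 GB H100 or A100 80GB. The widget's 44.5 GB total is lower because it appears to count weights in binary units: 39.2 x 1e9 bytes is about 36.5 GiB.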
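The worked example can be reproduced with a short Python sketch. It follows the three terms as described in this article; the class and function names are hypothetical, and everything is reported in decimal GB (1e9 bytes), which is an assumption made here for consistency.

```python
from dataclasses import dataclass


@dataclass
class VramEstimate:
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


def estimate_vram(params_billion, bytes_per_param, num_layers, num_kv_heads,
                  head_dim, context_length, concurrent_users=1):
    """Three-term estimate: weights + KV cache + overhead, in decimal GB."""
    # Billions of params times bytes per param gives decimal GB directly.
    weights_gb = params_billion * bytes_per_param
    # 2 (K and V) x layers x KV heads x head_dim x context x 2 bytes (FP16).
    kv_bytes = (2 * num_layers * num_kv_heads * head_dim
                * context_length * 2 * concurrent_users)
    kv_cache_gb = kv_bytes / 1e9
    # Flat 2 GB plus 0.05 GB per billion parameters, as stated in the text.
    overhead_gb = 2.0 + 0.05 * params_billion
    return VramEstimate(weights_gb, kv_cache_gb, overhead_gb)


est = estimate_vram(70, 0.56, 80, 8, 128, 8192)
print(f"{est.weights_gb:.1f} + {est.kv_cache_gb:.1f} + "
      f"{est.overhead_gb:.1f} = {est.total_gb:.1f} GB")
```

With everything in decimal GB the KV term is about 2.7 GB rather than 2.5, so the total lands near 47.4 GB instead of 47.2. The difference is only unit rounding; either way the working set sits just under 48 GB.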

Quantization: FP16 vs INT8 vs Q4_K_M

| Scheme | Bytes per param | 7B file size | Quality loss vs FP16 | Typical use |
|---|---|---|---|---|
| FP16 | 2.00 | ~14 GB | baseline | Datacenter, full quality |
| INT8 | 1.00 | ~7 GB | negligible | Production single-GPU, low risk |
| Q8_0 | 1.06 | ~7.4 GB | almost none | llama.cpp near-lossless |
| Q5_K_M | 0.70 | ~4.9 GB | small (1-2% on benchmarks) | llama.cpp balanced default |
| Q4_K_M | 0.56 | ~3.9 GB | moderate (2-4% on benchmarks) | llama.cpp tight fit |
| INT4 | 0.50 | ~3.5 GB | noticeable on reasoning | Aggressive consumer fit |

Q4_K_M and Q5_K_M are GGUF formats from llama.cpp that use mixed precision within each tensor block - the "K" means K-quants, "M" means medium-size variant. The metadata overhead is small but real, which is why the calculator uses effective bytes per param (0.56 for Q4_K_M, 0.70 for Q5_K_M) rather than the nominal 0.5 / 0.625. The honest answer for "which quantization to pick" is: start at Q5_K_M, drop to Q4_K_M only if you cannot fit, and never go below Q4 for reasoning-heavy tasks.
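The "start at Q5_K_M, drop to Q4_K_M only if you cannot fit" advice can be expressed as a small lookup. This is a sketch under the assumption that only the weight term is being budgeted; the bytes-per-param values are copied from the table in this section, and the function names are made up for illustration.

```python
# Effective bytes per parameter, taken from the quantization table.
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "INT8": 1.00,
    "Q8_0": 1.06,
    "Q5_K_M": 0.70,
    "Q4_K_M": 0.56,
    "INT4": 0.50,
}


def weights_gb(params_billion, scheme):
    """Weight memory in decimal GB for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[scheme]


def pick_gguf_quant(params_billion, weight_budget_gb,
                    preference=("Q5_K_M", "Q4_K_M")):
    """Return the first scheme in preference order whose weights fit.

    Returns None when nothing in the preference list fits, mirroring the
    advice not to drop below Q4 for reasoning-heavy work.
    """
    for scheme in preference:
        if weights_gb(params_billion, scheme) <= weight_budget_gb:
            return scheme
    return None


print(pick_gguf_quant(7, 20))    # Q5_K_M
print(pick_gguf_quant(70, 40))   # Q4_K_M
print(pick_gguf_quant(70, 24))   # None
```

In practice the budget passed in should already have KV cache and overhead subtracted from the card's VRAM, since those terms compete for the same memory.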

GPU selection: 3090, 4090, RTX 6000, H100

The calculator picks consumer cards first because that is where the price-per-VRAM curve is most generous. An RTX 3090 24 GB at street price (around 850 USD used in May 2026) is still the best dollar-per-GB GPU you can put in a consumer build. The 4090 jumps you to roughly 2.2x the cost for the same 24 GB but with 60-70 percent more tokens per second on most workloads, so the right way to choose between them is throughput-bound vs budget-bound, not VRAM-bound.

Above 24 GB the consumer market goes thin: the RTX 5090 ships with 32 GB, the RTX 6000 Ada has 48 GB at prosumer pricing (about 7,000 USD), and beyond that you are in datacenter SKUs - A100 40/80 GB, H100 80 GB - at 9,000 to 28,000 USD per card. The calculator only routes you to datacenter cards when single-GPU fit demands more than 48 GB. Below that ceiling, the answer is almost always "stack two 3090s on a board with x8/x8 PCIe lanes" for a fraction of the price of one A100.

Multi-GPU has two non-obvious costs. The first is PCIe lane sharing: if your motherboard splits to x4/x4 because of an NVMe in the wrong slot, inference can lose 20-30 percent of its throughput compared to x8/x8. The second is power and case clearance: two 350 W 3090s need either a 1200 W PSU and an open-frame case or a dual-PSU build. The calculator factors in the extra 300 USD of riser cards and splitter cables that a real multi-GPU build needs, which is the cost most YouTube build-logs gloss over.
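One way to picture the selection logic is a cheapest-fit search over a short card list. The sketch below is a guess at the shape of such a picker, not the calculator's actual code: the 3090, 4090, and RTX 6000 Ada prices are derived from figures in this section, the H100 price is a placeholder at the top of the quoted datacenter range, and `MAX_STACK` is an assumed limit based on the warning for 6x card stacks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    price_usd: int


SINGLE_CARDS = [
    Gpu("RTX 3090", 24, 850),        # used street price quoted in the text
    Gpu("RTX 4090", 24, 1870),       # roughly 2.2x the 3090 price
    Gpu("RTX 6000 Ada", 48, 7000),
    Gpu("H100 80GB", 80, 28000),     # placeholder: top of the quoted range
]
MULTI_GPU_EXTRAS_USD = 300  # risers and splitter cables, per the text
MAX_STACK = 5               # assumption: 6x stacks trigger a warning


def pick_gpu(required_gb, allow_multi_gpu=True):
    """Return (count, card name, total cost) for the cheapest fit, or None."""
    options = [(g.price_usd, 1, g) for g in SINGLE_CARDS
               if g.vram_gb >= required_gb]
    if allow_multi_gpu:
        base = SINGLE_CARDS[0]  # stack 3090s, the best dollar-per-GB card
        count = math.ceil(required_gb / base.vram_gb)
        if 2 <= count <= MAX_STACK:
            cost = count * base.price_usd + MULTI_GPU_EXTRAS_USD
            options.append((cost, count, base))
    if not options:
        return None
    cost, count, gpu = min(options, key=lambda option: option[0])
    return count, gpu.name, cost


print(pick_gpu(44.5))                         # (2, 'RTX 3090', 2000)
print(pick_gpu(44.5, allow_multi_gpu=False))  # (1, 'RTX 6000 Ada', 7000)
```

The sketch treats stacked VRAM as one pool, which is optimistic: each card carries its own runtime overhead, and the PCIe penalties described above are not priced in.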

Apple Silicon unified memory vs discrete GPU

Apple Silicon is the only sane way to put 128 GB of GPU-accessible memory in a single machine without datacenter parts. An M3 Max with 128 GB can hold a 70B model at Q8_0 (about 74 GB of weights), and an Ultra-class machine with enough memory above the 140 GB of FP16 weights can hold 70B at full precision, all in unified memory where the CPU and GPU share the same address space. The trade is throughput: Llama 7B Q4 hits roughly 100 tokens per second on a 4090, around 50 on an M3 Max, and 35 on an M3 Pro. Prompt processing (the time before the first token appears) is where Apple lags hardest, especially on long prompts.

The calculator routes to Apple Silicon when you pick that target, and ladders up through M3, M3 Pro, M3 Max, and M3 Ultra based on the unified memory you need. It leaves a 25 percent headroom because macOS reserves some of the unified pool for the OS and other processes - the 64 GB M3 Max only gives you about 48 GB of usable model memory under normal conditions. macOS exposes a sysctl setting (iogpu.wired_limit_mb) that raises the ceiling, but the calculator uses the out-of-the-box limit so the number it returns matches what your machine actually does.
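The headroom rule is simple enough to check by hand, but a short sketch makes the ladder explicit. The function names are hypothetical, and the list of memory sizes passed in is an illustrative input rather than an authoritative SKU list.

```python
HEADROOM_FRACTION = 0.25  # share of unified memory left for macOS, per the text


def usable_unified_gb(total_gb, headroom=HEADROOM_FRACTION):
    """Unified memory assumed available to the model after OS headroom."""
    return total_gb * (1 - headroom)


def smallest_fitting_config(required_gb, memory_options_gb):
    """Return the smallest memory size whose usable share fits, else None."""
    for total in sorted(memory_options_gb):
        if usable_unified_gb(total) >= required_gb:
            return total
    return None


print(usable_unified_gb(64))  # 48.0

# Illustrative memory ladder; substitute the sizes actually on sale.
print(smallest_fitting_config(47.2, [36, 48, 64, 96, 128]))  # 64
```

A 47.2 GB working set only just fits the 48 GB usable on a 64 GB machine, so in practice the next size up is the safer pick.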

Pick Apple Silicon for: interactive single-user, laptop portability, idle power, and anything above 64 GB without a server room. Pick a discrete GPU for: production serving, long-prompt latency, batch workloads, fine-tuning, and anything where you measure success in tokens per second per dollar.

DIY vs cloud: when breakeven flips

The breakeven number is the most useful single output of the calculator and also the one most likely to surprise people. If you run an RTX 4090 roughly 12 percent of the time - about three hours per day - the breakeven against a rented A6000 sits around 7-9 months. Below that usage, cloud wins; above it, DIY wins. The mistake people make is comparing peak rent to peak buy and forgetting that they will never actually run the cloud instance 24x7.
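A minimal sketch of the breakeven arithmetic, assuming it is hardware cost divided by the monthly cloud bill minus the monthly electricity bill, rounded up to whole months. The function name, the 30-day month, and the `hours_per_day` parameter are assumptions made here; the $0.13/kWh rate is the one quoted in the FAQ.

```python
import math


def breakeven_months(hardware_cost_usd, cloud_usd_per_month, system_watts,
                     hours_per_day=24.0, usd_per_kwh=0.13, days_per_month=30):
    """Months until the DIY build has paid for itself, or None if it never does."""
    kwh_per_month = system_watts / 1000 * hours_per_day * days_per_month
    electricity_usd = kwh_per_month * usd_per_kwh
    monthly_savings = cloud_usd_per_month - electricity_usd
    if monthly_savings <= 0:
        return None  # cloud is cheaper than the power bill alone
    return math.ceil(hardware_cost_usd / monthly_savings)


# Numbers from the example widget output: $8,600 build, $600/mo cloud, 450 W.
print(breakeven_months(8600, 600, 450))  # 16
```

Under these assumptions the sketch reproduces the 16-month figure shown in the example output. To model part-time use, pass the cloud bill you would actually pay for those hours as `cloud_usd_per_month` and set `hours_per_day` to match.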

Three real-world adjustments the tool does not capture: cards lose 30-40 percent of their resale value per year, a homelab rig has incidental costs (UPS, networking, ventilation) that add 5-10 percent to the headline number, and your time setting up and maintaining the rig is not free. Even with those adjustments, the breakeven advantage for self-hosted is significant if you run inference for more than three hours per day - which most agents, code assistants, and bulk batch jobs easily exceed.

The other axis the breakeven misses is privacy and offline access. If your prompts contain customer data, source code, or anything you cannot ship to a third-party API, self-hosted is not a cost decision, it is the only option. Run the calculator with your model and context length, see the hardware bill, and treat the cloud breakeven as a sanity check on the total cost of ownership rather than the headline number.

What real homelab builders say

I went dual 3090 for Mixtral expecting smooth sailing - the bottleneck turned out to be PCIe lane sharing. If your board only gives you x8/x8 you lose maybe 5 percent on inference. If it drops to x4/x4 because you have an NVMe in the wrong slot, you lose 20-30 percent. Check the board manual before you buy a second card.
- r/LocalLLaMA - "Dual GPU bottlenecks"

M3 Max 64 GB has been my daily for 6 months running Llama 3 70B at Q3_K_M. 8-12 tokens per second is plenty for chat. The killer feature is closing the lid - no GPU rig idles that quietly.
- r/LocalLLaMA - "Apple Silicon vs GPU"

I ran the numbers on cloud GPU vs buying a 3090 used. At my usage (about 4 hours per day prompt-heavy) the 3090 paid for itself in 5 months. The number people quote of "cloud is always cheaper" assumes you only run jobs occasionally.
- r/LocalLLaMA - "Cloud vs DIY economics"

Frequently asked questions

How does the self hosted AI hardware calculator size VRAM?
It adds three things: model weight memory (parameters times bytes per param for your quantization), KV cache (2 x num_layers x num_kv_heads x head_dim x context_length x 2 bytes, multiplied by concurrent users), and a small activation overhead that scales mildly with model size. For Llama 70B Q4_K_M at 8K with one user that is roughly 39 GB weights plus 2.5 GB KV plus 5.5 GB overhead.
Why does the calculator skip both the RTX 3090 and the RTX 4090 for Llama 70B Q4?
Q4_K_M uses per-block quantization with metadata that pushes the effective bit width to about 4.5, so 70B weights land near 39 GB. Once you add KV cache and overhead, the working set is above the 24 GB on both cards. The tool stays in the safe range where the model fully resides on GPU.
How accurate is the cloud breakeven months number?
It is a ballpark, not a quote. Cloud rate is the median always-on price for an equivalent VRAM instance on community spot markets (Lambda, RunPod, Vast). DIY cost is GPU plus a reasonable host build. Electricity at $0.13/kWh and full TDP slightly overstates real cost. Anything under 12 months strongly favors DIY; over 36 months favors cloud.
When should I pick Apple Silicon unified memory over a discrete GPU?
Apple Silicon wins on idle power, portability, and memory ceiling above 48 GB. An RTX 4090 wins on tokens-per-second throughput and prompt-processing speed. Pick Apple for interactive single-user; pick discrete GPU for batch or multi-user serving.
Does the calculator account for prompt processing memory separately?
The 2 GB plus 5 percent of params overhead covers activation memory during a single forward pass at typical context lengths. Prompt processing is compute-bound, not memory-bound; weights plus KV cache set the VRAM ceiling. For 64K+ tokens on a small model bump one GPU tier.
Can I run a 70B model on CPU only?
Technically yes with 64 GB RAM and a Q4 GGUF, but you get under one token per second. The calculator warns above 30B for CPU-only because the wall-clock latency is impractical for interactive work. Below 13B at Q4, CPU-only is fine for dev workflows.
How does concurrent users affect the VRAM number?
KV cache is per session, so concurrent users multiply it linearly. Ten users at 32K context on a 7B model with 8 KV heads is roughly 40 GB of KV cache alone. Paged-attention engines (vLLM, TensorRT-LLM) pack KV pages more tightly in production; the linear estimate here is the safe homelab ceiling.
Why do you recommend an 850 watt PSU when the GPU is only 450 watts?
CPU, RAM, board, drives, and fans add 100-200 W under load, and Ada-generation GPUs can briefly hit up to twice their TDP on transient spikes. A PSU sized at 1.3x sustained draw leaves headroom for those spikes. The tool rounds up to standard PSU sizes (550, 650, 750, 850, 1000, 1200, 1500) so you can buy a real product.
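The PSU rule can be sketched as a multiply-then-round-up lookup. The function name is hypothetical, and returning None above the largest listed size is an assumption about how an out-of-range build would be handled.

```python
import bisect

# Standard PSU sizes listed in the answer above, in watts.
STANDARD_PSU_WATTS = [550, 650, 750, 850, 1000, 1200, 1500]


def recommend_psu(sustained_watts, headroom=1.3):
    """Smallest standard PSU at or above sustained draw times headroom."""
    target = sustained_watts * headroom
    index = bisect.bisect_left(STANDARD_PSU_WATTS, target)
    if index == len(STANDARD_PSU_WATTS):
        return None  # beyond a single standard PSU; consider a dual-PSU build
    return STANDARD_PSU_WATTS[index]


# 450 W total system draw, as in the example widget output.
print(recommend_psu(450))        # 650

# A 450 W GPU plus an assumed 150 W of host load (midpoint of 100-200 W).
print(recommend_psu(450 + 150))  # 850
```

The second call shows where the 850 W recommendation in the question comes from: the 1.3x factor applies to the whole system, not to the GPU alone.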