How does the self hosted AI hardware calculator size VRAM?

It adds three terms: model weight memory (number of parameters multiplied by bytes per parameter for your chosen quantization), KV cache (2 times num_layers times num_kv_heads times head_dim times context_length times 2 bytes, multiplied by concurrent users), and an activation and CUDA workspace overhead that scales mildly with model size. For Llama 70B Q4_K_M at 8K context with one user, that is roughly 39 GB of weights plus 2.5 GB of KV cache plus 5.5 GB of overhead, about 47 GB in total, which is why the tool routes you to a 48 GB RTX 6000 Ada or above instead of trying to fit on a 24 GB RTX 4090.

Why does the calculator pick an RTX 6000 Ada for Llama 70B Q4 instead of an RTX 3090 or RTX 4090?

Q4_K_M is not a flat 4 bits per weight. It uses per-block quantization with metadata that pushes the effective bit width to about 4.5, so 70B parameters land near 39 GB just for weights. Once you add KV cache and CUDA overhead, the working set is well above the 24 GB that the 3090 and 4090 both have. People do run 70B Q4 on a single 24 GB card by aggressively trimming context and accepting that some weights spill to CPU, but the calculator stays in the safe range where the model fully resides on the GPU and inference runs at GPU memory bandwidth.

How accurate is the cloud breakeven months number?

It is a rough comparison, not a procurement quote. The cloud rate is the median always-on price for an equivalent VRAM-class instance on community spot markets (Lambda, RunPod, Vast) as of May 2026. DIY hardware cost is the GPU plus a reasonable host rig (CPU, board, RAM, PSU, case). Electricity is computed at $0.13 per kWh and the full TDP, which overstates costs slightly because real workloads do not draw full TDP 24x7. Treat the breakeven number as ballpark — anything under 12 months strongly favors DIY, anything over 36 favors cloud.
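The breakeven arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, and the 730-hour month and the rounding up to whole months are assumptions. With the sample figures shown in the calculator output on this page ($8,600 build, $600/mo cloud, 450 W draw) it lands on the same 16 months.

```python
import math

HOURS_PER_MONTH = 730  # assumption: 24 * 365 / 12, rounded


def monthly_electricity_usd(system_watts, usd_per_kwh=0.13):
    """Electricity cost of a rig that runs 24x7 at its full rated draw."""
    return system_watts / 1000 * HOURS_PER_MONTH * usd_per_kwh


def breakeven_months(diy_cost_usd, cloud_usd_per_month, system_watts, usd_per_kwh=0.13):
    """Whole months until the DIY build has paid for itself.

    Returns None when the cloud rate does not even cover the electricity bill,
    in which case the build never breaks even.
    """
    savings = cloud_usd_per_month - monthly_electricity_usd(system_watts, usd_per_kwh)
    if savings <= 0:
        return None
    return math.ceil(diy_cost_usd / savings)  # assumption: round up to full months


print(breakeven_months(8600, 600, 450))  # 16
```

Because electricity is charged at full TDP around the clock, the result is slightly pessimistic for DIY, which matches the caveat in the answer above.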

When should I pick Apple Silicon unified memory over a discrete GPU?

Apple Silicon wins on three axes: idle power (M3 Max sips ~10 watts at idle vs 50 watts for an RTX 4090), portability (M3 Max in a laptop runs a 13B model on battery), and memory ceiling per dollar above 48 GB (an M3 Ultra 192 GB is one machine, not a 4x 3090 build). The trade-off is throughput: an RTX 4090 puts out around 100 tokens per second on Llama 7B Q4 while an M3 Max sits closer to 35-50 tokens per second, and a 4090 wins on prompt-processing speed by 3-5x. If you are interactive single-user, Apple is great. If you batch or serve concurrent users, GPU wins.

Does the calculator account for prompt processing memory separately?

The 2 GB plus 5 percent of parameter count overhead term covers activation memory during a single forward pass — large enough for typical prompt-processing peaks at the chosen context length. Prompt processing is compute-bound, not memory-bound, so the VRAM ceiling is set by weights plus KV cache. If you intend to run extremely long prompts (above 64K tokens) on a small model the calculator may under-allocate by 1-2 GB; bump up one GPU tier in that case.

Can I run a 70B model on CPU only?

Technically yes: llama.cpp will happily load a 70B Q4_K_M GGUF (~40 GB) into 64 GB of system RAM and decode tokens. Practically you get under one token per second on a modern desktop CPU, which is unusable for interactive work. The calculator warns when you ask for CPU only above 30B for exactly this reason: the math fits, but the wall-clock latency is impractical. Below 13B with Q4 quantization, CPU-only is fine for single-user dev workflows.

How does concurrent users affect the VRAM number?

KV cache is per-session, so concurrent users multiply that term linearly. With a 7B-class model that uses grouped-query attention (32 layers, 8 KV heads, head_dim 128) at 32K context, one user is around 4 GB of KV cache; ten users is 40 GB. An older full multi-head design with 32 KV heads, such as Llama 2 7B, needs four times that. That is why the calculator pushes you off a single 4090 the moment you raise the slider above 4-5 concurrent users at long context: the model weights are shared but each user owns their own KV. Production deployments use paged-attention (vLLM, TensorRT-LLM) to share KV pages and pack the working set tighter, but for sizing a homelab the linear estimate is the safer ceiling.
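To see where the 4-5 user limit comes from, here is a rough sketch that inverts the sizing formula: given a card, how many sessions fit after weights and overhead are paid for? The function names are hypothetical, the model shape (32 layers, 8 KV heads, head_dim 128) is an assumed 7B-class grouped-query layout, and the 24 GB card is treated as 24 binary gigabytes for simplicity.

```python
import math

GIB = 1024 ** 3


def kv_gib_per_session(num_layers, num_kv_heads, head_dim, context_length, bytes_per_value=2):
    """KV cache for one session: 2 (K and V) x layers x KV heads x head_dim x context x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_value / GIB


def max_concurrent_users(vram_gib, weights_gib, overhead_gib, kv_per_session_gib):
    """Largest number of sessions whose KV caches fit next to the shared weights."""
    free = vram_gib - weights_gib - overhead_gib
    if free <= 0:
        return 0
    return math.floor(free / kv_per_session_gib)


kv = kv_gib_per_session(32, 8, 128, 32 * 1024)  # 4.0 GiB per session at 32K context
weights = 7e9 * 0.56 / GIB                      # 7B at Q4_K_M, about 3.65 GiB
overhead = 2.0 + 0.05 * 7                       # flat 2 GB plus 5 percent of params (in billions)

print(kv)                                                # 4.0
print(max_concurrent_users(24, weights, overhead, kv))   # 4
```

A paged-attention server would usually fit more sessions than this, since it allocates KV pages on demand instead of reserving the full context for every user.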

Why do you recommend an 850 watt PSU when the GPU is only 450 watts?

Two reasons. First, the CPU, RAM, board, drives, and fans add 100-200 watts under load, so a 4090 system pulls 550-650 watts at peak. Second, transient spikes on Ada-generation GPUs can hit twice their TDP for milliseconds; a PSU sized at 1.3x sustained load leaves enough headroom that its protection circuits do not trip during those spikes. The calculator rounds up to the next standard PSU size (550, 650, 750, 850, 1000, 1200, 1500, 1600) so you can buy a real product.
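The rounding rule is simple enough to sketch. This is an illustrative version under the stated 1.3x headroom, with a hypothetical function name; the list of sizes is the one quoted above.

```python
STANDARD_PSU_WATTS = [550, 650, 750, 850, 1000, 1200, 1500, 1600]


def recommend_psu(sustained_watts, headroom=1.3):
    """Smallest standard PSU that covers sustained draw times the headroom factor.

    Returns None when even the largest listed size is too small
    (a dual-PSU build would be needed).
    """
    target = sustained_watts * headroom
    for size in STANDARD_PSU_WATTS:
        if size >= target:
            return size
    return None


# 450 W GPU plus 200 W for the rest of the system: 650 W sustained, 845 W target.
print(recommend_psu(450 + 200))  # 850
```

The same rule applied to a 450 W total system draw gives 650 W, which matches the sample output shown in the calculator panel.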


Self Hosted AI Hardware Calculator

VRAM, GPU, power, and cloud breakeven months for local LLMs

Pick a model size, quantization, context length, and target. Get a real hardware recommendation in a single screen. No login, no sales calls, no fake numbers - the math runs in your browser using the same formulas you would write by hand from the llama.cpp and vLLM source notes.

Updated May 26, 2026 - By Jim Liu (Jim Liu has built homelab rigs for Llama, Mistral, and Qwen across consumer GPUs, Apple Silicon, and dual-3090 setups)

TL;DR - what the calculator gives you

A real VRAM number - model weights plus KV cache plus activation overhead, computed live in the browser using model-specific layer and head dimensions, not a rule of thumb.
A named GPU pick - RTX 3090, RTX 4090, RTX 6000 Ada, H100 80GB, or a multi-GPU stack of 3090s. Apple Silicon SKUs (M3 Pro, M3 Max, M3 Ultra) when you pick unified memory.
Power and PSU sizing - total system draw plus a 1.3x headroom rounded up to a real PSU size you can buy.
A cloud breakeven number - DIY hardware cost divided by monthly savings vs a rented GPU instance of similar VRAM. Useful for the "buy or rent" decision.
A warning when your config does not fit - never a silent downgrade. CPU-only above 30B, single-GPU above 48 GB, or 6x card stacks all surface a clear warning instead of a misleading number.

Self-hosted AI hardware calculator

Pick a model size, quantization, context length, and target. Get VRAM, system RAM, GPU pick, power draw, and cloud breakeven months. Live recompute, no Calculate button.

Model size

Quantization

Context length

Inference target

Concurrent users: 1 (slider marks at 1, 10, 25, 50)

For Llama 70B Q4 you need an RTX 6000 Ada. DIY breaks even in 16 months vs renting cloud GPU.

Minimum VRAM: 44.5 GB (weights 36.5 + KV 2.5 + overhead)
Recommended hardware: RTX 6000 Ada (48 GB VRAM per card)
System RAM: 67 GB (1.5x VRAM for model load + OS)
Power draw: 450 W (recommended PSU: 650 W, 80+ Gold)
DIY hardware cost: $8,600 (GPU + host build: CPU, board, RAM, PSU, case)
Cloud breakeven: 16 months (equivalent cloud GPU: $600/mo)

Curated configurations

Presets are real-world configurations from r/LocalLLaMA homelab posts (May 2026 reference). GPU prices, Apple Silicon SKUs, and cloud rates are reviewed quarterly.

Next step

Have a target hardware in mind? Compare the inference quality of the models you can fit on it, then decide whether self-hosting beats a hosted API for your workload.

Compare models that fit your VRAM budget
Check API token cost for the same workload

Why use a self hosted AI hardware calculator

The popular rule of thumb "you need 2 bytes per parameter at FP16" is correct as far as it goes - a 7B FP16 model weighs about 14 GB. What that rule misses is everything else that lives in VRAM: the KV cache that grows linearly with context length and concurrent users, the activation memory during prompt processing, and the CUDA workspace the runtime carves out. By the time you load a 70B Q4 model with a 32K context and serve five concurrent sessions, a weights-only estimate is short by roughly 55 GB: about 10 GB of KV cache per session plus 5.5 GB of overhead. People who size from weights alone order a card that looks big enough, find that the model OOMs the first time anyone sends a long prompt, and end up reselling it.

The calculator above does the full math. Model weights use the real bytes-per-parameter for the quantization scheme you picked (Q4_K_M includes its per-block metadata, FP16 does not). KV cache uses the actual num_layers and num_kv_heads for that model family, not a generic average. Concurrent users multiply the KV term. Overhead scales with parameter count. The number it returns is the working set you actually need to fit on the card, not the model file size.

The same logic applies in reverse to cloud breakeven. A common mistake is comparing the sticker price of an RTX 4090 to the hourly cost of an A100 instance and concluding cloud is always cheaper. In practice the 4090 lasts three years, draws power at most eight hours a day for a hobbyist, and the cloud A100 you would actually rent is closer to 600 dollars per month always-on. Run the numbers in the tool and the breakeven for a 4090 is typically 4-8 months at moderate use. That is the answer that lets you make a real buy-or-rent decision.

How VRAM is sized: the three terms

Self-hosted inference VRAM is the sum of three terms. The largest is model weights: number of parameters times bytes per parameter. For a Q4_K_M GGUF that is roughly 0.56 bytes per param (4 bits plus per-block scale and zero-point metadata), so 70 billion params land near 39 GB. For FP16 that is exactly 2 bytes per param, so the same 70B model is 140 GB - which is why nobody runs 70B at FP16 outside a datacenter.

The second term is the KV cache. Every forward pass writes a key and a value vector for each token, in every attention layer, for every KV head. The memory needed is 2 (K and V) x num_layers x num_kv_heads x head_dim x context_length x 2 bytes. Many modern models use grouped-query attention with 8 KV heads regardless of total head count, which keeps the cache surprisingly small - Llama 70B at 8K context is only about 2.5 GB of KV per session. Multi-session deployments multiply this term by concurrent users, which is where multi-user servers blow past the "just buy a 4090" answer.
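As a quick check of the formula, here is a minimal sketch with a hypothetical function name. The Llama 70B shape (80 layers, 8 KV heads, head_dim 128) is the one used in the worked example on this page, and the 2-byte element size assumes an FP16 cache.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   bytes_per_value=2, concurrent_users=1):
    """2 (K and V) x layers x KV heads x head_dim x context x bytes, times sessions."""
    per_session = 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_value
    return per_session * concurrent_users


one_user = kv_cache_bytes(80, 8, 128, 8192)
four_users = kv_cache_bytes(80, 8, 128, 8192, concurrent_users=4)

print(one_user / GIB)    # 2.5
print(four_users / GIB)  # 10.0
```

Note that the 2.5 figure is in binary gigabytes; in decimal units the same cache is about 2.7 GB.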

The third term is activation and CUDA workspace overhead. During prompt processing, the model materializes intermediate tensors that vanish at the end of the forward pass. The calculator models this as a flat 2 GB plus 5 percent of the parameter count in billions, read as GB (3.5 GB for a 70B model) - small for a 7B, meaningful for a 70B. This is the term most rule-of-thumb estimates skip, and it is the term that turns a "close to fit" configuration into an OOM at runtime.

Worked example: Llama 70B Q4_K_M, 8K context, 1 user

Weights: 70 x 1e9 x 0.56 bytes = ~39.2 GB
KV cache: 2 x 80 layers x 8 KV heads x 128 head_dim x 8192 ctx x 2 bytes = ~2.5 GB
Overhead: 2 GB + 5% of 70 = ~5.5 GB
Total: ~47.2 GB - just under the 48 GB of an RTX 6000 Ada, and comfortably below an 80 GB H100 or A100 80GB.
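The three terms can be combined in a short sketch. The function name and the choice of units are assumptions: converting weights and KV cache to binary gigabytes reproduces the 44.5 GB (weights 36.5 + KV 2.5 + overhead) shown in the calculator output on this page, while the 39.2 GB in the worked example is the same weight figure in decimal gigabytes. The real tool may round differently.

```python
GIB = 1024 ** 3


def estimate_vram_gib(params_billion, bytes_per_param, num_layers, num_kv_heads,
                      head_dim, context_length, concurrent_users=1):
    """Weights + KV cache + overhead, each expressed in binary gigabytes."""
    weights = params_billion * 1e9 * bytes_per_param / GIB
    kv = (2 * num_layers * num_kv_heads * head_dim * context_length * 2
          * concurrent_users) / GIB
    overhead = 2.0 + 0.05 * params_billion  # flat 2 GB plus 5 percent of params (in billions)
    return {"weights": weights, "kv": kv, "overhead": overhead,
            "total": weights + kv + overhead}


# Llama 70B, Q4_K_M (0.56 bytes per param), 8K context, one user.
est = estimate_vram_gib(70, 0.56, 80, 8, 128, 8192)
print({name: round(value, 1) for name, value in est.items()})
# {'weights': 36.5, 'kv': 2.5, 'overhead': 5.5, 'total': 44.5}
```

Either way of counting leaves the configuration inside a 48 GB card and far outside a 24 GB one.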

Quantization: FP16 vs INT8 vs Q4_K_M

Scheme	Bytes per param	7B file size	Quality loss vs FP16	Typical use
FP16	2.00	~14 GB	baseline	Datacenter, full quality
INT8	1.00	~7 GB	negligible	Production single-GPU, low risk
Q8_0	1.06	~7.4 GB	almost none	llama.cpp near-lossless
Q5_K_M	0.70	~4.9 GB	small (1-2% on benchmarks)	llama.cpp balanced default
Q4_K_M	0.56	~3.9 GB	moderate (2-4% on benchmarks)	llama.cpp tight fit
INT4	0.50	~3.5 GB	noticeable on reasoning	Aggressive consumer fit

Q4_K_M and Q5_K_M are GGUF formats from llama.cpp that use mixed precision within each tensor block - the "K" means K-quants, "M" means medium-size variant. The metadata overhead is small but real, which is why the calculator uses effective bytes per param (0.56 for Q4_K_M, 0.70 for Q5_K_M) rather than the nominal 0.5 / 0.625. The honest answer for "which quantization to pick" is: start at Q5_K_M, drop to Q4_K_M only if you cannot fit, and never go below Q4 for reasoning-heavy tasks.
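The table and the "start at Q5_K_M, drop to Q4_K_M" advice translate into a small lookup. This is a sketch with hypothetical names; the bytes-per-param values are copied from the table above, and the budget is the memory left for weights after KV cache and overhead, in decimal GB.

```python
# Effective bytes per parameter, as listed in the table above.
BYTES_PER_PARAM = {
    "FP16": 2.00, "INT8": 1.00, "Q8_0": 1.06,
    "Q5_K_M": 0.70, "Q4_K_M": 0.56, "INT4": 0.50,
}


def weights_gb(params_billion, scheme):
    """Weight size in decimal GB: billions of params times bytes per param."""
    return params_billion * BYTES_PER_PARAM[scheme]


def pick_quant(params_billion, weight_budget_gb, preference=("Q5_K_M", "Q4_K_M")):
    """First scheme in the preference order whose weights fit the budget, else None."""
    for scheme in preference:
        if weights_gb(params_billion, scheme) <= weight_budget_gb:
            return scheme
    return None


print(pick_quant(70, 60))  # Q5_K_M
print(pick_quant(70, 45))  # Q4_K_M
print(pick_quant(70, 24))  # None
```

Returning None rather than falling through to INT4 mirrors the advice not to go below Q4 for reasoning-heavy tasks.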

GPU selection: 3090, 4090, RTX 6000, H100

The calculator picks consumer cards first because that is where the price-per-VRAM curve is most generous. An RTX 3090 24 GB at street price (around 850 USD used in May 2026) is still the best dollar-per-GB GPU you can put in a consumer build. The 4090 jumps you to roughly 2.2x the cost for the same 24 GB but with 60-70 percent more tokens per second on most workloads, so the right way to choose between them is throughput-bound vs budget-bound, not VRAM-bound.

Above 24 GB the consumer market goes thin: the RTX 5090 ships with 32 GB, the RTX 6000 Ada has 48 GB at prosumer pricing (about 7,000 USD), and beyond that you are in datacenter SKUs - A100 40/80 GB, H100 80 GB - at 9,000 to 28,000 USD per card. The calculator only routes you to datacenter cards when single-GPU fit demands more than 48 GB. Below that ceiling, the answer is almost always "stack two 3090s on a board with x8/x8 PCIe lanes" for a fraction of the price of one A100.
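A selection ladder along these lines might look like the sketch below. The ladder contents and function names are illustrative assumptions, not the tool's actual table; the VRAM sizes are the ones quoted in this section, and the stacking helper ignores the per-card overhead a real multi-GPU split would add.

```python
import math

# Hypothetical single-card ladder, smallest VRAM first.
SINGLE_GPU_LADDER = [("RTX 3090", 24), ("RTX 6000 Ada", 48), ("H100 80GB", 80)]


def pick_single_gpu(required_gb):
    """Smallest card on the ladder that holds the whole working set, else None."""
    for name, vram_gb in SINGLE_GPU_LADDER:
        if required_gb <= vram_gb:
            return name
    return None


def stacked_3090s(required_gb, vram_per_card_gb=24):
    """How many 24 GB cards are needed if the working set is split evenly."""
    return math.ceil(required_gb / vram_per_card_gb)


print(pick_single_gpu(44.5))  # RTX 6000 Ada
print(stacked_3090s(44.5))    # 2
```

The two answers for the same 44.5 GB working set are exactly the trade described above: one prosumer card or two used 3090s.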

Multi-GPU has two non-obvious costs. The first is PCIe lane sharing: if your motherboard splits to x4/x4 because of an NVMe in the wrong slot, inference can lose 20-30 percent of its throughput compared to x8/x8. The second is power and case clearance: two 350 W 3090s need either a 1200 W PSU and an open-frame case or a dual-PSU build. The calculator factors in the extra 300 USD of riser cards and splitter cables that a real multi-GPU build needs, which is the cost most YouTube build-logs gloss over.

Apple Silicon unified memory vs discrete GPU

Apple Silicon is the only sane way to put 128 GB of GPU-accessible memory in a single machine without datacenter parts. An M3 Max 128 GB or M3 Ultra 192 GB lets you load a 70B model at Q8_0 (about 74 GB) into unified memory, where the CPU and GPU share the same address space; 70B at FP16 (140 GB) does not fit in 128 GB and is a very tight fit even in 192 GB. The trade is throughput: Llama 7B Q4 hits roughly 100 tokens per second on a 4090, around 50 on an M3 Max, and 35 on an M3 Pro. Prompt processing (the time before the first token appears) is where Apple lags hardest, especially on long prompts.

The calculator routes to Apple Silicon when you pick that target, and ladders up through M3, M3 Pro, M3 Max, and M3 Ultra based on the unified memory you need. It leaves a 25 percent headroom because macOS reserves some of the unified pool for the OS and other processes - the 64 GB M3 Max only gives you about 48 GB of usable model memory under normal conditions. macOS exposes a sysctl setting (iogpu.wired_limit_mb) that raises the ceiling, but the calculator uses the out-of-the-box limit so the number it returns matches what your machine actually does.
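The headroom rule can be sketched as follows. The function names are hypothetical and the list of memory sizes is only the three configurations mentioned in this section (64, 128, 192 GB), not a full SKU table.

```python
def usable_unified_gb(total_gb, headroom=0.25):
    """Memory left for the model after reserving headroom for macOS and other processes."""
    return total_gb * (1 - headroom)


def smallest_config_gb(required_gb, configs_gb=(64, 128, 192)):
    """Smallest unified-memory size whose usable share covers the requirement, else None."""
    for total in sorted(configs_gb):
        if usable_unified_gb(total) >= required_gb:
            return total
    return None


print(usable_unified_gb(64))     # 48.0
print(smallest_config_gb(44.5))  # 64
print(smallest_config_gb(100))   # 192
```

The jump from 64 straight to 192 for a 100 GB requirement shows why the headroom matters: a 128 GB machine only offers about 96 GB to the model under this rule.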

Pick Apple Silicon for: interactive single-user, laptop portability, idle power, and anything above 64 GB without a server room. Pick a discrete GPU for: production serving, long-prompt latency, batch workloads, fine-tuning, and anything where you measure success in tokens per second per dollar.

DIY vs cloud: when breakeven flips

The breakeven number is the most useful single output of the calculator and also the one most likely to surprise people. If you run an RTX 4090 about ten percent of the time - call it two to three hours per day - the breakeven against a rented A6000 sits around 7-9 months. Below that usage, cloud wins; above it, DIY wins. The mistake people make is comparing peak rent to peak buy and forgetting that they will never actually run the cloud instance 24x7.

Three real-world adjustments the tool does not capture: cards lose 30-40 percent of their resale value per year, a homelab rig has incidental costs (UPS, networking, ventilation) that add 5-10 percent to the headline number, and your time setting up and maintaining the rig is not free. Even with those adjustments, the breakeven advantage for self-hosted is significant if you run inference for more than three hours per day - which most agents, code assistants, and bulk batch jobs easily exceed.

The other axis the breakeven misses is privacy and offline access. If your prompts contain customer data, source code, or anything you cannot ship to a third-party API, self-hosted is not a cost decision, it is the only option. Run the calculator with your model and context length, see the hardware bill, and treat the cloud breakeven as a sanity check on the total cost of ownership rather than the headline number.

What real homelab builders say

“I went dual 3090 for Mixtral expecting smooth sailing - the bottleneck turned out to be PCIe lane sharing. If your board only gives you x8/x8 you lose maybe 5 percent on inference. If it drops to x4/x4 because you have an NVMe in the wrong slot, you lose 20-30 percent. Check the board manual before you buy a second card.”

- r/LocalLLaMA - "Dual GPU bottlenecks"

“M3 Max 64 GB has been my daily for 6 months running Llama 3 70B at Q3_K_M. 8-12 tokens per second is plenty for chat. The killer feature is closing the lid - no GPU rig idles that quietly.”

- r/LocalLLaMA - "Apple Silicon vs GPU"

“I ran the numbers on cloud GPU vs buying a 3090 used. At my usage (about 4 hours per day prompt-heavy) the 3090 paid for itself in 5 months. The number people quote of "cloud is always cheaper" assumes you only run jobs occasionally.”

- r/LocalLLaMA - "Cloud vs DIY economics"

Frequently asked questions

How does the self hosted AI hardware calculator size VRAM?: It adds three terms: model weight memory (parameters times bytes per param for your quantization), KV cache (2 x num_layers x num_kv_heads x head_dim x context_length x 2 bytes, multiplied by concurrent users), and an activation overhead that scales mildly with model size. For Llama 70B Q4_K_M at 8K with one user that is roughly 39 GB weights plus 2.5 GB KV plus 5.5 GB overhead.
Why does the calculator pick an RTX 6000 Ada for Llama 70B Q4 instead of an RTX 3090 or RTX 4090?: Q4_K_M uses per-block quantization with metadata that pushes the effective bit width to about 4.5, so 70B weights land near 39 GB. Once you add KV cache and overhead, the working set is well above the 24 GB on both consumer cards. The tool stays in the safe range where the model fully resides on GPU.
How accurate is the cloud breakeven months number?: It is a ballpark, not a quote. Cloud rate is the median always-on price for an equivalent VRAM instance on community spot markets (Lambda, RunPod, Vast). DIY cost is GPU plus a reasonable host build. Electricity at $0.13/kWh and full TDP slightly overstates real cost. Anything under 12 months strongly favors DIY; over 36 months favors cloud.
When should I pick Apple Silicon unified memory over a discrete GPU?: Apple Silicon wins on idle power, portability, and memory ceiling above 48 GB. An RTX 4090 wins on tokens-per-second throughput and prompt-processing speed. Pick Apple for interactive single-user; pick discrete GPU for batch or multi-user serving.
Does the calculator account for prompt processing memory separately?: The 2 GB plus 5 percent of params overhead covers activation memory during a single forward pass at typical context lengths. Prompt processing is compute-bound, not memory-bound; weights plus KV cache set the VRAM ceiling. For 64K+ tokens on a small model bump one GPU tier.
Can I run a 70B model on CPU only?: Technically yes with 64 GB RAM and a Q4 GGUF, but you get under one token per second. The calculator warns above 30B for CPU-only because the wall-clock latency is impractical for interactive work. Below 13B at Q4, CPU-only is fine for dev workflows.
How does concurrent users affect the VRAM number?: KV cache is per session, so concurrent users multiply it linearly. Ten users at 32K context on a 7B model is roughly 40 GB of KV cache alone. Paged-attention engines (vLLM, TensorRT-LLM) share KV pages tighter in production; the linear estimate here is the safe homelab ceiling.
Why do you recommend an 850 watt PSU when the GPU is only 450 watts?: CPU, RAM, board, drives, and fans add 100-200 W under load, and Ada-generation GPUs can hit twice their TDP on transient spikes. A PSU sized at 1.3x sustained load leaves headroom for spikes. The tool rounds up to standard PSU sizes (550, 650, 750, 850, 1000, 1200, 1500, 1600) so you can buy a real product.

Related tools and guides on OpenAI Tools Hub

AI Model Comparison Tool

Pick the right model for the VRAM budget the calculator gave you - context window, tool-use quality, and license.

LLM API Token Cost Calculator

Compare a hosted API bill to your DIY hardware breakeven for the same workload.

Vector DB Cost Calculator

Size embeddings memory and disk for the RAG layer that sits in front of your local LLM.

MCP Server Boilerplate Generator

Wrap your local inference server as an MCP tool Claude Code can call directly.

Claude Code MCP Config Generator

Generate the ~/.claude.json entry that wires your self-hosted server into Claude Code.

Claude Code Memory: Large Codebases

Why long-context workloads multiply your VRAM bill and how to manage them on a fixed hardware budget.