Last update 05 Apr 2025
This is a short note for myself: an estimate of the VRAM requirements for different LLM sizes and of their capabilities.
Model Size | Key Characteristics | Suitable Task Categories | Comments | VRAM Needed (Q4) | VRAM Needed (FP16) |
---|---|---|---|---|---|
7B | Basic reasoning, fluent text, short context memory | Simple Q&A, rewriting/paraphrasing, simple summarization, surface-level chatbots | Lightweight, but lacks nuance and depth. Often verbose or overly generic. | 4-6 GB | 14 GB |
13B–15B | Improved instruction following, slightly better logical consistency | Slightly more reliable summarization, code completion (basic), chain-of-thought prompts (if guided), agent tools (basic routing) | Better, but still weak at multi-step tasks and reasoning under uncertainty. | 8-10 GB | 26 GB |
30–34B | First strong “reasoning class”: deep chain-of-thought, better long-context attention | Multi-hop reasoning, code evaluation (unit tests, logic correctness), multi-agent pipelines, embedded tool chains, long document synthesis | Huge jump in coherence and problem-solving. Best value tier for deep hobby work. | 18-22 GB | 60-70 GB |
65–70B | Starts to show emergent properties: can imitate human-style discourse, deeper knowledge | Philosophical or ethical reasoning, legal/academic summarization, emotional intelligence modeling, writing fiction with rich internal logic, world modeling | Moves from “clever assistant” to “primitive theorist.” Context anchoring improves. | 36-42 GB | 130-140 GB |
130B | Full LLM “agent” capability emerges: deeper memory, robust long-range logic | Expert tutoring, interdisciplinary synthesis (e.g. philosophy + math), semantic reasoning with abstraction, goal-directed planning (with external tools), sparse memory/agent frameworks | Comparable to early GPT-3.5. Major qualitative change in planning and abstraction. | 72-84 GB | 260-280 GB |
300–350B | Rare insights, model-based hypothesis formation, coherent across very long chains (>5 steps) | Code-interpreter-level logic, simulating scientific debate, designing systems from specs, research assistant with hypothesis generation, autonomous agent coordination | Entry point to real autonomy. Can self-reflect in structured prompts. | 150-190 GB | 600 GB |
500–700B | “Soft AGI-like” tier: stable under pressure, few-shot mastery, dense knowledge encoding | Emulation of expert-level thought processes, emergent creative problem solving, implicit model of theory of mind, conscious simulation of user goals, meta-level reasoning (e.g. explain why a chain failed) | This is where GPT-4, Claude Opus, and Gemini Ultra operate. Expensive, but vastly more reliable and nuanced. You can build actual thinking systems here. | 250-320 GB | 1-1.2 TB |
Here’s the rough calculation logic that has been used: at FP16 a model needs about #params × 2 bytes of VRAM, so for example a 7B model needs roughly 14 GB. Q4 quantization cuts that to roughly a quarter (about 0.5 bytes per parameter) plus some overhead; Q8 needs ~2× more VRAM than Q4, Q2 needs ~50% less than Q4, but may reduce reasoning quality significantly.
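A minimal sketch of this rule of thumb. The bytes-per-parameter values for Q8/Q4/Q2 are the usual rough figures and an assumption here; KV cache and activation buffers are ignored and come on top of the pure weight size (which is roughly what separates the 3.5 GB of weights of a 7B Q4 model from the 4-6 GB listed above):

```python
# Rough VRAM estimate for the model weights alone; KV cache and
# activation buffers are not included and come on top.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q2": 0.25}

def weight_vram_gb(params_billion: float, quant: str = "Q4") -> float:
    """Estimated weight footprint in GB for a dense model of the given size."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1e9

if __name__ == "__main__":
    for size in (7, 13, 34, 70, 130, 350, 700):
        print(f"{size:>4}B  Q4 ≈ {weight_vram_gb(size, 'Q4'):6.1f} GB   "
              f"FP16 ≈ {weight_vram_gb(size, 'FP16'):7.1f} GB")
```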
Emergent Ability | Model Size Where It Appears | Notes |
---|---|---|
Coherent Chain-of-Thought | ~13–30B | Very fragile below 13B, solidifies around 30B |
Tool Use & Code Execution | ~30B+ | 13B can mimic syntax, but 30B understands intent |
Abstraction & Analogy | 65–130B | Generalizes beyond examples, recognizes patterns |
Planning Across Time | ~130–300B | Keeps task goals across long dialogue |
Meta-reasoning / Self-critiquing | ~500B+ | Explains its own reasoning, debugs itself |
Theory of Mind (ToM-lite) | ~500–700B | Starts modeling human beliefs/intentions |
As shown previously, one can utilize more cards even on very small-scale systems by using PCIe port expanders.
Note: All links provided below are Amazon affiliate links; this page's author profits from qualified purchases.
Graphics card | VRAM | What is possible? |
---|---|---|
RTX 3060 12GB | 12 GB per card | Can easily build a 24 GB VRAM system with two cards. A single card is capable of running image generators like SDXL or smaller models up to 13B in real time, and 32B models under some circumstances with reduced speed. 70B models are realistically unusable with a single card. Price is low enough for hobby applications. |
RTX 3090 Ti 24GB | 24 GB per card | Easily build a 48 GB VRAM system with two cards, capable of executing 32B models without trouble; under some circumstances 70B models work too. Also often used for SDXL. Price is still in reach for personal and hobbyist applications. |
A100 | 40 GB per card | Easily build clusters with 80-160 GB VRAM using multiple cards, though the price is usually prohibitive for hobby applications. Such systems are often used professionally. Machines with more than four such cards usually drive systems like the commonly known GPTs that are currently widely available as cloud services. |
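As a rough aid when comparing these setups, the Q4 rule of thumb from above can be inverted to estimate the largest dense model a given total VRAM budget can hold. This is a sketch under the same assumptions as before; the 0.5 bytes per parameter and the 20% headroom reserved for KV cache and context are assumptions, not measured values:

```python
# Largest dense Q4 model (in billions of parameters) that fits into a
# given total VRAM budget, reserving some headroom for KV cache and
# activations. Bytes/param and headroom are rule-of-thumb assumptions.
Q4_BYTES_PER_PARAM = 0.5
HEADROOM = 0.8  # use only ~80% of VRAM for weights

def max_q4_params_billion(total_vram_gb: float) -> float:
    usable_bytes = total_vram_gb * 1e9 * HEADROOM
    return usable_bytes / Q4_BYTES_PER_PARAM / 1e9

if __name__ == "__main__":
    for name, vram in (("1x RTX 3060", 12), ("2x RTX 3060", 24),
                       ("2x RTX 3090 Ti", 48), ("4x A100 40GB", 160)):
        print(f"{name:<16} ~{max_q4_params_billion(vram):5.0f}B parameters at Q4")
```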
First off - CPU inference is usually prohibitive for LLMs. In the following rough estimate some major assumptions are made: the models are Q4-quantized and run on CPU-only backends (llama.cpp, ggml, or gguf-based backends). Under those assumptions the estimated inference speed is shown in the following table:
Model | Size (Q4) | CPU Inference Speed |
---|---|---|
LLaMA 13B | ~7GB | 2–8 tokens/sec |
LLaMA 65B | ~35GB | ~0.3–0.8 tokens/sec |
Custom 175B | ~87GB | ~0.1 tokens/sec |
GPT-3 175B | ~87GB | ~0.1 tokens/sec |
500B Q4 | ~250GB | 0.02–0.3 tokens/sec |
700B Q4 | ~350GB | 0.01–0.1 tokens/sec |
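These figures are roughly consistent with CPU token generation being memory-bandwidth bound: each generated token requires streaming essentially all weights from RAM, so tokens/sec is at most the effective RAM bandwidth divided by the model size. A minimal sketch of that upper estimate; the bandwidth values are assumptions for typical dual-channel desktop systems, not measurements, and real throughput is usually lower:

```python
# Bandwidth-bound upper estimate for CPU token generation:
# every token touches (roughly) all weights once, so
#   tokens/sec ≈ effective RAM bandwidth / model size.
# Compute and cache effects push real numbers below this bound.
def tokens_per_sec(model_size_gb: float, ram_bandwidth_gb_s: float) -> float:
    return ram_bandwidth_gb_s / model_size_gb

if __name__ == "__main__":
    # Assumed effective bandwidths (not measured):
    # dual-channel DDR4 ~30 GB/s, dual-channel DDR5 ~60 GB/s.
    for name, size_gb in (("LLaMA 13B Q4", 7), ("LLaMA 65B Q4", 35), ("175B Q4", 87)):
        ddr4 = tokens_per_sec(size_gb, 30)
        ddr5 = tokens_per_sec(size_gb, 60)
        print(f"{name:<14} ≤ {ddr4:4.1f} tok/s (DDR4)   ≤ {ddr5:4.1f} tok/s (DDR5)")
```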
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/