GPU size estimation for LLMs

- tsp
Last update 05 Apr 2025
Reading time 6 mins

This is a short note for myself. It is an estimation of the VRAM requirements for different LLM sizes and their capabilities.

LLM Capability by Model Size

| Model Size | Key Characteristics | Suitable Task Categories | Comments | VRAM Needed (Q4) | VRAM Needed (FP16) |
|---|---|---|---|---|---|
| 7B | Basic reasoning, fluent text, short context memory | Simple Q&A; Rewriting/paraphrasing; Simple summarization; Surface-level chatbots | Lightweight, but lacks nuance and depth. Often verbose or overly generic. | 4-6 GB | 14 GB |
| 13B–15B | Improved instruction following, slightly better logical consistency | Slightly more reliable summarization; Code completion (basic); Chain-of-thought prompts (if guided); Agent tools (basic routing) | Better, but still weak at multi-step tasks and reasoning under uncertainty. | 8-10 GB | 26 GB |
| 30–34B | First strong “reasoning class”; deep chain-of-thought, better long-context attention | Multi-hop reasoning; Code evaluation (unit tests, logic correctness); Multi-agent pipelines; Embedded tool chains; Long document synthesis | Huge jump in coherence and problem-solving. Best value tier for deep hobby work. | 18-22 GB | 60-70 GB |
| 65–70B | Starts to show emergent properties; can imitate human-style discourse, deeper knowledge | Philosophical or ethical reasoning; Legal/academic summarization; Emotional intelligence modeling; Writing fiction with rich internal logic; World modeling | Moves from “clever assistant” to “primitive theorist.” Context anchoring improves. | 36-42 GB | 130-140 GB |
| 130B | Full LLM “agent” capability emerges; deeper memory, robust long-range logic | Expert tutoring; Interdisciplinary synthesis (e.g. philosophy + math); Semantic reasoning with abstraction; Goal-directed planning (with external tools); Sparse memory/agent frameworks | Comparable to early GPT-3.5. Major qualitative change in planning, abstraction. | 72-84 GB | 260-280 GB |
| 300–350B | Rare insights, model-based hypothesis formation; coherent across very long chains (>5 steps) | Code interpreter level logic; Simulating scientific debate; Designing systems from specs; Research assistant with hypothesis generation; Autonomous agent coordination | Entry point to real autonomy. Can self-reflect in structured prompts. | 150-190 GB | 600 GB |
| 500–700B | “Soft AGI-like” tier; stable under pressure, few-shot mastery, dense knowledge encoding | Emulation of expert-level thought processes; Emergent creative problem solving; Implicit model of theory of mind; Conscious simulation of user goals; Meta-level reasoning (e.g. explain why a chain failed) | This is where GPT-4, Claude Opus, and Gemini Ultra operate. Expensive, but vastly more reliable and nuanced. You can build actual thinking systems here. | 250-320 GB | 1-1.2 TB |

Breakdown of VRAM Estimation

Here’s the rough calculation logic that has been used: take the parameter count, multiply by the bytes per parameter (2 bytes for FP16, roughly 0.5 bytes for Q4), and add some overhead (on the order of 10-30%) for the KV cache, activations, and runtime buffers.

So for example: a 7B model needs about 7 × 2 = 14 GB in FP16, and about 7 × 0.5 = 3.5 GB for the Q4 weights alone, which lands at roughly 4-6 GB once overhead is included.
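
Expressed as code, a minimal sketch of that estimate could look as follows. The bytes-per-parameter values and the flat 20% overhead factor are assumptions chosen to roughly reproduce the table above, not exact figures.

```python
# Minimal sketch of the VRAM estimate, assuming dense models,
# ~0.5 bytes/parameter for Q4, 2 bytes/parameter for FP16,
# and ~20% overhead for KV cache, activations and runtime buffers.

BYTES_PER_PARAM = {"q4": 0.5, "fp16": 2.0}
OVERHEAD = 1.2  # assumed 20% on top of the raw weights

def estimate_vram_gb(params_billion: float, precision: str = "q4") -> float:
    """Return a rough VRAM estimate in GB for a dense model."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * OVERHEAD

if __name__ == "__main__":
    for size in (7, 13, 34, 70, 130):
        print(f"{size}B: Q4 ~ {estimate_vram_gb(size, 'q4'):.0f} GB, "
              f"FP16 ~ {estimate_vram_gb(size, 'fp16'):.0f} GB")
```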

⚠️ Notes on VRAM Limits

Observed Emergent Thresholds (Based on Literature)

| Emergent Ability | Model Size Where It Appears | Notes |
|---|---|---|
| Coherent Chain-of-Thought | ~13–30B | Very fragile below 13B, solidifies around 30B |
| Tool Use & Code Execution | ~30B+ | 13B can mimic syntax, but 30B understands intent |
| Abstraction & Analogy | 65–130B | Generalizes beyond examples, recognizes patterns |
| Planning Across Time | ~130–300B | Keeps task goals across long dialogue |
| Meta-reasoning / Self-critiquing | ~500B+ | Explains its own reasoning, debugs itself |
| Theory of Mind (ToM-lite) | ~500–700B | Starts modeling human beliefs/intentions |

Strategy Tips for Use by Size

Commonly available cards for different model sizes

As shown previously, one can use multiple cards even on very small-scale systems by using PCIe port expanders.

Note: All links provided below are Amazon affiliate links; this page's author profits from qualifying purchases.

| Graphics card | VRAM | What is possible? |
|---|---|---|
| RTX 3060 12GB | 12 GB per card | One can easily build a 24 GB VRAM system with two cards. A single card can run image generators like SDXL or smaller models up to 13B in real time, and 32B models under some circumstances at reduced speed; 70B models are realistically unusable on a single card. The price is low enough for hobby applications. |
| RTX 3090 Ti 24GB | 24 GB per card | Two cards easily give a 48 GB VRAM system that runs 32B models without trouble, and under some circumstances 70B models work too. Also often used for SDXL. The price is still within reach for personal and hobbyist applications. |
| A100 | 40 GB per card | Multiple cards easily build clusters with 80-160 GB VRAM; the price is usually prohibitive for hobby applications. Such systems are often used professionally. Machines with more than four of these cards typically drive systems like the commonly known GPTs that are currently widely available as cloud services. |
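
As a small, hypothetical helper (not part of the original note), the per-card VRAM figures above can be combined with the earlier estimate to get a lower bound on how many cards a given model needs. It ignores multi-GPU overhead such as KV-cache replication and uneven layer splits.

```python
import math

# Hypothetical helper: how many cards of a given size a model needs.
# Ignores multi-GPU overhead, so treat the result as a lower bound.

def cards_needed(params_billion: float, bytes_per_param: float,
                 card_vram_gb: float, overhead: float = 1.2) -> int:
    required_gb = params_billion * bytes_per_param * overhead
    return math.ceil(required_gb / card_vram_gb)

if __name__ == "__main__":
    # e.g. a 70B model at Q4 (~0.5 bytes/parameter) on 24 GB cards (RTX 3090 Ti class)
    print(cards_needed(70, 0.5, 24))   # -> 2
    # the same model in FP16 on 40 GB cards (A100 class)
    print(cards_needed(70, 2.0, 40))   # -> 5
```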

A quick size note on CPU inference

First off: CPU inference is usually prohibitively slow for LLMs. The following rough estimate makes some major assumptions:

Under those assumptions, the estimated inference speed is shown in the following table:

| Model | Size (Q4) | CPU Inference Speed |
|---|---|---|
| LLaMA 13B | ~7 GB | 2–8 tokens/sec |
| LLaMA 65B | ~35 GB | ~0.3–0.8 tokens/sec |
| Custom 175B | ~87 GB | ~0.1 tokens/sec |
| GPT-3 175B | ~87 GB | ~0.1 tokens/sec |
| 500B Q4 | ~250 GB | 0.02–0.3 tokens/sec |
| 700B Q4 | ~350 GB | 0.01–0.1 tokens/sec |
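
One way to arrive at numbers of this order is to treat CPU decoding as memory-bandwidth bound: every generated token has to stream roughly the full set of quantized weights from RAM. The sketch below is a plausible reconstruction under that assumption, with an assumed effective bandwidth of 25 GB/s; it is not the exact calculation behind the table.

```python
# Rough bandwidth-bound estimate for CPU decoding, assuming every token
# requires streaming the full quantized weights from RAM and that the
# effective bandwidth (assumed 25 GB/s here) is well below the DDR5 peak.

def cpu_tokens_per_sec(model_size_gb: float,
                       effective_bandwidth_gb_s: float = 25.0) -> float:
    """Upper bound on tokens/sec if decoding is purely bandwidth bound."""
    return effective_bandwidth_gb_s / model_size_gb

if __name__ == "__main__":
    for name, size_gb in [("LLaMA 13B Q4", 7), ("LLaMA 65B Q4", 35),
                          ("175B Q4", 87), ("700B Q4", 350)]:
        print(f"{name}: <= {cpu_tokens_per_sec(size_gb):.2f} tokens/sec")
```

Real throughput falls further below this simple bound for the largest models because of the cache and NUMA effects listed below.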

What Slows Down CPU Inference for Large Models?

  1. No specialized matrix hardware: CPUs lack fast tensor cores (like GPUs have).
  2. Cache thrashing: L3 caches (~60–100MB per socket) are tiny compared to the model weights and KV cache.
  3. Memory bandwidth bottleneck: Even with DDR5 and multi-channel memory, throughput is far below what’s needed.
  4. Parallelization overhead: Spawning inference across 64+ threads can have diminishing returns due to NUMA and cache contention.
