Last update 05 Apr 2025
This is a short note for myself: an estimate of the VRAM requirements for different LLM sizes and of their capabilities.
Model Size | Key Characteristics | Suitable Task Categories | Comments | VRAM Needed (Q4) | VRAM Needed (FP16) |
---|---|---|---|---|---|
7B | Basic reasoning, fluent text, short context memory | Simple Q&A, rewriting/paraphrasing, simple summarization, surface-level chatbots | Lightweight, but lacks nuance and depth. Often verbose or overly generic. | 4-6 GB | 14 GB |
13B–15B | Improved instruction following, slightly better logical consistency | Slightly more reliable summarization, code completion (basic), chain-of-thought prompts (if guided), agent tools (basic routing) | Better, but still weak at multi-step tasks and reasoning under uncertainty. | 8-10 GB | 26 GB |
30–34B | First strong “reasoning class”: deep chain-of-thought, better long-context attention | Multi-hop reasoning, code evaluation (unit tests, logic correctness), multi-agent pipelines, embedded tool chains, long document synthesis | Huge jump in coherence and problem-solving. Best value tier for deep hobby work. | 18-22 GB | 60-70 GB |
65–70B | Starts to show emergent properties: can imitate human-style discourse, deeper knowledge | Philosophical or ethical reasoning, legal/academic summarization, emotional intelligence modeling, writing fiction with rich internal logic, world modeling | Moves from “clever assistant” to “primitive theorist.” Context anchoring improves. | 36-42 GB | 130-140 GB |
130B | Full LLM “agent” capability emerges: deeper memory, robust long-range logic | Expert tutoring, interdisciplinary synthesis (e.g. philosophy + math), semantic reasoning with abstraction, goal-directed planning (with external tools), sparse memory/agent frameworks | Comparable to early GPT-3.5. Major qualitative change in planning and abstraction. | 72-84 GB | 260-280 GB |
300–350B | Rare insights, model-based hypothesis formation, coherent across very long chains (>5 steps) | Code-interpreter-level logic, simulating scientific debate, designing systems from specs, research assistant with hypothesis generation, autonomous agent coordination | Entry point to real autonomy. Can self-reflect in structured prompts. | 150-190 GB | 600 GB |
500–700B | “Soft AGI-like” tier: stable under pressure, few-shot mastery, dense knowledge encoding | Emulation of expert-level thought processes, emergent creative problem solving, implicit model of theory of mind, conscious simulation of user goals, meta-level reasoning (e.g. explain why a chain failed) | This is where GPT-4, Claude Opus, and Gemini Ultra operate. Expensive, but vastly more reliable and nuanced. You can build actual thinking systems here. | 250-320 GB | 1-1.2 TB |
Here’s the rough calculation logic that has been used: at FP16 a model needs about #params × 2 bytes of VRAM, so for example a 7B model needs roughly 14 GB. Q4 quantization cuts that to roughly a quarter (about 0.5 bytes per parameter) plus some overhead; Q8 needs ~2× more VRAM than Q4, Q2 needs ~50% less than Q4, but may reduce reasoning quality significantly.
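A minimal sketch of this rule of thumb. The bytes-per-parameter values for Q8/Q4/Q2 are the usual rough figures and an assumption here; KV cache and activation buffers are ignored and come on top of the pure weight size (which is roughly what separates the 3.5 GB of weights of a 7B Q4 model from the 4-6 GB listed above):

```python
# Rough VRAM estimate for the model weights alone; KV cache and
# activation buffers are not included and come on top.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q2": 0.25}

def weight_vram_gb(params_billion: float, quant: str = "Q4") -> float:
    """Estimated weight footprint in GB for a dense model of the given size."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1e9

if __name__ == "__main__":
    for size in (7, 13, 34, 70, 130, 350, 700):
        print(f"{size:>4}B  Q4 ≈ {weight_vram_gb(size, 'Q4'):6.1f} GB   "
              f"FP16 ≈ {weight_vram_gb(size, 'FP16'):7.1f} GB")
```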
Emergent Ability | Model Size Where It Appears | Notes |
---|---|---|
Coherent Chain-of-Thought | ~13–30B | Very fragile below 13B, solidifies around 30B |
Tool Use & Code Execution | ~30B+ | 13B can mimic syntax, but 30B understands intent |
Abstraction & Analogy | 65–130B | Generalizes beyond examples, recognizes patterns |
Planning Across Time | ~130–300B | Keeps task goals across long dialogue |
Meta-reasoning / Self-critiquing | ~500B+ | Explains its own reasoning, debugs itself |
Theory of Mind (ToM-lite) | ~500–700B | Starts modeling human beliefs/intentions |
As shown previously, one can utilize more cards even on very small-scale systems by using PCIe port expanders.
Note: All links provided below are Amazon affiliate links; this page's author profits from qualified purchases.
Graphics card | VRAM | What is possible? |
---|---|---|
RTX 3060 12GB | 12 GB per card | Can easily build a 24 GB VRAM system with two cards. A single card is capable of running image generators like SDXL or smaller models up to 13B in real time, and 32B models under some circumstances with reduced speed. 70B models are realistically unusable with a single card. Price is low enough for hobby applications. |
RTX 3090 Ti 24GB | 24 GB per card | Easily build a 48 GB VRAM system with two cards, capable of executing 32B models without trouble; under some circumstances 70B models work too. Also often used for SDXL. Price is still in reach for personal and hobbyist applications. |
A100 | 40 GB per card | Easily build clusters with 80-160 GB VRAM using multiple cards, though the price is usually prohibitive for hobby applications. Such systems are often used professionally. Machines with more than four such cards usually drive systems like the commonly known GPTs that are currently widely available as cloud services. |
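As a rough aid when comparing these setups, the Q4 rule of thumb from above can be inverted to estimate the largest dense model a given total VRAM budget can hold. This is a sketch under the same assumptions as before; the 0.5 bytes per parameter and the 20% headroom reserved for KV cache and context are assumptions, not measured values:

```python
# Largest dense Q4 model (in billions of parameters) that fits into a
# given total VRAM budget, reserving some headroom for KV cache and
# activations. Bytes/param and headroom are rule-of-thumb assumptions.
Q4_BYTES_PER_PARAM = 0.5
HEADROOM = 0.8  # use only ~80% of VRAM for weights

def max_q4_params_billion(total_vram_gb: float) -> float:
    usable_bytes = total_vram_gb * 1e9 * HEADROOM
    return usable_bytes / Q4_BYTES_PER_PARAM / 1e9

if __name__ == "__main__":
    for name, vram in (("1x RTX 3060", 12), ("2x RTX 3060", 24),
                       ("2x RTX 3090 Ti", 48), ("4x A100 40GB", 160)):
        print(f"{name:<16} ~{max_q4_params_billion(vram):5.0f}B parameters at Q4")
```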
First off - CPU inference is usually prohibitive for LLMs. In the following rough estimate some major assumptions are made: the models are Q4-quantized and run on CPU-only backends (llama.cpp, ggml, or gguf-based backends). Under those assumptions the estimated inference speed is shown in the following table:
Model | Size (Q4) | CPU Inference Speed |
---|---|---|
LLaMA 13B | ~7GB | 2–8 tokens/sec |
LLaMA 65B | ~35GB | ~0.3–0.8 tokens/sec |
Custom 175B | ~87GB | ~0.1 tokens/sec |
GPT-3 175B | ~87GB | ~0.1 tokens/sec |
500B Q4 | ~250GB | 0.02–0.3 tokens/sec |
700B Q4 | ~350GB | 0.01–0.1 tokens/sec |
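These figures are roughly consistent with CPU token generation being memory-bandwidth bound: each generated token requires streaming essentially all weights from RAM, so tokens/sec is at most the effective RAM bandwidth divided by the model size. A minimal sketch of that upper estimate; the bandwidth values are assumptions for typical dual-channel desktop systems, not measurements, and real throughput is usually lower:

```python
# Bandwidth-bound upper estimate for CPU token generation:
# every token touches (roughly) all weights once, so
#   tokens/sec ≈ effective RAM bandwidth / model size.
# Compute and cache effects push real numbers below this bound.
def tokens_per_sec(model_size_gb: float, ram_bandwidth_gb_s: float) -> float:
    return ram_bandwidth_gb_s / model_size_gb

if __name__ == "__main__":
    # Assumed effective bandwidths (not measured):
    # dual-channel DDR4 ~30 GB/s, dual-channel DDR5 ~60 GB/s.
    for name, size_gb in (("LLaMA 13B Q4", 7), ("LLaMA 65B Q4", 35), ("175B Q4", 87)):
        ddr4 = tokens_per_sec(size_gb, 30)
        ddr5 = tokens_per_sec(size_gb, 60)
        print(f"{name:<14} ≤ {ddr4:4.1f} tok/s (DDR4)   ≤ {ddr5:4.1f} tok/s (DDR5)")
```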
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/