How Far Have Small Models Come? Choosing a Small LLM for Air-Gapped Deployment
The current state of 1B–32B open-weight LLMs, the per-task 'good enough' threshold, and how model size relates to quantization and hardware — backed by verified numbers.
The notion that "small models aren't usable" has already collapsed. The real question is which task, how many billion parameters, and what hardware. Gartner projects that by 2027 organizations will use small, task-specific models roughly 3× more than general-purpose LLMs. In air-gapped or on-premise environments where data must not leave the network, this choice often decides whether self-hosting is viable at all.
This post is not a "latest models" roundup. It aims to be a decision tool for anyone evaluating self-hosting. One per-task threshold table and one verifiable memory formula will answer most sizing questions.
On the reliability of these numbers: the benchmarks and memory figures below were verified against primary sources wherever possible (HuggingFace model cards, official llama.cpp docs, arXiv). Some 2026 successor models (Gemma 4, Qwen3.5/3.6, Mistral Small 4, etc.) were confirmed to exist, but their benchmarks are published only as images, so the numbers could not be verified. The figures in this post therefore use the verified Qwen3, Phi-4, Gemma 3, and SmolLM3 generations as the baseline.
Where Things Stand Today
The absolute performance of small models has surpassed the mid-sized models of a year or two ago. The verified headline scores:
| Model | Size | License | Headline scores |
|---|---|---|---|
| Phi-4 | 14.7B | MIT | MMLU 84.8 · MATH 80.4 · GPQA 56.1 |
| Phi-4-reasoning-plus | 14B | MIT | AIME'24 81.3 · GPQA-Diamond 68.9 · MMLU-Pro 76.0 |
| Phi-4-mini-instruct | 3.8B | MIT | GSM8K 88.6 · HumanEval 74.4 · MMLU 67.3 |
| Qwen3-32B | 32.8B | Apache 2.0 | AIME'24 79.5 · LiveCodeBench v5 62.7 · BFCL v3 66.4 |
| Qwen3-30B-A3B (MoE) | 30.5B/3.3B active | Apache 2.0 | AIME'24 80.4 · BFCL v3 69.1 |
| Gemma-3-27B-IT | 27B | Gemma | MMLU-Pro 67.5 · MATH 69.0 · GPQA-Diamond 42.4 |
| SmolLM3-3B (reasoning) | 3B | Apache 2.0 | AIME'25 36.7 · GPQA-Diamond 41.7 |
The takeaway: even 3–4B-class models post meaningful scores on math and coding, and 14B reaches into territory that was once frontier-grade. But you can't judge "usable" from a single score line — the threshold differs by task.
The Per-Task "Good Enough" Threshold
This is the most important table in the post. It lays out, per task, "the minimum size that's usable in practice" and "the cliff below which it gets risky."
| Task | Minimum usable size | Quality cliff | Evidence |
|---|---|---|---|
| Classification / sentiment / routing | Under 1B (fine-tuned 125M encoder) | Driven by training data, not size | arXiv 2406.08660 |
| Summarization | 7–8B (well-aligned 4B works) | Below ~3–4B | Vectara hallucination leaderboard |
| RAG / grounded QA | 7B (2–3B needs fine-tuning) | Below 2–3B not recommended | Microsoft RAG study |
| Function calling (single-turn) / JSON | 7–8B | Drops sharply on multi-turn | Berkeley BFCL |
| Coding — autocomplete | 7B | — | Qwen2.5-Coder report |
| Coding — agentic (repo-level) | 32B+ (still sub-frontier) | Below 14B | SWE-Dev |
| Multi-step agents | No reliable threshold | ≤8B unreliable | MCP-Bench |
| Translation | 9–12B (specialized models favored) | Below 7B, esp. low-resource | Tower+ / Aya |
| Long context | Effective 16–32K | Beyond effective length | RULER |
Three practical conclusions emerge from this table.
1) Low-stakes tasks are already fine at 7–8B. Classification, summarization, single-turn tool calls, and RAG can go to production on a 7–8B model running on a laptop or a single GPU. This is exactly the territory where "we don't need an external API" holds in an air-gapped setting. For classification specifically, a fine-tuned sub-1B encoder can even beat zero-shot GPT-4.
2) Agents and repo-level coding are still dominated by size. On SWE-Dev, 7B scores about 23% while 32B scores about 37% — nearly double. Trying to run an autonomous multi-step agent on an 8B model and failing is a common trap. Small models must be tightly scaffolded into narrow scopes.
3) Don't take "128K context" at face value. Per the RULER benchmark, the effective context of 7–8B models is only a fraction of the advertised length. Mistral-7B, for instance, advertises 32K but is effective to roughly 16K, with accuracy collapsing in the 128K range. It's safer to design around an "effective 16–32K" assumption.
Size × Quantization × Hardware
The core question of self-hosting is "what will actually run on my hardware." The answer comes from a single formula.
Required memory ≈ (param count × bytes per precision) + KV cache + overhead (10–20%)Here's the weights-only table. Per official llama.cpp figures, Q4_K_M is about 4.9 bits/weight (≈0.61 bytes/param), slightly larger than the commonly cited "INT4 = 0.5 bytes."
| Model | FP16 | Q8 | Q4_K_M (recommended) | Realistically runs on |
|---|---|---|---|---|
| 1–3B | 2–6GB | 1–3GB | 0.6–2GB | Laptop CPU, Raspberry Pi 5 |
| 7–8B | ~15GB | ~8GB | ~4.6GB | 16GB laptop/Mac, 24GB GPU |
| 13–14B | ~27GB | ~14GB | ~8GB | 32GB RAM, 24GB GPU |
| 27–32B | ~60GB | ~32GB | ~18–20GB | 24GB GPU (Q4), 80GB card (FP16) |
Verified facts (for an 8.03B model): Q4_K_M is about 4.58GiB, Q8_0 about 7.95GiB, F16 about 14.96GiB. Extending these ratios to other sizes gives the table above.
How Far Can You Quantize?
Per arXiv 2601.14277 (a unified quantization evaluation of Llama-3.1-8B):
- Q4_K_M: ~69% compression for about 1 point of MMLU loss. Effectively the sweet spot.
- Q5/Q6: step up one level for sensitive tasks like code and math.
- Below Q3: quality loss becomes visibly large — use only when memory is truly scarce.
Q4 is plenty for summarization and classification, but remember that code generation, math, and reasoning are more sensitive to quantization.
KV Cache Dominates Weights at Long Context
Don't look at weights alone. As context grows, the KV cache eats memory.
KV cache = 2 × num_layers × num_KV_heads × head_dim × num_tokens × bytes_per_elemFor Llama-3-8B (32 layers, 8 KV heads, head_dim 128, FP16), that's about 128KB per token. So:
- 8K context: ~1.0GB
- 128K context: ~16.8GB — larger than the weights (Q4 ~4.6GB).
This is why GQA (Grouped-Query Attention), adopted by recent small models, is decisive: it cuts the cache roughly 4× by reducing KV heads. You can also quantize the KV cache itself (Q8/Q4 KV).
Getting a Feel for Throughput
| Environment | Model | Throughput |
|---|---|---|
| Raspberry Pi 5 | 3B Q4 | 2–7 tok/s |
| CPU-only server | 7B Q4 | 3–12 tok/s (memory-bandwidth bound) |
| M4 Max (Mac) | 7B Q4_0 | ~83 tok/s |
| RTX 4090 24GB | 8B Q4 | ~95–110 tok/s (single request) |
A Mac's unified memory lets the entire RAM pool act as VRAM, so 16GB handles 7–8B and 128GB can target up to 70B Q4. A single 80GB datacenter card (A100/H100) fits 32B FP16 or 70B INT4 on one card.
Choosing an Inference Engine
The same model fits different scenarios depending on the engine.
| Scenario | Recommended engine | Why |
|---|---|---|
| Laptop dev / air-gapped endpoints | llama.cpp, Ollama, LM Studio, llamafile | CPU+GPU hybrid, GGUF, offline by default |
| Single-GPU dev / small serving | Ollama, llama.cpp (llama-server), vLLM | OpenAI-compatible API, light ops |
| Multi-GPU production serving (throughput) | vLLM, SGLang | Continuous batching, tensor/pipeline parallelism, paged KV |
Points worth noting from an air-gapped / on-premise angle:
- The llama.cpp family (including Ollama, LM Studio, llamafile) has first-class ARM64 support, handling NEON and Apple Silicon Metal natively. The single-binary
llamafileis especially strong for air-gapped distribution. - vLLM also ships Arm CPU wheels since 0.11.2 (tested on Graviton3), and for GPU serving it separately optimizes prefill (compute-bound) and decode (memory-bound) via chunked prefill and continuous batching. If prefill latency is your bottleneck, this feature in vLLM/SGLang is a direct candidate solution.
- TGI (Text Generation Inference) is now in maintenance mode. For new builds, prefer vLLM or SGLang.
The Big Picture in 2026: Efficient Architectures and Data Sovereignty
Two forces drive the small/efficient model trend.
Architectural efficiency. Large open-weight models are driving down long-context cost with sparse attention. DeepSeek's DSA (Sparse Attention) was established in V3.2 (Dec 2025, MIT), and Z.ai's GLM-5.2 (Jun 2026, MIT, ~753B/40B active, 1M context) cut per-token FLOPs at 1M context with a DSA-based IndexShare. MiniMax M3, using its own MSA (MiniMax Sparse Attention), reports cutting per-token compute at 1M context to about 1/20 of the previous generation, with 9×+ prefill and 15×+ decode speedups. These are large models, but sparse/linear attention designs are clearly worth putting on the evaluation shortlist.
Data sovereignty and cost. The reasons organizations choose self-hosting are clear: regulations like ITAR, HIPAA, and GDPR; zero-egress requirements; and cost. Open-weight (especially Apache 2.0 / MIT) weights only need to be downloaded once and run inside the network, fundamentally resolving data-residency concerns. Gartner expects more than half of enterprise GenAI models to be domain-specific by 2027 (up from ~1% in 2024).
Adoption Checklist
Items to check before putting a small model into a new workload.
- Pin the task type — classification / summarization / RAG / tool-calling / coding / agent (thresholds differ)
- Pick a size above the threshold — meet the per-task minimum from the table above
- Confirm effective context — design around effective length (typically 16–32K), not the advertised length
- Decide quantization — default to Q4_K_M; Q5/Q6 for code and math; avoid below Q3
- Estimate memory — weights + KV cache (factoring in context length) + overhead
- Match hardware — laptop / single GPU / datacenter card
- Choose an inference engine — llama.cpp family for air-gapped/ARM64; vLLM/SGLang for production serving
- Check the license — Apache 2.0 / MIT are safe; verify terms for Llama / Gemma licenses
The era of small models is an era not of "size" but of "the right size for the task." Define the workload precisely, and the range you can run inside an air-gapped network — with no external API — is wider than you'd think.
References
- Model cards / releases: Qwen3 Technical Report, Phi-4-mini, Gemma 3, SmolLM3
- Per-task thresholds: RULER (long context), Berkeley BFCL, Vectara hallucination leaderboard
- Memory / quantization: llama.cpp quantize, quantization quality eval (arXiv 2601.14277)
- Efficient architectures: GLM-5.2, MiniMax M3, DeepSeek Sparse Attention
- Trends: Gartner: small task-specific models