llmopen-weightself-hostinginference

How Far Have Small Models Come? Choosing a Small LLM for Air-Gapped Deployment

The current state of 1B–32B open-weight LLMs, the per-task 'good enough' threshold, and how model size relates to quantization and hardware — backed by verified numbers.

Data DynamicsJune 22, 20269 min read

The notion that "small models aren't usable" has already collapsed. The real question is which task, how many billion parameters, and what hardware. Gartner projects that by 2027 organizations will use small, task-specific models roughly 3× more than general-purpose LLMs. In air-gapped or on-premise environments where data must not leave the network, this choice often decides whether self-hosting is viable at all.

This post is not a "latest models" roundup. It aims to be a decision tool for anyone evaluating self-hosting. One per-task threshold table and one verifiable memory formula will answer most sizing questions.

On the reliability of these numbers: the benchmarks and memory figures below were verified against primary sources wherever possible (HuggingFace model cards, official llama.cpp docs, arXiv). Some 2026 successor models (Gemma 4, Qwen3.5/3.6, Mistral Small 4, etc.) were confirmed to exist, but their benchmarks are published only as images, so the numbers could not be verified. The figures in this post therefore use the verified Qwen3, Phi-4, Gemma 3, and SmolLM3 generations as the baseline.

Where Things Stand Today

The absolute performance of small models has surpassed the mid-sized models of a year or two ago. The verified headline scores:

Model	Size	License	Headline scores
Phi-4	14.7B	MIT	MMLU 84.8 · MATH 80.4 · GPQA 56.1
Phi-4-reasoning-plus	14B	MIT	AIME'24 81.3 · GPQA-Diamond 68.9 · MMLU-Pro 76.0
Phi-4-mini-instruct	3.8B	MIT	GSM8K 88.6 · HumanEval 74.4 · MMLU 67.3
Qwen3-32B	32.8B	Apache 2.0	AIME'24 79.5 · LiveCodeBench v5 62.7 · BFCL v3 66.4
Qwen3-30B-A3B (MoE)	30.5B/3.3B active	Apache 2.0	AIME'24 80.4 · BFCL v3 69.1
Gemma-3-27B-IT	27B	Gemma	MMLU-Pro 67.5 · MATH 69.0 · GPQA-Diamond 42.4
SmolLM3-3B (reasoning)	3B	Apache 2.0	AIME'25 36.7 · GPQA-Diamond 41.7

The takeaway: even 3–4B-class models post meaningful scores on math and coding, and 14B reaches into territory that was once frontier-grade. But you can't judge "usable" from a single score line — the threshold differs by task.

The Per-Task "Good Enough" Threshold

This is the most important table in the post. It lays out, per task, "the minimum size that's usable in practice" and "the cliff below which it gets risky."

Task	Minimum usable size	Quality cliff	Evidence
Classification / sentiment / routing	Under 1B (fine-tuned 125M encoder)	Driven by training data, not size	arXiv 2406.08660
Summarization	7–8B (well-aligned 4B works)	Below ~3–4B	Vectara hallucination leaderboard
RAG / grounded QA	7B (2–3B needs fine-tuning)	Below 2–3B not recommended	Microsoft RAG study
Function calling (single-turn) / JSON	7–8B	Drops sharply on multi-turn	Berkeley BFCL
Coding — autocomplete	7B	—	Qwen2.5-Coder report
Coding — agentic (repo-level)	32B+ (still sub-frontier)	Below 14B	SWE-Dev
Multi-step agents	No reliable threshold	≤8B unreliable	MCP-Bench
Translation	9–12B (specialized models favored)	Below 7B, esp. low-resource	Tower+ / Aya
Long context	Effective 16–32K	Beyond effective length	RULER

Three practical conclusions emerge from this table.

1) Low-stakes tasks are already fine at 7–8B. Classification, summarization, single-turn tool calls, and RAG can go to production on a 7–8B model running on a laptop or a single GPU. This is exactly the territory where "we don't need an external API" holds in an air-gapped setting. For classification specifically, a fine-tuned sub-1B encoder can even beat zero-shot GPT-4.

2) Agents and repo-level coding are still dominated by size. On SWE-Dev, 7B scores about 23% while 32B scores about 37% — nearly double. Trying to run an autonomous multi-step agent on an 8B model and failing is a common trap. Small models must be tightly scaffolded into narrow scopes.

3) Don't take "128K context" at face value. Per the RULER benchmark, the effective context of 7–8B models is only a fraction of the advertised length. Mistral-7B, for instance, advertises 32K but is effective to roughly 16K, with accuracy collapsing in the 128K range. It's safer to design around an "effective 16–32K" assumption.

Size × Quantization × Hardware

The core question of self-hosting is "what will actually run on my hardware." The answer comes from a single formula.

Required memory ≈ (param count × bytes per precision) + KV cache + overhead (10–20%)

Here's the weights-only table. Per official llama.cpp figures, Q4_K_M is about 4.9 bits/weight (≈0.61 bytes/param), slightly larger than the commonly cited "INT4 = 0.5 bytes."

Model	FP16	Q8	Q4_K_M (recommended)	Realistically runs on
1–3B	2–6GB	1–3GB	0.6–2GB	Laptop CPU, Raspberry Pi 5
7–8B	~15GB	~8GB	~4.6GB	16GB laptop/Mac, 24GB GPU
13–14B	~27GB	~14GB	~8GB	32GB RAM, 24GB GPU
27–32B	~60GB	~32GB	~18–20GB	24GB GPU (Q4), 80GB card (FP16)

Verified facts (for an 8.03B model): Q4_K_M is about 4.58GiB, Q8_0 about 7.95GiB, F16 about 14.96GiB. Extending these ratios to other sizes gives the table above.

How Far Can You Quantize?

Per arXiv 2601.14277 (a unified quantization evaluation of Llama-3.1-8B):

Q4_K_M: ~69% compression for about 1 point of MMLU loss. Effectively the sweet spot.
Q5/Q6: step up one level for sensitive tasks like code and math.
Below Q3: quality loss becomes visibly large — use only when memory is truly scarce.

Q4 is plenty for summarization and classification, but remember that code generation, math, and reasoning are more sensitive to quantization.

KV Cache Dominates Weights at Long Context

Don't look at weights alone. As context grows, the KV cache eats memory.

KV cache = 2 × num_layers × num_KV_heads × head_dim × num_tokens × bytes_per_elem

For Llama-3-8B (32 layers, 8 KV heads, head_dim 128, FP16), that's about 128KB per token. So:

8K context: ~1.0GB
128K context: ~16.8GB — larger than the weights (Q4 ~4.6GB).

This is why GQA (Grouped-Query Attention), adopted by recent small models, is decisive: it cuts the cache roughly 4× by reducing KV heads. You can also quantize the KV cache itself (Q8/Q4 KV).

Getting a Feel for Throughput

Environment	Model	Throughput
Raspberry Pi 5	3B Q4	2–7 tok/s
CPU-only server	7B Q4	3–12 tok/s (memory-bandwidth bound)
M4 Max (Mac)	7B Q4_0	~83 tok/s
RTX 4090 24GB	8B Q4	~95–110 tok/s (single request)

A Mac's unified memory lets the entire RAM pool act as VRAM, so 16GB handles 7–8B and 128GB can target up to 70B Q4. A single 80GB datacenter card (A100/H100) fits 32B FP16 or 70B INT4 on one card.

Choosing an Inference Engine

The same model fits different scenarios depending on the engine.

Scenario	Recommended engine	Why
Laptop dev / air-gapped endpoints	llama.cpp, Ollama, LM Studio, llamafile	CPU+GPU hybrid, GGUF, offline by default
Single-GPU dev / small serving	Ollama, llama.cpp (`llama-server`), vLLM	OpenAI-compatible API, light ops
Multi-GPU production serving (throughput)	vLLM, SGLang	Continuous batching, tensor/pipeline parallelism, paged KV

Points worth noting from an air-gapped / on-premise angle:

The llama.cpp family (including Ollama, LM Studio, llamafile) has first-class ARM64 support, handling NEON and Apple Silicon Metal natively. The single-binary llamafile is especially strong for air-gapped distribution.
vLLM also ships Arm CPU wheels since 0.11.2 (tested on Graviton3), and for GPU serving it separately optimizes prefill (compute-bound) and decode (memory-bound) via chunked prefill and continuous batching. If prefill latency is your bottleneck, this feature in vLLM/SGLang is a direct candidate solution.
TGI (Text Generation Inference) is now in maintenance mode. For new builds, prefer vLLM or SGLang.

The Big Picture in 2026: Efficient Architectures and Data Sovereignty

Two forces drive the small/efficient model trend.

Architectural efficiency. Large open-weight models are driving down long-context cost with sparse attention. DeepSeek's DSA (Sparse Attention) was established in V3.2 (Dec 2025, MIT), and Z.ai's GLM-5.2 (Jun 2026, MIT, ~753B/40B active, 1M context) cut per-token FLOPs at 1M context with a DSA-based IndexShare. MiniMax M3, using its own MSA (MiniMax Sparse Attention), reports cutting per-token compute at 1M context to about 1/20 of the previous generation, with 9×+ prefill and 15×+ decode speedups. These are large models, but sparse/linear attention designs are clearly worth putting on the evaluation shortlist.

Data sovereignty and cost. The reasons organizations choose self-hosting are clear: regulations like ITAR, HIPAA, and GDPR; zero-egress requirements; and cost. Open-weight (especially Apache 2.0 / MIT) weights only need to be downloaded once and run inside the network, fundamentally resolving data-residency concerns. Gartner expects more than half of enterprise GenAI models to be domain-specific by 2027 (up from ~1% in 2024).

Adoption Checklist

Items to check before putting a small model into a new workload.

Pin the task type — classification / summarization / RAG / tool-calling / coding / agent (thresholds differ)
Pick a size above the threshold — meet the per-task minimum from the table above
Confirm effective context — design around effective length (typically 16–32K), not the advertised length
Decide quantization — default to Q4_K_M; Q5/Q6 for code and math; avoid below Q3
Estimate memory — weights + KV cache (factoring in context length) + overhead
Match hardware — laptop / single GPU / datacenter card
Choose an inference engine — llama.cpp family for air-gapped/ARM64; vLLM/SGLang for production serving
Check the license — Apache 2.0 / MIT are safe; verify terms for Llama / Gemma licenses

The era of small models is an era not of "size" but of "the right size for the task." Define the workload precisely, and the range you can run inside an air-gapped network — with no external API — is wider than you'd think.

References

Model cards / releases: Qwen3 Technical Report, Phi-4-mini, Gemma 3, SmolLM3
Per-task thresholds: RULER (long context), Berkeley BFCL, Vectara hallucination leaderboard
Memory / quantization: llama.cpp quantize, quantization quality eval (arXiv 2601.14277)
Efficient architectures: GLM-5.2, MiniMax M3, DeepSeek Sparse Attention
Trends: Gartner: small task-specific models