Blog
llmopen-weightself-hostinginference

How Far Have Small Models Come? Choosing a Small LLM for Air-Gapped Deployment

The current state of 1B–32B open-weight LLMs, the per-task 'good enough' threshold, and how model size relates to quantization and hardware — backed by verified numbers.

Data DynamicsJune 22, 20269 min read

The notion that "small models aren't usable" has already collapsed. The real question is which task, how many billion parameters, and what hardware. Gartner projects that by 2027 organizations will use small, task-specific models roughly 3× more than general-purpose LLMs. In air-gapped or on-premise environments where data must not leave the network, this choice often decides whether self-hosting is viable at all.

This post is not a "latest models" roundup. It aims to be a decision tool for anyone evaluating self-hosting. One per-task threshold table and one verifiable memory formula will answer most sizing questions.

On the reliability of these numbers: the benchmarks and memory figures below were verified against primary sources wherever possible (HuggingFace model cards, official llama.cpp docs, arXiv). Some 2026 successor models (Gemma 4, Qwen3.5/3.6, Mistral Small 4, etc.) were confirmed to exist, but their benchmarks are published only as images, so the numbers could not be verified. The figures in this post therefore use the verified Qwen3, Phi-4, Gemma 3, and SmolLM3 generations as the baseline.

Where Things Stand Today

The absolute performance of small models has surpassed the mid-sized models of a year or two ago. The verified headline scores:

ModelSizeLicenseHeadline scores
Phi-414.7BMITMMLU 84.8 · MATH 80.4 · GPQA 56.1
Phi-4-reasoning-plus14BMITAIME'24 81.3 · GPQA-Diamond 68.9 · MMLU-Pro 76.0
Phi-4-mini-instruct3.8BMITGSM8K 88.6 · HumanEval 74.4 · MMLU 67.3
Qwen3-32B32.8BApache 2.0AIME'24 79.5 · LiveCodeBench v5 62.7 · BFCL v3 66.4
Qwen3-30B-A3B (MoE)30.5B/3.3B activeApache 2.0AIME'24 80.4 · BFCL v3 69.1
Gemma-3-27B-IT27BGemmaMMLU-Pro 67.5 · MATH 69.0 · GPQA-Diamond 42.4
SmolLM3-3B (reasoning)3BApache 2.0AIME'25 36.7 · GPQA-Diamond 41.7

The takeaway: even 3–4B-class models post meaningful scores on math and coding, and 14B reaches into territory that was once frontier-grade. But you can't judge "usable" from a single score line — the threshold differs by task.

The Per-Task "Good Enough" Threshold

This is the most important table in the post. It lays out, per task, "the minimum size that's usable in practice" and "the cliff below which it gets risky."

TaskMinimum usable sizeQuality cliffEvidence
Classification / sentiment / routingUnder 1B (fine-tuned 125M encoder)Driven by training data, not sizearXiv 2406.08660
Summarization7–8B (well-aligned 4B works)Below ~3–4BVectara hallucination leaderboard
RAG / grounded QA7B (2–3B needs fine-tuning)Below 2–3B not recommendedMicrosoft RAG study
Function calling (single-turn) / JSON7–8BDrops sharply on multi-turnBerkeley BFCL
Coding — autocomplete7BQwen2.5-Coder report
Coding — agentic (repo-level)32B+ (still sub-frontier)Below 14BSWE-Dev
Multi-step agentsNo reliable threshold≤8B unreliableMCP-Bench
Translation9–12B (specialized models favored)Below 7B, esp. low-resourceTower+ / Aya
Long contextEffective 16–32KBeyond effective lengthRULER

Three practical conclusions emerge from this table.

1) Low-stakes tasks are already fine at 7–8B. Classification, summarization, single-turn tool calls, and RAG can go to production on a 7–8B model running on a laptop or a single GPU. This is exactly the territory where "we don't need an external API" holds in an air-gapped setting. For classification specifically, a fine-tuned sub-1B encoder can even beat zero-shot GPT-4.

2) Agents and repo-level coding are still dominated by size. On SWE-Dev, 7B scores about 23% while 32B scores about 37% — nearly double. Trying to run an autonomous multi-step agent on an 8B model and failing is a common trap. Small models must be tightly scaffolded into narrow scopes.

3) Don't take "128K context" at face value. Per the RULER benchmark, the effective context of 7–8B models is only a fraction of the advertised length. Mistral-7B, for instance, advertises 32K but is effective to roughly 16K, with accuracy collapsing in the 128K range. It's safer to design around an "effective 16–32K" assumption.

Size × Quantization × Hardware

The core question of self-hosting is "what will actually run on my hardware." The answer comes from a single formula.

Required memory ≈ (param count × bytes per precision) + KV cache + overhead (10–20%)

Here's the weights-only table. Per official llama.cpp figures, Q4_K_M is about 4.9 bits/weight (≈0.61 bytes/param), slightly larger than the commonly cited "INT4 = 0.5 bytes."

ModelFP16Q8Q4_K_M (recommended)Realistically runs on
1–3B2–6GB1–3GB0.6–2GBLaptop CPU, Raspberry Pi 5
7–8B~15GB~8GB~4.6GB16GB laptop/Mac, 24GB GPU
13–14B~27GB~14GB~8GB32GB RAM, 24GB GPU
27–32B~60GB~32GB~18–20GB24GB GPU (Q4), 80GB card (FP16)

Verified facts (for an 8.03B model): Q4_K_M is about 4.58GiB, Q8_0 about 7.95GiB, F16 about 14.96GiB. Extending these ratios to other sizes gives the table above.

How Far Can You Quantize?

Per arXiv 2601.14277 (a unified quantization evaluation of Llama-3.1-8B):

  • Q4_K_M: ~69% compression for about 1 point of MMLU loss. Effectively the sweet spot.
  • Q5/Q6: step up one level for sensitive tasks like code and math.
  • Below Q3: quality loss becomes visibly large — use only when memory is truly scarce.

Q4 is plenty for summarization and classification, but remember that code generation, math, and reasoning are more sensitive to quantization.

KV Cache Dominates Weights at Long Context

Don't look at weights alone. As context grows, the KV cache eats memory.

KV cache = 2 × num_layers × num_KV_heads × head_dim × num_tokens × bytes_per_elem

For Llama-3-8B (32 layers, 8 KV heads, head_dim 128, FP16), that's about 128KB per token. So:

  • 8K context: ~1.0GB
  • 128K context: ~16.8GB — larger than the weights (Q4 ~4.6GB).

This is why GQA (Grouped-Query Attention), adopted by recent small models, is decisive: it cuts the cache roughly 4× by reducing KV heads. You can also quantize the KV cache itself (Q8/Q4 KV).

Getting a Feel for Throughput

EnvironmentModelThroughput
Raspberry Pi 53B Q42–7 tok/s
CPU-only server7B Q43–12 tok/s (memory-bandwidth bound)
M4 Max (Mac)7B Q4_0~83 tok/s
RTX 4090 24GB8B Q4~95–110 tok/s (single request)

A Mac's unified memory lets the entire RAM pool act as VRAM, so 16GB handles 7–8B and 128GB can target up to 70B Q4. A single 80GB datacenter card (A100/H100) fits 32B FP16 or 70B INT4 on one card.

Choosing an Inference Engine

The same model fits different scenarios depending on the engine.

ScenarioRecommended engineWhy
Laptop dev / air-gapped endpointsllama.cpp, Ollama, LM Studio, llamafileCPU+GPU hybrid, GGUF, offline by default
Single-GPU dev / small servingOllama, llama.cpp (llama-server), vLLMOpenAI-compatible API, light ops
Multi-GPU production serving (throughput)vLLM, SGLangContinuous batching, tensor/pipeline parallelism, paged KV

Points worth noting from an air-gapped / on-premise angle:

  • The llama.cpp family (including Ollama, LM Studio, llamafile) has first-class ARM64 support, handling NEON and Apple Silicon Metal natively. The single-binary llamafile is especially strong for air-gapped distribution.
  • vLLM also ships Arm CPU wheels since 0.11.2 (tested on Graviton3), and for GPU serving it separately optimizes prefill (compute-bound) and decode (memory-bound) via chunked prefill and continuous batching. If prefill latency is your bottleneck, this feature in vLLM/SGLang is a direct candidate solution.
  • TGI (Text Generation Inference) is now in maintenance mode. For new builds, prefer vLLM or SGLang.

The Big Picture in 2026: Efficient Architectures and Data Sovereignty

Two forces drive the small/efficient model trend.

Architectural efficiency. Large open-weight models are driving down long-context cost with sparse attention. DeepSeek's DSA (Sparse Attention) was established in V3.2 (Dec 2025, MIT), and Z.ai's GLM-5.2 (Jun 2026, MIT, ~753B/40B active, 1M context) cut per-token FLOPs at 1M context with a DSA-based IndexShare. MiniMax M3, using its own MSA (MiniMax Sparse Attention), reports cutting per-token compute at 1M context to about 1/20 of the previous generation, with 9×+ prefill and 15×+ decode speedups. These are large models, but sparse/linear attention designs are clearly worth putting on the evaluation shortlist.

Data sovereignty and cost. The reasons organizations choose self-hosting are clear: regulations like ITAR, HIPAA, and GDPR; zero-egress requirements; and cost. Open-weight (especially Apache 2.0 / MIT) weights only need to be downloaded once and run inside the network, fundamentally resolving data-residency concerns. Gartner expects more than half of enterprise GenAI models to be domain-specific by 2027 (up from ~1% in 2024).

Adoption Checklist

Items to check before putting a small model into a new workload.

  • Pin the task type — classification / summarization / RAG / tool-calling / coding / agent (thresholds differ)
  • Pick a size above the threshold — meet the per-task minimum from the table above
  • Confirm effective context — design around effective length (typically 16–32K), not the advertised length
  • Decide quantization — default to Q4_K_M; Q5/Q6 for code and math; avoid below Q3
  • Estimate memory — weights + KV cache (factoring in context length) + overhead
  • Match hardware — laptop / single GPU / datacenter card
  • Choose an inference engine — llama.cpp family for air-gapped/ARM64; vLLM/SGLang for production serving
  • Check the license — Apache 2.0 / MIT are safe; verify terms for Llama / Gemma licenses

The era of small models is an era not of "size" but of "the right size for the task." Define the workload precisely, and the range you can run inside an air-gapped network — with no external API — is wider than you'd think.


References