The 2026 AI Model Landscape — Openness, Parameters, Quantization, Memory, and Use Cases
From GPT-5.5, Claude, Gemini, Llama 4, Qwen3, and DeepSeek to Korea's EXAONE, Solar, and Kanana — a June 2026 comparison of major models by openness, parameters, context, quantization, required VRAM, and use case.
When choosing a model, the questions we actually ask are these: "Is it an open model, will it fit on our GPU, and what is it good for?" This article is a snapshot, as of June 2026, that gathers the major models in one place and organizes them by openness, parameters, context, quantization, required memory (VRAM), and primary use case.
⚠️ This changes fast. The figures below are current as of June 2026, and the parameters and internal architecture of closed models are mostly undisclosed. VRAM numbers are "ballpark figures to give you a feel," and they vary with context length, batch size, and runtime. If the terms are unfamiliar, see the AI glossary; for quantization and fine-tuning, see the LoRA, QLoRA, and DoRA article.
The big picture
The model market splits broadly into closed frontier models (commercial APIs) and open-weight models (which you can host yourself), and within each there are general-purpose, reasoning, code, multimodal, embedding, small, and Korean-specialized models.
First: getting a feel for required memory (VRAM)
The first wall you hit when running open models yourself is GPU memory. The core formula is simple.
Weight VRAM ≈ number of parameters × bytes per parameter. Per 1B parameters, FP16 takes about 2GB, 8-bit about 1GB, and 4-bit about 0.5GB. On top of that, budget another 20–40% for the KV cache and activations.
| Parameters | FP16 | 8-bit | 4-bit (+overhead) | GPU that can run 4-bit |
|---|---|---|---|---|
| 1B | ~2GB | ~1GB | ~1GB | laptop/iGPU, 8GB |
| 3B | ~6GB | ~3GB | ~2.5GB | 8GB |
| 7~8B | ~16GB | ~8GB | | 8~12GB |
| 13~14B | ~28GB | ~14GB | | 12~16GB |
| 27~32B | ~60GB | ~32GB | | 24GB (RTX 4090/5090) |
| 70B | ~140GB | ~70GB | | 48GB×1 or 24GB×2 |
| ~110B (MoE total) | ~220GB | ~110GB | | 80GB (H100)×1 |
| 235B+ | — | — | | multi-GPU |
| 671B (V3 class) | — | — | | multi-node |
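The rule of thumb above can be turned into a few lines of Python. This is a minimal sketch, not a sizing tool: the function names `weight_vram_gb` and `vram_range_gb` are made up for illustration, and the only inputs are the bytes-per-parameter figures and the 20–40% overhead allowance stated in this section.

```python
# Bytes per parameter for each precision, following the rule of thumb in the text.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def vram_range_gb(params_billion: float, precision: str,
                  overhead=(0.20, 0.40)) -> tuple:
    """Weights plus the 20-40% allowance for KV cache and activations."""
    base = weight_vram_gb(params_billion, precision)
    return (base * (1 + overhead[0]), base * (1 + overhead[1]))


if __name__ == "__main__":
    for size in (8, 32, 70):
        low, high = vram_range_gb(size, "int4")
        print(f"{size}B at 4-bit: {weight_vram_gb(size, 'int4'):.1f} GB weights, "
              f"{low:.1f} to {high:.1f} GB with overhead")
```

Real usage still depends on context length, batch size, and runtime, so treat the output as a first filter and not as a guarantee that a model fits.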
Three practical points:
- For MoE (mixture of experts), the "total parameters" determine memory and the "active parameters" determine speed. Example: DeepSeek-V3 has to load all 671B parameters into memory but computes with only 37B per token, so it runs faster than a dense model of the same total size.
- Long context consumes more memory through the KV cache. Actually filling 1M tokens can require additional memory on the order of the weights themselves.
- Quantization cuts memory to a half (8-bit) or a quarter (4-bit) of FP16. Methods and formats designed to preserve accuracy, such as QAT, AWQ, and GGUF, lose little quality.
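The first two points can be made concrete with a small calculation. The sketch below uses the usual KV-cache size formula for transformer attention (one key and one value tensor per layer). The layer count, KV-head count, and head dimension in the example are hypothetical values chosen for illustration, not the configuration of any model in this article. The MoE figures (671B total, 37B active) are the DeepSeek-V3 numbers quoted above.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB for a given context length.

    The factor 2 accounts for storing both keys and values in every layer.
    bytes_per_elem=2.0 assumes an FP16 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_elem)
    return total_bytes / 1e9


def moe_summary(total_params_b: float, active_params_b: float,
                bytes_per_param: float = 0.5) -> dict:
    """Memory follows total parameters; per-token compute follows active ones."""
    return {
        "weights_gb": total_params_b * bytes_per_param,
        "active_share": active_params_b / total_params_b,
    }


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
    for tokens in (8_000, 128_000, 1_000_000):
        print(f"{tokens:>9} tokens: {kv_cache_gb(32, 8, 128, tokens):.1f} GB KV cache")

    # DeepSeek-V3 figures from the text, at 4-bit (0.5 bytes per parameter).
    print(moe_summary(671, 37))
```

With these assumed dimensions the cache grows linearly with context, which is why a long-context deployment has to budget for it separately from the weights.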
Quantization formats at a glance
| Format | Where | Characteristics |
|---|---|---|
| GGUF | llama.cpp / Ollama | CPU- and consumer-GPU-friendly, the standard for local execution |
| AWQ / GPTQ | vLLM / TGI serving | 4-bit post-training quantization, accuracy-preserving |
| bitsandbytes (NF4) | training / QLoRA | 4-bit loading + fine-tuning |
| FP8 | H100/B200 serving | balances speed and accuracy on the latest GPUs |
| QAT | training before deployment | quantization-aware training minimizes loss (e.g., Gemma 3) |
A. Closed frontier (commercial APIs)
Parameters and architecture are undisclosed. You use them via API without worrying about memory, but your data leaves your environment and you pay per token.
| Model | Provider | Context | Modality | Price (input/output, 1M tokens) | Primary use case |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 1M (API) | text · vision · computer use | $5 / $30 | top-tier general purpose · agents · coding |
| GPT-5.5 Pro | OpenAI | 1M | multimodal | (higher tier) | high-accuracy, high-difficulty tasks |
| GPT-5.5 Instant | OpenAI | — | text | (ChatGPT default) | fast conversation · everyday tasks |
| Claude Opus 4.8 | Anthropic | 1M | text · vision | $5 / $25 | complex reasoning · long-horizon agentic coding |
| Claude Sonnet 4.6 | Anthropic | 1M | text · vision | $3 / $15 | best value top-tier general purpose |
| Claude Haiku 4.5 | Anthropic | 200K | text · vision | $1 / $5 | high-speed, high-volume classification · routing · code review |
| Gemini 3.1 Pro | Google | 1M | multimodal | (premium) | reasoning · complex agentic workflows |
| Gemini 3.5 Flash | Google | 1M | multimodal | (mid-priced) | agents · coding, balanced |
| Gemini 3.1 Flash-Lite | Google | high capacity | multimodal | (lowest price) | high-volume · low-latency · cost-sensitive |
| Grok 4.3 | xAI | 1M~2M | text · vision · video | $1.25 / $2.50 | ultra-long-context · real-time · value frontier |
| Qwen3-Max | Alibaba | high capacity | text | (API) | 1T+ class closed flagship |
Anthropic released Mythos 5 and Fable 5 (higher tier, 1M context) on June 9 but temporarily suspended access on June 12. The lineup changes frequently, so check the official pricing and model pages.
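Because closed models bill per token, a rough monthly estimate is simple arithmetic. The sketch below copies the input/output prices from the table above (a June 2026 snapshot) and ignores anything the table does not list, such as caching or batch discounts. The function name `api_cost_usd` and the traffic numbers are illustrative assumptions.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICE_PER_MTOK = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Grok 4.3": (1.25, 2.50),
}


def api_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given number of input and output tokens."""
    price_in, price_out = PRICE_PER_MTOK[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens and 2M output tokens per month.
    for name in PRICE_PER_MTOK:
        print(f"{name:<18} ${api_cost_usd(name, 10_000_000, 2_000_000):,.2f}")
```

Output tokens are priced several times higher than input tokens in every row, so workloads that generate long answers shift the comparison more than workloads that mostly read.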
B. Open-weight general purpose (can be self-hosted)
VRAM figures are approximate at 4-bit quantization. For MoE, both total and active parameters are shown.
| Model | Provider | Parameters (active) | Context | License | 4-bit VRAM (approx.) | Primary use case |
|---|---|---|---|---|---|---|
| Llama 4 Scout | Meta | 109B total / 17B | 10M | Llama Community | ~60GB (1× H100) | ultra-long-context search · RAG |
| Llama 4 Maverick | Meta | 400B total / 17B | 1M | Llama Community | ~220GB (multi-GPU) | multimodal · long-form generation |
| Qwen3.6 (dense 27B) | Alibaba | 27B | 1M | Apache 2.0 | ~16GB (24GB GPU) | latest flagship dense |
| Qwen3.5 | Alibaba | 397B total / 17B | 262K | Apache 2.0 | ~220GB (multi-GPU) | large MoE · multimodal agents |
| Qwen3 32B | Alibaba | 32B | 128K+ | Apache 2.0 | | high-performance general-purpose dense |
| Qwen3 235B-A22B | Alibaba | 235B total / 22B | 128K+ | Apache 2.0 | ~130GB (multi-GPU) | top-tier open MoE |
| DeepSeek-V4 (preview) | DeepSeek | large MoE | 1M | MIT | multi-node | ultra-long-context production |
| DeepSeek-V3 | DeepSeek | 671B total / 37B | 128K | MIT | | top-tier general-purpose open MoE |
| Mistral Large 3 | Mistral | 675B total / 41B | high capacity | Apache 2.0 | multi-node | largest-class open MoE |
| Mistral Small 4 | Mistral | ~24B class | high capacity | Apache 2.0 | | single model unifying reasoning + vision + coding |
| Gemma 3 27B | Google | 27B | 128K | Gemma | ~16GB (QAT, 24GB GPU) | open multimodal general purpose |
| Phi-4 | Microsoft | 14B | — | MIT | | small but strong reasoning SLM |
C. Reasoning-specialized
These "think" longer before answering, making them strong in math, coding, and logic. They use many tokens, so cost and latency are high.
| Model | Provider | Open | Characteristics |
|---|---|---|---|
| GPT-5.5 Thinking | OpenAI | ✕ | frontier reasoning, strong agentic coding |
| Claude (extended thinking) | Anthropic | ✕ | adaptive thinking mode for Opus/Sonnet |
| Gemini 3.1 Pro (adaptive thinking) | Google | ✕ | automatically adjusts amount of thinking |
| DeepSeek-R1 | DeepSeek | ✓ (MIT) | the flagship open reasoning model, distilled versions (1.5B~70B) available |
| Qwen3 Thinking | Alibaba | ✓ | can switch between thinking / non-thinking |
| Phi-4-reasoning(-plus) | Microsoft | ✓ (MIT) | 14B small reasoning, good value |
| Magistral | Mistral | ✓ | (integrated into Small 4) reasoning-specialized |
D. Specialized — code · multimodal
| Model | Category | Open | Notes |
|---|---|---|---|
| Qwen3-Coder 480B-A35B | code | ✓ | large coding-only MoE |
| GPT-5.5-Codex | code | ✕ | agentic coding (400K ctx) |
| Devstral / Codestral | code | ✓ | Mistral coding family |
| DeepSeek-Coder | code | ✓ | open coding |
| Llama 4 (Scout/Maverick) | multimodal | ✓ | native multimodal (image + text) |
| Qwen-VL / Pixtral | vision | ✓ | image understanding |
| Gemma 3 (4B and up) | vision | ✓ | lightweight multimodal |
| Phi-4-multimodal / vision-15B | vision · audio | ✓ (MIT) | small multimodal reasoning |
E. Embedding models (for RAG / search)
These are not generative models; they convert text into vectors. The number of dimensions drives storage and search cost.
| Model | Provider | Open | Characteristics · use case |
|---|---|---|---|
| text-embedding-3-large / small | OpenAI | ✕ | general-purpose default, good-value small variant |
| Cohere embed-v4 | Cohere | ✕ | multilingual · hybrid (dense + sparse) |
| Voyage-3-large / 3.5 | Voyage | ✕ | top-tier retrieval quality (high latency) |
| BGE-M3 | BAAI | ✓ | open · multilingual · the self-hosting standard |
| Qwen3-Embedding | Alibaba | ✓ | open, good value |
| Jina v5 | Jina | ✓/✕ | excellent accuracy/cost for text RAG |
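To see how dimensionality drives storage and search cost, the sketch below estimates the raw size of a vector index and the work for one brute-force query. The dimension counts (1024 and 3072) and the corpus size are illustrative values, not figures taken from the table, and the estimate ignores index structures and metadata.

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage in GB for n_vectors embeddings (float32 by default)."""
    return n_vectors * dims * bytes_per_value / 1e9


def brute_force_multiply_adds(n_vectors: int, dims: int) -> int:
    """Multiply-add operations to score one query against every stored vector."""
    return n_vectors * dims


if __name__ == "__main__":
    n = 1_000_000  # hypothetical corpus of one million chunks
    for dims in (1024, 3072):
        print(f"{dims} dims: {index_size_gb(n, dims):.3f} GB, "
              f"{brute_force_multiply_adds(n, dims):,} multiply-adds per query")
```

Tripling the dimension triples both numbers, which is why a smaller embedding variant can be the better choice when the corpus is large and the quality gap is small.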
F. Small / edge models (laptop · on-device)
These can run on small GPUs, CPUs, or on-device. At 4-bit, most run even on a laptop.
| Model | Parameters | 4-bit VRAM | Where it runs |
|---|---|---|---|
| Gemma 3 270M | 0.27B | < 0.5GB | CPU · mobile |
| Qwen3 0.6B / 1.7B | 0.6 / 1.7B | ~0.5 / ~1.5GB | laptop · iGPU |
| Llama 3.2 1B / 3B | 1 / 3B | ~1 / ~2.5GB | laptop · edge |
| Gemma 3 1B / 4B | 1 / 4B | ~1 / ~3GB | laptop (4B supports vision) |
| Ministral 3B | 3B | ~2GB | edge · on-device |
| Phi-4-mini | 3.8B | ~2.5GB | laptop, built-in function calling |
G. Korean-specialized (domestic models)
These models matter from the perspective of Korean-language quality, domestic regulation, and sovereign AI. Many are released as open weights, so they can be self-hosted.
| Model | Provider | Parameters (active) | Open | 4-bit VRAM (approx.) | Characteristics · use case |
|---|---|---|---|---|---|
| EXAONE 4.0 | LG AI | 30B | ✓ (HF) | ~18GB | competitive on global benchmarks, general purpose |
| EXAONE 4.5 | LG AI | (multimodal) | ✓ | variable | text · image multimodal reasoning |
| HyperCLOVA X (Seed 32B Think) | Naver | 32B | ✓ | ~19GB | Korean search · reasoning |
| HyperCLOVA X 8B Omni | Naver | 8B | ✓ | ~6GB | lightweight multimodal |
| Solar Pro 2 | Upstage | 31B | partial | ~18GB | Korea's first global-class frontier |
| Kanana-2 | Kakao | 70B total / MoE (8 active) | ✓ | ~40GB | agents · long-form reasoning, 128K, Korean MMLU 89% |
| Mi:dm 2.0 Base / Mini | KT | 11.5B / 2.3B | ✓ (MIT) | ~7GB / ~1.5GB | Korea-specialized general purpose / lightweight |
| Mi:dm K 2.5 Pro | KT | 32B | ✓ | ~19GB | enhanced knowledge · reasoning, 128K |
| A.X 4.0 / 3.1 | SKT | 34B | ✓ | ~20GB | from-scratch Korean, includes lightweight variant |
| A.X K1 | SKT | 519B | ✓ | multi-node | sovereign ultra-large model |
"What can I run on this GPU?" (at 4-bit)
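One way to answer this question is a simple lookup against the approximate 4-bit VRAM figures listed in the tables of this article. The sketch below copies a handful of those figures. The function name `models_that_fit` is illustrative, and treating several GPUs as one pooled budget is a simplifying assumption that ignores per-GPU overhead.

```python
# Approximate 4-bit VRAM needs in GB, copied from the tables in this article.
APPROX_4BIT_VRAM_GB = {
    "Phi-4-mini": 2.5,
    "Gemma 3 4B": 3.0,
    "HyperCLOVA X 8B Omni": 6.0,
    "Mi:dm 2.0 Base": 7.0,
    "Qwen3.6 (dense 27B)": 16.0,
    "Gemma 3 27B": 16.0,
    "EXAONE 4.0": 18.0,
    "Kanana-2": 40.0,
    "Llama 4 Scout": 60.0,
    "Qwen3 235B-A22B": 130.0,
}


def models_that_fit(gpu_vram_gb: float, n_gpus: int = 1) -> list:
    """Return the listed models whose approximate 4-bit footprint fits the budget."""
    budget = gpu_vram_gb * n_gpus  # assumption: VRAM pools perfectly across GPUs
    return [name for name, need in APPROX_4BIT_VRAM_GB.items() if need <= budget]


if __name__ == "__main__":
    for vram, count in ((8, 1), (24, 1), (24, 2), (80, 1)):
        print(f"{count} x {vram}GB: {', '.join(models_that_fit(vram, count))}")
```

The figures are ballpark values at short context, so a model that only just fits on paper may still run out of memory once the KV cache grows.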
Model selection guide
In short, the choice comes down to three paths. (1) If data can't leave your environment, self-host open weights; here GPU memory determines how big a model you can run. (2) If Korean is essential, look first at domestic models. (3) If you want minimal ops burden and top performance, use a closed API; closed APIs further divide into performance, value, and low-cost/high-volume tiers.
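The three paths can be written down as a small decision function. This is only a sketch of the reasoning in the paragraph above: the function name, the tier labels, and the rule that a closed API is the default whenever data may leave the environment are assumptions made for illustration.

```python
API_TIERS = ("performance", "value", "low-cost")


def recommend_path(data_must_stay: bool, korean_essential: bool,
                   api_tier: str = "value") -> list:
    """Return the paths to explore, in the order the guide suggests."""
    if api_tier not in API_TIERS:
        raise ValueError(f"api_tier must be one of {API_TIERS}")

    steps = []
    if data_must_stay:
        # Path 1: data policy rules out external APIs.
        steps.append("self-host open weights; GPU memory caps the model size")
    if korean_essential:
        # Path 2: Korean quality comes first.
        steps.append("look first at domestic (Korean-specialized) models")
    if not data_must_stay:
        # Path 3: assumed default when data is allowed to leave the environment.
        steps.append(f"closed API, {api_tier} tier")
    return steps


if __name__ == "__main__":
    print(recommend_path(data_must_stay=True, korean_essential=True))
    print(recommend_path(data_must_stay=False, korean_essential=False,
                         api_tier="performance"))
```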
Wrapping up
The model landscape of 2026 takes the shape of "closed models leading on performance, open weights catching up fast, and domestic models establishing themselves as sovereign AI." The starting point for a practical choice is always the same — narrow down in the order data policy → need for Korean → the GPU (memory) you have → use case, and the candidates naturally shrink.
This table is a living document. Models change every month, so just before adoption, check each provider's official model and pricing pages once more for the latest figures. For concrete ways to reduce memory through quantization, see the LoRA, QLoRA, and DoRA article.