The 2026 AI Model Landscape — Openness, Parameters, Quantization, Memory, and Use Cases

From GPT-5.5, Claude, Gemini, Llama 4, Qwen3, and DeepSeek to Korea's EXAONE, Solar, and Kanana — a June 2026 comparison of major models by openness, parameters, context, quantization, required VRAM, and use case.

Data Dynamics · June 24, 2026 · 12 min read

When choosing which model to use, the questions we actually ask are these: "Is it an open model, will it fit on our GPU, and what is it good for?" This article is a snapshot table that brings the major models, as of June 2026, together in one place — organized by openness, parameters, context, quantization, required memory (VRAM), and primary use case.

⚠️ This changes fast. The figures below are current as of June 2026, and the parameters and internal architecture of closed models are mostly undisclosed. VRAM numbers are "ballpark figures to give you a feel," and they vary with context length, batch size, and runtime. If the terms are unfamiliar, see the AI glossary; for quantization and fine-tuning, see the LoRA, QLoRA, and DoRA article.

The big picture

The model market splits broadly into closed frontier models (commercial APIs) and open-weight models (which you can host yourself), and within each there are general-purpose, reasoning, code, multimodal, embedding, small, and Korean-specialized models.

First: getting a feel for required memory (VRAM)

The first wall you hit when running open models yourself is GPU memory. The core formula is simple.

Weight VRAM ≈ number of parameters × bytes per parameter. FP16 takes about 2GB per 1B parameters, 8-bit about 1GB, and 4-bit about 0.5GB. On top of that, add another 20–40% for the KV cache and activations.

Parameters	FP16	8-bit	4-bit (+overhead)	GPU that can run 4-bit
1B	~2GB	~1GB	~1GB	laptop/iGPU, 8GB
3B	~6GB	~3GB	~2.5GB	8GB
7–8B	~16GB	~8GB	~5–6GB	8–12GB
13–14B	~28GB	~14GB	~9–10GB	12–16GB
27–32B	~60GB	~32GB	~18–20GB	24GB (RTX 4090/5090)
70B	~140GB	~70GB	~40–43GB	48GB×1 or 24GB×2
~110B (MoE total)	~220GB	~110GB	~60–67GB	80GB (H100)×1
235B+	—	—	~130–240GB	multi-GPU
671B (V3 class)	—	—	~380–400GB	multi-node
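
The rule of thumb behind this table can be sketched in a few lines of Python. This is a rough estimator under the article's own assumptions (2 / 1 / 0.5 bytes per parameter, plus 20–40% overhead); the function name `estimate_vram_gb` and its parameters are illustrative and not part of any library.

```python
# Rule-of-thumb VRAM estimate: weights = parameters x bytes per parameter,
# plus 20-40% on top for KV cache and activations (figures from the text above).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str = "int4",
                     overhead: float = 0.3) -> float:
    """Return an approximate VRAM requirement in GB.

    params_billion: total parameter count in billions (for MoE, the total).
    overhead: fraction added for KV cache and activations (0.2-0.4 in the text).
    """
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    for size in (8, 14, 32, 70):
        low = estimate_vram_gb(size, "int4", overhead=0.2)
        high = estimate_vram_gb(size, "int4", overhead=0.4)
        print(f"{size}B at 4-bit: ~{low:.1f}-{high:.1f} GB")
```

The table's 4-bit figures sit near the lower end of the ranges this produces, which is expected for short contexts and a batch size of 1.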

Three practical points:

For MoE (mixture of experts), the "total parameters" determine memory and the "active parameters" determine speed. Example: DeepSeek-V3 has to load all 671B parameters into memory, but it activates only 37B of them per token, so it generates faster than a dense model of the same total size.
Long context consumes more memory through the KV cache. Actually filling 1M tokens can require additional memory on the order of the weights themselves.
Quantization cuts weight memory to about half (8-bit) or a quarter (4-bit) of FP16. Accuracy-preserving approaches such as QAT and AWQ, and the quantized GGUF files used by llama.cpp, typically lose little accuracy.
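
To get a feel for the long-context point, the KV cache can be estimated with the usual formula for a standard attention stack: two tensors (K and V) per layer, each holding `kv_heads × head_dim` values per token. The configuration below (80 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical 70B-class setup chosen for illustration, not the published architecture of any model in this article.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0,
                batch_size: int = 1) -> float:
    """Approximate KV cache size in GB for a standard attention stack.

    Two tensors (K and V) are stored per layer, each of shape
    [batch, kv_heads, tokens, head_dim].
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens * batch_size
    return values * bytes_per_value / 1e9


# Hypothetical 70B-class config (assumed: 80 layers, 8 KV heads, head_dim 128).
for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens: ~{kv_cache_gb(80, 8, 128, tokens):.1f} GB")
```

Under these assumptions the cache grows from a few GB at 8K tokens to hundreds of GB at 1M tokens, which is why a long context window on the spec sheet does not mean the full window fits on the same GPU as the weights. Architectures that compress or share the cache will need less.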

Quantization formats at a glance

Format	Where	Characteristics
GGUF	llama.cpp / Ollama	CPU- and consumer-GPU-friendly, the standard for local execution
AWQ / GPTQ	vLLM / TGI serving	4-bit post-training quantization, accuracy-preserving
bitsandbytes (NF4)	training / QLoRA	4-bit loading + fine-tuning
FP8	H100/B200 serving	balances speed and accuracy on the latest GPUs
QAT	training before deployment	quantization-aware training minimizes loss (e.g., Gemma 3)
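
To make concrete what these formats do to the weights, here is a deliberately simple sketch of symmetric round-to-nearest 4-bit quantization with a single scale per group. It is not AWQ, GPTQ, NF4, or a GGUF scheme; those add calibration, non-uniform code books, or per-block tricks on top of this basic idea. The sample weights and function names are made up for illustration.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization with one scale for the group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from integer codes and the shared scale."""
    return [c * scale for c in codes]


# Made-up weights for illustration.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", round(scale, 4))
print("max abs error:", round(max_error, 4))
```

Each weight is stored as a small integer plus a shared scale, so the rounding error is bounded by half a quantization step. The formats in the table differ mainly in how they choose scales and which weights they protect.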

A. Closed frontier (commercial APIs)

Parameters and architecture are undisclosed. You use them via API without worrying about memory, but your data leaves your environment and you pay per token.

Model	Provider	Context	Modality	Price (input / output, USD per 1M tokens)	Primary use case
GPT-5.5	OpenAI	1M (API)	text · vision · computer use	$5 / $30	top-tier general purpose · agents · coding
GPT-5.5 Pro	OpenAI	1M	multimodal	(higher tier)	high-accuracy, high-difficulty tasks
GPT-5.5 Instant	OpenAI	—	text	(ChatGPT default)	fast conversation · everyday tasks
Claude Opus 4.8	Anthropic	1M	text · vision	$5 / $25	complex reasoning · long-horizon agentic coding
Claude Sonnet 4.6	Anthropic	1M	text · vision	$3 / $15	best value top-tier general purpose
Claude Haiku 4.5	Anthropic	200K	text · vision	$1 / $5	high-speed, high-volume classification · routing · code review
Gemini 3.1 Pro	Google	1M	multimodal	(premium)	reasoning · complex agentic workflows
Gemini 3.5 Flash	Google	1M	multimodal	(mid-priced)	agents · coding, balanced
Gemini 3.1 Flash-Lite	Google	high capacity	multimodal	(lowest price)	high-volume · low-latency · cost-sensitive
Grok 4.3	xAI	1M–2M	text · vision · video	$1.25 / $2.50	ultra-long-context · real-time · value frontier
Qwen3-Max	Alibaba	high capacity	text	(API)	1T+ class closed flagship
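
Per-token pricing is easier to compare once it is turned into a monthly bill. The sketch below uses only the prices listed in the table; the workload (100,000 requests per month, 2,000 input and 500 output tokens each) is a hypothetical example, and the helper name `monthly_cost_usd` is illustrative.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Grok 4.3": (1.25, 2.50),
}


def monthly_cost_usd(model: str, requests: int,
                     input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend for a fixed per-request token budget."""
    price_in, price_out = PRICES[model]
    per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_request * requests


# Hypothetical workload: 100,000 requests/month, 2,000 input + 500 output tokens.
for model in PRICES:
    print(f"{model:<18} ${monthly_cost_usd(model, 100_000, 2_000, 500):,.2f}")
```

Output tokens are priced several times higher than input tokens for most of these models, so reasoning-heavy or long-form workloads shift the comparison more than the input price alone suggests.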

Anthropic released Mythos 5 and Fable 5 (higher tier, 1M context) on June 9 but temporarily suspended access on June 12. The lineup changes frequently, so check the official pricing and model pages.

B. Open-weight general purpose (can be self-hosted)

VRAM figures are approximate at 4-bit quantization. For MoE, both total and active parameters are shown.

Model	Provider	Parameters (active)	Context	License	4-bit VRAM (approx.)	Primary use case
Llama 4 Scout	Meta	109B total / 17B	10M	Llama Community	~60GB (1× H100)	ultra-long-context search · RAG
Llama 4 Maverick	Meta	400B total / 17B	1M	Llama Community	~220GB (multi-GPU)	multimodal · long-form generation
Qwen3.6 (dense 27B)	Alibaba	27B	1M	Apache 2.0	~16GB (24GB GPU)	latest flagship dense
Qwen3.5	Alibaba	397B total / 17B	262K	Apache 2.0	~220GB (multi-GPU)	large MoE · multimodal agents
Qwen3 32B	Alibaba	32B	128K+	Apache 2.0	~18–20GB	high-performance general-purpose dense
Qwen3 235B-A22B	Alibaba	235B total / 22B	128K+	Apache 2.0	~130GB (multi-GPU)	top-tier open MoE
DeepSeek-V4 (preview)	DeepSeek	large MoE	1M	MIT	multi-node	ultra-long-context production
DeepSeek-V3	DeepSeek	671B total / 37B	128K	MIT	~380–400GB (multi-node)	top-tier general-purpose open MoE
Mistral Large 3	Mistral	675B total / 41B	high capacity	Apache 2.0	multi-node	largest-class open MoE
Mistral Small 4	Mistral	~24B class	high capacity	Apache 2.0	~14–15GB	single model unifying reasoning + vision + coding
Gemma 3 27B	Google	27B	128K	Gemma	~16GB (QAT, 24GB GPU)	open multimodal general purpose
Phi-4	Microsoft	14B	—	MIT	~9–10GB	small but strong reasoning SLM
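
The total-versus-active distinction in this table can be made concrete with a small calculation. The sketch assumes 4-bit weights (0.5 bytes per parameter, no overhead) and the common approximation of about 2 FLOPs per active parameter per generated token; both are simplifications, and `moe_profile` is an illustrative name.

```python
# (total, active) parameters in billions, copied from the table above.
MODELS = {
    "Llama 4 Scout": (109, 17),
    "Qwen3 235B-A22B": (235, 22),
    "DeepSeek-V3": (671, 37),
    "Qwen3 32B (dense)": (32, 32),
}


def moe_profile(total_b: float, active_b: float, bytes_per_param: float = 0.5):
    """Weights memory follows total params; per-token compute follows active params."""
    weights_gb = total_b * bytes_per_param   # 4-bit weights only, no overhead
    gflops_per_token = 2 * active_b          # assumed ~2 FLOPs per active parameter
    return weights_gb, gflops_per_token


for name, (total_b, active_b) in MODELS.items():
    mem, flops = moe_profile(total_b, active_b)
    print(f"{name:<20} ~{mem:>6.1f} GB weights, ~{flops} GFLOPs/token")
```

Llama 4 Scout needs more than three times the weight memory of Qwen3 32B, yet does roughly half the compute per token. That trade-off is what the "total / active" column is meant to show.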

C. Reasoning-specialized

These "think" longer before answering, making them strong in math, coding, and logic. They use many tokens, so cost and latency are high.

Model	Provider	Open	Characteristics
GPT-5.5 Thinking	OpenAI	✕	frontier reasoning, strong agentic coding
Claude (extended thinking)	Anthropic	✕	adaptive thinking mode for Opus/Sonnet
Gemini 3.1 Pro (adaptive thinking)	Google	✕	automatically adjusts amount of thinking
DeepSeek-R1	DeepSeek	✓ (MIT)	the flagship open reasoning model; distilled versions (1.5B–70B) available
Qwen3 Thinking	Alibaba	✓	can switch between thinking / non-thinking
Phi-4-reasoning(-plus)	Microsoft	✓ (MIT)	14B small reasoning, good value
Magistral	Mistral	✓	reasoning-specialized (integrated into Small 4)

D. Specialized — code · multimodal

Model	Category	Open	Notes
Qwen3-Coder 480B-A35B	code	✓	large coding-only MoE
GPT-5.5-Codex	code	✕	agentic coding (400K ctx)
Devstral / Codestral	code	✓	Mistral coding family
DeepSeek-Coder	code	✓	open coding
Llama 4 (Scout/Maverick)	multimodal	✓	native multimodal (image + text)
Qwen-VL / Pixtral	vision	✓	image understanding
Gemma 3 (4B and up)	vision	✓	lightweight multimodal
Phi-4-multimodal / vision-15B	vision · audio	✓ (MIT)	small multimodal reasoning

E. Embedding models (for RAG / search)

These are not generative models; they convert text into vectors. The number of dimensions drives storage and search cost.

Model	Provider	Open	Characteristics · use case
text-embedding-3-large / small	OpenAI	✕	general-purpose default, good-value small variant
Cohere embed-v4	Cohere	✕	multilingual · hybrid (dense + sparse)
Voyage-3-large / 3.5	Voyage	✕	top-tier retrieval quality (high latency)
BGE-M3	BAAI	✓	open · multilingual · the self-hosting standard
Qwen3-Embedding	Alibaba	✓	open, good value
Jina v5	Jina	✓/✕	excellent accuracy/cost for text RAG
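
The effect of dimension count on storage is simple arithmetic. The sketch below assumes a flat float32 index with no compression or metadata; the dimension counts are illustrative values, not the specifications of the models listed above, so check each model's documentation for the real figure.

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for a flat vector index (float32 by default), excluding metadata."""
    return n_vectors * dims * bytes_per_value / 1e9


# Illustrative dimension counts for a corpus of 10 million chunks.
for dims in (384, 1024, 3072):
    print(f"{dims:>5} dims, 10M chunks: ~{index_size_gb(10_000_000, dims):.1f} GB")
```

Brute-force similarity search scales with the same product of vectors and dimensions, so a model with three times the dimensions costs roughly three times as much to store and to scan unless the index uses quantization or dimensionality reduction.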

F. Small / edge models (laptop · on-device)

These can run on small GPUs, CPUs, or on-device. At 4-bit, most run even on a laptop.

Model	Parameters	4-bit VRAM	Where it runs
Gemma 3 270M	0.27B	< 0.5GB	CPU · mobile
Qwen3 0.6B / 1.7B	0.6 / 1.7B	~0.5 / ~1.5GB	laptop · iGPU
Llama 3.2 1B / 3B	1 / 3B	~1 / ~2.5GB	laptop · edge
Gemma 3 1B / 4B	1 / 4B	~1 / ~3GB	laptop (4B supports vision)
Ministral 3B	3B	~2GB	edge · on-device
Phi-4-mini	3.8B	~2.5GB	laptop, built-in function calling

G. Korean-specialized (domestic models)

These models matter from the perspective of Korean-language quality, domestic regulation, and sovereign AI. Many are released as open weights, so they can be self-hosted.

Model	Provider	Parameters (active)	Open	4-bit VRAM (approx.)	Characteristics · use case
EXAONE 4.0	LG AI	30B	✓ (HF)	~18GB	competitive on global benchmarks, general purpose
EXAONE 4.5	LG AI	(multimodal)	✓	variable	text · image multimodal reasoning
HyperCLOVA X (Seed 32B Think)	Naver	32B	✓	~19GB	Korean search · reasoning
HyperCLOVA X 8B Omni	Naver	8B	✓	~6GB	lightweight multimodal
Solar Pro 2	Upstage	31B	partial	~18GB	Korea's first global-class frontier
Kanana-2	Kakao	70B total / MoE (8 active)	✓	~40GB	agents · long-form reasoning, 128K, Korean MMLU 89%
Mi:dm 2.0 Base / Mini	KT	11.5B / 2.3B	✓ (MIT)	~7GB / ~1.5GB	Korea-specialized general purpose / lightweight
Mi:dm K 2.5 Pro	KT	32B	✓	~19GB	enhanced knowledge · reasoning, 128K
A.X 4.0 / 3.1	SKT	34B	✓	~20GB	from-scratch Korean, includes lightweight variant
A.X K1	SKT	519B	✓	multi-node	sovereign ultra-large model

"What can I run on this GPU?" (at 4-bit)

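
One way to answer this question is a simple lookup against the approximate 4-bit figures from the tables in this article. The model list below is a small sample, the values are the article's ballpark numbers (upper end of each range), and the 90% headroom factor is an assumption to leave room for the KV cache and runtime.

```python
# Approximate 4-bit VRAM needs in GB, taken from the tables in this article
# (upper end of each quoted range).
MODEL_VRAM_GB = {
    "Phi-4-mini": 2.5,
    "HyperCLOVA X 8B Omni": 6,
    "Phi-4": 10,
    "Gemma 3 27B": 16,
    "EXAONE 4.0": 18,
    "Qwen3 32B": 20,
    "Kanana-2": 40,
    "Llama 4 Scout": 60,
    "Qwen3 235B-A22B": 130,
}


def models_that_fit(gpu_vram_gb: float, headroom: float = 0.9) -> list[str]:
    """Return models whose approximate need fits within a share of GPU memory."""
    budget = gpu_vram_gb * headroom  # assumed: keep ~10% free for runtime and cache
    return [name for name, need in MODEL_VRAM_GB.items() if need <= budget]


for gpu in (8, 24, 80):
    print(f"{gpu}GB GPU:", ", ".join(models_that_fit(gpu)))
```

With these numbers, an 8GB card covers the small models, a 24GB card reaches the 27–32B class, and a single 80GB card reaches Llama 4 Scout. Long contexts or larger batches shrink that list.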

Model selection guide

In short, the choice comes down to three paths. (1) If data can't leave your environment, self-host open weights; GPU memory then determines how big a model you can run. (2) If Korean is essential, look first at domestic models. (3) If you want minimal ops burden and top performance, use a closed API, where the options split into performance, value, and low-cost/high-volume tiers.
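
The three paths can be written down as a small decision function. This is only a sketch of the ordering described here (data policy first, then Korean, then GPU memory); the function name, parameters, and returned strings are hypothetical and would need to be replaced with a real shortlist.

```python
def suggest_path(data_must_stay: bool, korean_essential: bool,
                 gpu_vram_gb: float = 0.0) -> str:
    """Hypothetical helper mirroring the order: data policy, Korean, GPU memory."""
    if data_must_stay:
        # Path 1: self-hosting, bounded by available GPU memory.
        if gpu_vram_gb <= 0:
            return "self-host required: provision a GPU first"
        family = "domestic open-weight models" if korean_essential else "open-weight models"
        return f"self-host {family} that fit in {gpu_vram_gb:g}GB at 4-bit"
    if korean_essential:
        # Path 2: Korean quality comes first.
        return "evaluate domestic models first, then closed APIs"
    # Path 3: minimal ops burden.
    return "closed API: pick a performance, value, or low-cost tier"


print(suggest_path(data_must_stay=True, korean_essential=True, gpu_vram_gb=24))
print(suggest_path(data_must_stay=False, korean_essential=False))
```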

Wrapping up

The model landscape of 2026 takes the shape of "closed models leading on performance, open weights catching up fast, and domestic models establishing themselves as sovereign AI." The starting point for a practical choice is always the same — narrow down in the order data policy → need for Korean → the GPU (memory) you have → use case, and the candidates naturally shrink.

This table is a living document. Models change every month, so just before adoption, check each provider's official model and pricing pages once more for the latest figures. For concrete ways to reduce memory through quantization, see the LoRA, QLoRA, and DoRA article.