The 2026 AI Model Landscape — Openness, Parameters, Quantization, Memory, and Use Cases

From GPT-5.5, Claude, Gemini, Llama 4, Qwen3, and DeepSeek to Korea's EXAONE, Solar, and Kanana — a June 2026 comparison of major models by openness, parameters, context, quantization, required VRAM, and use case.

Data Dynamics · June 24, 2026 · 12 min read

When choosing which model to use, the questions we actually ask are these: "Is it an open model, will it fit on our GPU, and what is it good for?" This article is a snapshot table that brings the major models, as of June 2026, together in one place — organized by openness, parameters, context, quantization, required memory (VRAM), and primary use case.

⚠️ This changes fast. The figures below are current as of June 2026, and the parameters and internal architecture of closed models are mostly undisclosed. VRAM numbers are "ballpark figures to give you a feel," and they vary with context length, batch size, and runtime. If the terms are unfamiliar, see the AI glossary; for quantization and fine-tuning, see the LoRA, QLoRA, and DoRA article.

The big picture

The model market splits broadly into closed frontier models (commercial APIs) and open-weight models (which you can host yourself), and within each there are general-purpose, reasoning, code, multimodal, embedding, small, and Korean-specialized models.


First: getting a feel for required memory (VRAM)

The first wall you hit when running open models yourself is GPU memory. The core formula is simple.

Weight VRAM ≈ number of parameters × bytes per parameter. FP16 needs about 2GB per 1B parameters, 8-bit about 1GB, and 4-bit about 0.5GB. On top of that, add another 20–40% for the KV cache and activations.
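The rule of thumb above can be sketched in a few lines of Python. This is a rough estimator only: the function name, the precision labels, and the default 30% overhead (a midpoint of the 20–40% range) are assumptions made for illustration, not part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billion, precision="int4", overhead=0.3):
    """Ballpark VRAM in GB: weight memory plus a fractional overhead.

    params_billion: parameter count in billions (total parameters for MoE).
    overhead: extra fraction for KV cache and activations (assumed 0.2-0.4).
    """
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    # 1B parameters at 1 byte each is about 1GB, so billions map directly to GB.
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    for size in (8, 32, 70):
        print(f"{size}B at 4-bit: ~{estimate_vram_gb(size):.1f}GB")
```

Real usage depends on context length, batch size, and runtime, so treat the result as a first filter rather than a guarantee that a model will fit.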

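| Parameters | FP16 | 8-bit | 4-bit (+overhead) | GPU that can run 4-bit |
| --- | --- | --- | --- | --- |
| 1B | ~2GB | ~1GB | ~1GB | laptop/iGPU, 8GB |
| 3B | ~6GB | ~3GB | ~2.5GB | 8GB |
| 7–8B | ~16GB | ~8GB | 5–6GB | 8–12GB |
| 13–14B | ~28GB | ~14GB | 9–10GB | 12–16GB |
| 27–32B | ~60GB | ~32GB | 18–20GB | 24GB (RTX 4090) or 32GB (RTX 5090) |
| 70B | ~140GB | ~70GB | 40–43GB | 48GB×1 or 24GB×2 |
| ~110B (MoE total) | ~220GB | ~110GB | 60–67GB | 80GB (H100)×1 |
| 235B+ |  |  | 130–240GB | multi-GPU |
| 671B (V3 class) |  |  | 380–400GB | multi-node |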

Three practical points:

  • For MoE (mixture of experts), total parameters determine memory and active parameters determine speed. Example: DeepSeek-V3 must load all 671B parameters into memory but computes with only 37B per token, so it generates faster than a dense model of similar total size.
  • Long context consumes extra memory through the KV cache, which grows linearly with the number of tokens. Actually filling 1M tokens can require additional memory on the order of the weights themselves.
  • Relative to FP16, quantization cuts weight memory to half (8-bit) or a quarter (4-bit). Accuracy-preserving approaches such as QAT, AWQ, and GGUF quantization keep the loss small.
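To make the long-context point concrete, here is a minimal sketch of the standard KV cache size calculation (one key and one value vector per layer, per KV head, per token). The model shape used in the example (32 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical 8B-class configuration chosen for illustration, not the published shape of any model in this article.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_tokens,
                bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in GB for a decoder-only transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an FP16/BF16 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * n_tokens * bytes_per_value * batch_size)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 8B-class shape; swap in the real config of your model.
    for tokens in (8_000, 128_000, 1_000_000):
        size = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                           n_tokens=tokens)
        print(f"{tokens:>9} tokens: ~{size:.1f}GB of KV cache")
```

Because the cache scales linearly with tokens and batch size, a context window that looks free on the spec sheet can dominate the memory budget once it is actually filled.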

Quantization formats at a glance

| Format | Where | Characteristics |
| --- | --- | --- |
| GGUF | llama.cpp / Ollama | CPU- and consumer-GPU-friendly, the standard for local execution |
| AWQ / GPTQ | vLLM / TGI serving | 4-bit post-training quantization, accuracy-preserving |
| bitsandbytes (NF4) | training / QLoRA | 4-bit loading + fine-tuning |
| FP8 | H100/B200 serving | balances speed and accuracy on the latest GPUs |
| QAT | training before deployment | quantization-aware training minimizes loss (e.g., Gemma 3) |
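For readers who want a feel for what "4-bit" means underneath these formats, the following is a deliberately simplified sketch of symmetric absmax quantization on one block of weights. It is not the algorithm used by GGUF, AWQ, GPTQ, or NF4, all of which add refinements; the function names are made up for this example.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integer codes plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    # Round to the nearest code and clamp into [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.45, -0.88]
    codes, scale = quantize_block(weights)
    restored = dequantize_block(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("codes:", codes)
    print(f"scale: {scale:.4f}, worst-case error: {worst:.4f}")
```

Each weight is stored as a small integer and the block shares a single scale, which is where the memory saving comes from. The rounding error is bounded by half the scale, and the production formats differ mainly in how they choose blocks, scales, and code values to shrink that error.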

A. Closed frontier (commercial APIs)

Parameters and architecture are undisclosed. You use them via API without worrying about memory, but your data leaves your environment and you pay per token.

| Model | Provider | Context | Modality | Price (input / output, per 1M tokens) | Primary use case |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 | OpenAI | 1M (API) | text · vision · computer use | $5 / $30 | top-tier general purpose · agents · coding |
| GPT-5.5 Pro | OpenAI | 1M | multimodal | (higher tier) | high-accuracy, high-difficulty tasks |
| GPT-5.5 Instant | OpenAI |  | text | (ChatGPT default) | fast conversation · everyday tasks |
| Claude Opus 4.8 | Anthropic | 1M | text · vision | $5 / $25 | complex reasoning · long-horizon agentic coding |
| Claude Sonnet 4.6 | Anthropic | 1M | text · vision | $3 / $15 | best value top-tier general purpose |
| Claude Haiku 4.5 | Anthropic | 200K | text · vision | $1 / $5 | high-speed, high-volume classification · routing · code review |
| Gemini 3.1 Pro | Google | 1M | multimodal | (premium) | reasoning · complex agentic workflows |
| Gemini 3.5 Flash | Google | 1M | multimodal | (mid-priced) | agents · coding, balanced |
| Gemini 3.1 Flash-Lite | Google | high capacity | multimodal | (lowest price) | high-volume · low-latency · cost-sensitive |
| Grok 4.3 | xAI | 1M–2M | text · vision · video | $1.25 / $2.50 | ultra-long-context · real-time · value frontier |
| Qwen3-Max | Alibaba | high capacity | text | (API) | 1T+ class closed flagship |
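Since these models are billed per token, a quick cost estimate is often the deciding factor. The sketch below uses the per-1M-token prices listed in the table; the workload numbers (100,000 requests per month, 2,000 input and 500 output tokens each) are invented for illustration, and the function names are not from any SDK.

```python
def request_cost_usd(input_tokens, output_tokens, price_in, price_out):
    """Cost of one request, given prices in USD per 1M tokens."""
    return (input_tokens / 1_000_000 * price_in
            + output_tokens / 1_000_000 * price_out)


def monthly_cost_usd(requests, input_tokens, output_tokens,
                     price_in, price_out):
    """Cost of a month of identical requests."""
    return requests * request_cost_usd(input_tokens, output_tokens,
                                       price_in, price_out)


if __name__ == "__main__":
    # (input price, output price) per 1M tokens, copied from the table above.
    prices = {
        "GPT-5.5": (5.0, 30.0),
        "Claude Sonnet 4.6": (3.0, 15.0),
        "Claude Haiku 4.5": (1.0, 5.0),
    }
    for model, (p_in, p_out) in prices.items():
        total = monthly_cost_usd(100_000, 2_000, 500, p_in, p_out)
        print(f"{model}: ~${total:,.0f} per month")
```

Output tokens are priced several times higher than input tokens in every row that lists a price, so workloads with long generations (including reasoning modes) tend to cost more than the input volume alone suggests.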

Anthropic released Mythos 5 and Fable 5 (higher tier, 1M context) on June 9 but temporarily suspended access on June 12. The lineup changes frequently, so check the official pricing and model pages.

B. Open-weight general purpose (can be self-hosted)

VRAM figures are approximate at 4-bit quantization. For MoE, both total and active parameters are shown.

| Model | Provider | Parameters (active) | Context | License | 4-bit VRAM (approx.) | Primary use case |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | Meta | 109B total / 17B | 10M | Llama Community | ~60GB (1× H100) | ultra-long-context search · RAG |
| Llama 4 Maverick | Meta | 400B total / 17B | 1M | Llama Community | ~220GB (multi-GPU) | multimodal · long-form generation |
| Qwen3.6 (dense 27B) | Alibaba | 27B | 1M | Apache 2.0 | ~16GB (24GB GPU) | latest flagship dense |
| Qwen3.5 | Alibaba | 397B total / 17B | 262K | Apache 2.0 | ~220GB (multi-GPU) | large MoE · multimodal agents |
| Qwen3 32B | Alibaba | 32B | 128K+ | Apache 2.0 | 18–20GB | high-performance general-purpose dense |
| Qwen3 235B-A22B | Alibaba | 235B total / 22B | 128K+ | Apache 2.0 | ~130GB (multi-GPU) | top-tier open MoE |
| DeepSeek-V4 (preview) | DeepSeek | large MoE | 1M | MIT | multi-node | ultra-long-context production |
| DeepSeek-V3 | DeepSeek | 671B total / 37B | 128K | MIT | 380–400GB (multi-node) | top-tier general-purpose open MoE |
| Mistral Large 3 | Mistral | 675B total / 41B | high capacity | Apache 2.0 | multi-node | largest-class open MoE |
| Mistral Small 4 | Mistral | ~24B class | high capacity | Apache 2.0 | 14–15GB | single model unifying reasoning + vision + coding |
| Gemma 3 27B | Google | 27B | 128K | Gemma | ~16GB (QAT, 24GB GPU) | open multimodal general purpose |
| Phi-4 | Microsoft | 14B |  | MIT | 9–10GB | small but strong reasoning SLM |
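The total-versus-active split in the table can be read as two separate budgets. The sketch below sizes weight memory from total parameters and per-token compute from active parameters, using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. That rule ignores attention cost and memory bandwidth, so it is an approximation, and the function name is made up for this example.

```python
def moe_profile(total_b, active_b, bytes_per_param=0.5):
    """Return (weight memory in GB, approximate GFLOPs per token).

    total_b / active_b: parameter counts in billions.
    bytes_per_param=0.5 assumes 4-bit weights, with no overhead added.
    """
    weights_gb = total_b * bytes_per_param      # every expert must be loaded
    gflops_per_token = 2 * active_b             # only active experts compute
    return weights_gb, gflops_per_token


if __name__ == "__main__":
    moe_mem, moe_flops = moe_profile(671, 37)       # DeepSeek-V3 figures
    dense_mem, dense_flops = moe_profile(671, 671)  # hypothetical dense peer
    print(f"MoE:   ~{moe_mem:.0f}GB weights, ~{moe_flops} GFLOPs/token")
    print(f"Dense: ~{dense_mem:.0f}GB weights, ~{dense_flops} GFLOPs/token")
    print(f"Compute ratio: ~{dense_flops / moe_flops:.0f}x")
```

The memory column is identical for both rows while the compute column differs by more than an order of magnitude, which is why an MoE model can need multi-node hardware to load yet still respond quickly once it is running.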

C. Reasoning-specialized

These "think" longer before answering, making them strong in math, coding, and logic. They use many tokens, so cost and latency are high.

| Model | Provider | Open | Characteristics |
| --- | --- | --- | --- |
| GPT-5.5 Thinking | OpenAI |  | frontier reasoning, strong agentic coding |
| Claude (extended thinking) | Anthropic |  | adaptive thinking mode for Opus/Sonnet |
| Gemini 3.1 Pro (adaptive thinking) | Google |  | automatically adjusts amount of thinking |
| DeepSeek-R1 | DeepSeek | ✓ (MIT) | the flagship open reasoning model, distilled versions (1.5B–70B) available |
| Qwen3 Thinking | Alibaba |  | can switch between thinking / non-thinking |
| Phi-4-reasoning(-plus) | Microsoft | ✓ (MIT) | 14B small reasoning, good value |
| Magistral | Mistral |  | (integrated into Small 4) reasoning-specialized |

D. Specialized — code · multimodal

| Model | Category | Open | Notes |
| --- | --- | --- | --- |
| Qwen3-Coder 480B-A35B | code |  | large coding-only MoE |
| GPT-5.5-Codex | code |  | agentic coding (400K ctx) |
| Devstral / Codestral | code |  | Mistral coding family |
| DeepSeek-Coder | code |  | open coding |
| Llama 4 (Scout/Maverick) | multimodal |  | native multimodal (image + text) |
| Qwen-VL / Pixtral | vision |  | image understanding |
| Gemma 3 (4B and up) | vision |  | lightweight multimodal |
| Phi-4-multimodal / vision-15B | vision · audio | ✓ (MIT) | small multimodal reasoning |

E. Embedding models

Embedding models are not generative; they convert text into vectors. The number of vector dimensions drives storage and search cost.

| Model | Provider | Open | Characteristics · use case |
| --- | --- | --- | --- |
| text-embedding-3-large / small | OpenAI |  | general-purpose default, good-value small variant |
| Cohere embed-v4 | Cohere |  | multilingual · hybrid (dense + sparse) |
| Voyage-3-large / 3.5 | Voyage |  | top-tier retrieval quality (high latency) |
| BGE-M3 | BAAI |  | open · multilingual · the self-hosting standard |
| Qwen3-Embedding | Alibaba |  | open, good value |
| Jina v5 | Jina | ✓/✕ | excellent accuracy/cost for text RAG |
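How much the dimension count matters is easy to estimate. The sketch below computes raw vector storage for an index, assuming float32 values (4 bytes each) and no index overhead. The 3072 and 1536 dimensions are the default output sizes of text-embedding-3-large and text-embedding-3-small; the corpus size of one million chunks is an invented example.

```python
def index_size_gb(n_vectors, dims, bytes_per_value=4):
    """Raw storage for an embedding index in GB (float32 by default).

    Real vector databases add overhead for graph or inverted-list
    structures, so treat this as a lower bound.
    """
    return n_vectors * dims * bytes_per_value / 1e9


if __name__ == "__main__":
    chunks = 1_000_000  # hypothetical corpus size
    for name, dims in (("text-embedding-3-large", 3072),
                       ("text-embedding-3-small", 1536)):
        print(f"{name}: ~{index_size_gb(chunks, dims):.1f}GB")
```

Halving the dimension halves both storage and the arithmetic per similarity comparison, which is the trade-off to weigh against retrieval quality when picking from the table.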

F. Small / edge models (laptop · on-device)

These can run on small GPUs, CPUs, or on-device. At 4-bit, most run even on a laptop.

| Model | Parameters | 4-bit VRAM | Where it runs |
| --- | --- | --- | --- |
| Gemma 3 270M | 0.27B | < 0.5GB | CPU · mobile |
| Qwen3 0.6B / 1.7B | 0.6 / 1.7B | ~0.5 / ~1.5GB | laptop · iGPU |
| Llama 3.2 1B / 3B | 1 / 3B | ~1 / ~2.5GB | laptop · edge |
| Gemma 3 1B / 4B | 1 / 4B | ~1 / ~3GB | laptop (4B supports vision) |
| Ministral 3B | 3B | ~2GB | edge · on-device |
| Phi-4-mini | 3.8B | ~2.5GB | laptop, built-in function calling |

G. Korean-specialized (domestic models)

These models matter from the perspective of Korean-language quality, domestic regulation, and sovereign AI. Many are released as open weights, so they can be self-hosted.

| Model | Provider | Parameters (active) | Open | 4-bit VRAM (approx.) | Characteristics · use case |
| --- | --- | --- | --- | --- | --- |
| EXAONE 4.0 | LG AI | 30B | ✓ (HF) | ~18GB | competitive on global benchmarks, general purpose |
| EXAONE 4.5 | LG AI | (multimodal) |  | variable | text · image multimodal reasoning |
| HyperCLOVA X (Seed 32B Think) | Naver | 32B |  | ~19GB | Korean search · reasoning |
| HyperCLOVA X 8B Omni | Naver | 8B |  | ~6GB | lightweight multimodal |
| Solar Pro 2 | Upstage | 31B | partial | ~18GB | Korea's first global-class frontier |
| Kanana-2 | Kakao | 70B total / MoE (8 active) |  | ~40GB | agents · long-form reasoning, 128K, Korean MMLU 89% |
| Mi:dm 2.0 Base / Mini | KT | 11.5B / 2.3B | ✓ (MIT) | ~7GB / ~1.5GB | Korea-specialized general purpose / lightweight |
| Mi:dm K 2.5 Pro | KT | 32B |  | ~19GB | enhanced knowledge · reasoning, 128K |
| A.X 4.0 / 3.1 | SKT | 34B |  | ~20GB | from-scratch Korean, includes lightweight variant |
| A.X K1 | SKT | 519B |  | multi-node | sovereign ultra-large model |

"What can I run on this GPU?" (at 4-bit)

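One way to answer this question in code is a simple lookup against the 4-bit column of the memory table earlier in this article. The sketch below hard-codes the upper end of each 4-bit range from that table; the function name and the optional headroom parameter are assumptions added for illustration.

```python
# (upper-end 4-bit footprint in GB, parameter class), from the memory table.
FOUR_BIT_FOOTPRINT_GB = [
    (1.0, "1B"),
    (2.5, "3B"),
    (6.0, "7-8B"),
    (10.0, "13-14B"),
    (20.0, "27-32B"),
    (43.0, "70B"),
    (67.0, "~110B (MoE total)"),
]


def largest_class_that_fits(vram_gb, headroom_gb=0.0):
    """Largest parameter class whose 4-bit footprint fits in the given VRAM.

    headroom_gb reserves memory for long context or other processes.
    Returns None when even the smallest class does not fit.
    """
    usable = vram_gb - headroom_gb
    best = None
    for footprint, label in FOUR_BIT_FOOTPRINT_GB:
        if footprint <= usable:
            best = label
    return best


if __name__ == "__main__":
    for gpu in (8, 12, 16, 24, 48, 80):
        print(f"{gpu}GB GPU: up to {largest_class_that_fits(gpu)} at 4-bit")
```

Anything above the ~110B class lands in multi-GPU or multi-node territory in the table, so the lookup deliberately stops there. Raising `headroom_gb` is a cheap way to model long-context workloads that need a larger KV cache.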

Model selection guide


In short, the choice comes down to three paths. (1) If data can't leave your environment, self-host open weights — here GPU memory determines how big a model you can run. (2) If Korean is essential, look first at domestic models. (3) If you want minimal ops burden and top performance, use a closed API, which further divides into performance, value, and low-cost/high-volume.
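The three paths can also be written down as a small decision function, which some teams find handy as a checklist. This is only a sketch of the ordering described above (data policy first, then Korean, then GPU memory); the function name, parameters, and returned strings are invented for illustration.

```python
def choose_path(data_must_stay_inhouse, korean_required, gpu_vram_gb=0):
    """Return a short recommendation following the article's three paths."""
    if data_must_stay_inhouse:
        # Path 1: self-host open weights; GPU memory caps the model size.
        pool = ("Korean open-weight models" if korean_required
                else "open-weight models")
        return f"self-host {pool} that fit in {gpu_vram_gb}GB of VRAM"
    if korean_required:
        # Path 2: Korean quality is essential, so start with domestic models.
        return "evaluate domestic (Korean) models first"
    # Path 3: minimal ops burden, choose a tier of closed API.
    return "closed API: pick a performance, value, or low-cost/high-volume tier"


if __name__ == "__main__":
    print(choose_path(True, False, gpu_vram_gb=24))
    print(choose_path(False, True))
    print(choose_path(False, False))
```

The final step, matching the shortlisted models to a use case, is left to the tables in sections A through G.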

Wrapping up

The model landscape of 2026 takes the shape of "closed models leading on performance, open weights catching up fast, and domestic models establishing themselves as sovereign AI." The starting point for a practical choice is always the same — narrow down in the order data policy → need for Korean → the GPU (memory) you have → use case, and the candidates naturally shrink.

This table is a living document. Models change every month, so just before adoption, check each provider's official model and pricing pages once more for the latest figures. For concrete ways to reduce memory through quantization, see the LoRA, QLoRA, and DoRA article.