AI Creative Writing Arena Leaderboard 2026


Core Insight

Creative writing is where raw intelligence bows to taste, restraint, and the courage to leave the right things unsaid.

Three years of asking AI to tell me stories. Not summaries, not outlines—actual fiction. The kind where a character walks into a room and you feel the temperature change. Over those years, I've watched this leaderboard transform from a curiosity into a genuine barometer of literary capability. February 2026 brought the most interesting shift yet: a brand-new model that arrived quietly, climbed fast, and narrowed a gap that seemed permanent just weeks ago. Here's the full picture—sixty models ranked, analyzed, and put in context by someone who works with them every day.

The Creative Writing Leaderboard

Code has syntax. Math has proofs. But creative writing has voice—rhythm, surprise, emotional resonance. This is the Creative Writing Arena, the most demanding benchmark in AI evaluation, where sixty models are ranked by how well they tell stories that actually move people. Here's where everything stands as of February 2026.

Rank | Model | Score | Votes | Organization
🥇 | Gemini 3 Pro | 1490 | 4,861 | Google
🥈 | Claude Opus 4 6 | 1478 | 347 | Anthropic
🥉 | Claude Opus 4 5 20251101 Thinking 32k | 1459 | 3,667 | Anthropic
#4 | Claude Opus 4 5 20251101 | 1457 | 4,382 | Anthropic
#5 | Gemini 3 Flash | 1456 | 3,678 | Google
#6 | Gemini 2.5 Pro | 1450 | 12,564 | Google
#7 | Claude Sonnet 4 5 20250929 | 1447 | 5,769 | Anthropic
#8 | Gemini 3 Flash (thinking Minimal) | 1447 | 2,253 | Google
#9 | Claude Opus 4 1 20250805 Thinking 16k | 1445 | 6,651 | Anthropic
#10 | Claude Sonnet 4 5 20250929 Thinking 32k | 1442 | 6,015 | Anthropic
#11 | Claude Opus 4 1 20250805 | 1440 | 9,807 | Anthropic
#12 | Gpt 4.5 Preview 2025 02 27 | 1438 | 2,618 | OpenAI
#13 | Grok 4.1 Thinking | 1434 | 4,819 | xAI
#14 | Gpt 5.1 High | 1434 | 4,213 | OpenAI
#15 | Claude Opus 4 20250514 Thinking 16k | 1428 | 4,750 | Anthropic
#16 | Grok 4.1 | 1427 | 5,119 | xAI
#17 | Chatgpt 4o Latest 20250326 | 1422 | 11,146 | OpenAI
#18 | Ernie 5.0 Preview 1203 | 1420 | 1,477 | Baidu
#19 | Claude Opus 4 20250514 | 1419 | 5,794 | Anthropic
#20 | Ernie 5.0 0110 | 1418 | 1,622 | Baidu
#21 | Kimi K2.5 Thinking | 1418 | 1,059 | Moonshot
#22 | Deepseek V3.1 Terminus | 1411 | 458 | DeepSeek
#23 | Gpt 5.1 | 1411 | 4,512 | OpenAI
#24 | Ernie 5.0 Preview 1022 | 1411 | 662 | Baidu
#25 | Deepseek V3.1 Thinking | 1410 | 1,720 | DeepSeek
#26 | Grok 4 1 Fast Reasoning | 1404 | 3,798 | xAI
#27 | Glm 4.7 | 1403 | 1,797 | Z.ai
#28 | Deepseek V3.2 Exp | 1403 | 1,500 | DeepSeek
#29 | Gpt 4.1 2025 04 14 | 1402 | 6,858 | OpenAI
#30 | Glm 4.6 | 1402 | 4,764 | Z.ai
#31 | Kimi K2.5 Instant | 1402 | 427 | Moonshot
#32 | Grok 3 Preview 02 24 | 1402 | 4,972 | xAI
#33 | Deepseek V3.2 | 1399 | 3,529 | DeepSeek
#34 | Gemini 2.5 Flash | 1398 | 12,294 | Google
#35 | Gpt 5.2 | 1398 | 1,679 | OpenAI
#36 | Grok 4 0709 | 1397 | 5,559 | xAI
#37 | Qwen3 Max Preview | 1396 | 3,713 | Alibaba
#38 | Claude Sonnet 4 20250514 Thinking 32k | 1396 | 4,582 | Anthropic
#39 | Deepseek V3.1 | 1395 | 2,082 | DeepSeek
#40 | Qwen3 Max 2025 09 23 | 1395 | 1,154 | Alibaba
#41 | Claude 3 7 Sonnet 20250219 Thinking 32k | 1395 | 5,472 | Anthropic
#42 | Deepseek V3.2 Exp Thinking | 1395 | 1,154 | DeepSeek
#43 | Gpt 5 Chat | 1394 | 4,010 | OpenAI
#44 | Gpt 5.2 High | 1394 | 2,133 | OpenAI
#45 | Kimi K2 Thinking Turbo | 1393 | 4,520 | Moonshot
#46 | Deepseek V3 0324 | 1391 | 6,338 | DeepSeek
#47 | Deepseek V3.2 Thinking | 1390 | 3,113 | DeepSeek
#48 | Deepseek R1 0528 | 1388 | 2,660 | DeepSeek
#49 | Claude Sonnet 4 20250514 | 1385 | 5,328 | Anthropic
#50 | Qwen3 235b A22b Instruct 2507 | 1384 | 9,102 | Alibaba
#51 | O3 2025 04 16 | 1384 | 8,014 | OpenAI
#52 | O1 2024 12 17 | 1383 | 4,646 | OpenAI
#53 | Hunyuan T1 20250711 | 1382 | 642 | Tencent
#54 | Grok 4 Fast Chat | 1382 | 995 | xAI
#55 | Gemini 2.5 Flash Preview 09 2025 | 1382 | 4,285 | Google
#56 | Mistral Medium 2508 | 1382 | 8,527 | Mistral
#57 | Claude Haiku 4 5 20251001 | 1382 | 5,754 | Anthropic
#58 | Deepseek V3.1 Terminus Thinking | 1381 | 446 | DeepSeek
#59 | Grok 4 Fast Reasoning | 1380 | 2,372 | xAI
#60 | Gpt 5 High | 1379 | 4,330 | OpenAI

The February Disruption

When I pulled the latest data, one entry stopped me: Claude Opus 4.6 sitting at number two. Not because an Anthropic model placing high is unusual; they've been doing that consistently. But because this model landed in second position with only 347 votes behind it, the fewest of any ranked model. That kind of early consensus is rare, though with a sample that small the score could still move. It suggests the first wave of testers, the obsessives who run identical prompts through every new release within hours of launch, found something genuinely different in its creative output.

The real story, though, is the gap. In January, the distance between first and second place was a comfortable twenty-five points. Now it's twelve (1490 versus 1478). Gemini 3 Pro still holds gold, and it earned that position honestly. But the lead has roughly halved in a single update cycle. If you're Google, that trend demands attention. If you're Anthropic, it's confirmation that your approach to creative AI training is converging on something powerful.

Meanwhile, the models just below the top two have reshuffled significantly. Claude Opus 4.5's thinking variant moved up to third, pushing the standard Opus 4.5 to fourth and Gemini 3 Flash down to fifth. Flash held third just last month. The podium isn't only changing hands at the summit—it's unstable throughout. And instability, in my experience, precedes breakthroughs.

Commanding Heights

Gemini 3 Pro remains the model I reach for when I don't know what I need yet. What keeps it at number one is range: ask it for Hemingway and it delivers spare, muscular prose. Ask for experimental postmodern fiction and it shifts register without losing coherence. Victorian epistolary, hardboiled noir, magic realism, children's literature—Gemini handles these transitions in a way that suggests genuine comprehension of form, not surface mimicry. Google places six models in the top sixty, with Gemini 3 Flash at five and Gemini 2.5 Pro at six filling out a strong trio at the top.

Claude is a different animal entirely. If Gemini is range, Claude is depth. Anthropic's models have always excelled at the subtleties hardest to teach a machine: when to let silence carry a scene, when a sentence should break instead of continuing, when what a character doesn't say reveals more than what they do. Opus 4.6 pushes this further. In my testing, it produced dialogue that felt genuinely inhabited. Characters weren't delivering lines; they were thinking, hesitating, choosing words the way real people do when something important hangs in the balance. Anthropic now has thirteen models in the top sixty, more than any other organization, with seven placed in the top eleven. Whatever their approach to training creative capability, it's working across their entire product line.

Here's an observation that doesn't get enough attention: extended reasoning—the "thinking" mode—doesn't reliably improve creative writing. The pattern is inconsistent and deeply revealing.

For Claude Opus models, thinking variants tend to rank slightly higher: Opus 4.5 Thinking at three versus standard at four, Opus 4.1 Thinking at nine versus standard at eleven. Grok 4.1 Thinking outperforms its standard variant by three positions. But flip to other architectures and the pattern reverses—sometimes dramatically. DeepSeek v3.2-exp standard sits at twenty-eight while its thinking variant falls to forty-two. DeepSeek v3.1-terminus standard is at twenty-two; its thinking counterpart drops to fifty-eight—a thirty-six position gap. GPT-5.2 standard beats GPT-5.2-high.
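The comparison above is simple rank arithmetic, and a short sketch makes the asymmetry easier to see. The ranks are copied from the table; the `rank_shift` helper and the dictionary layout are illustrative choices, not part of the leaderboard. GPT-5.2 "High" is treated here as the extended-reasoning counterpart of the standard model, which is an assumption about what that label means.

```python
# Ranks copied from the leaderboard table: (standard, thinking-or-high variant).
pairs = {
    "Claude Opus 4.5": (4, 3),
    "Claude Opus 4.1": (11, 9),
    "Grok 4.1": (16, 13),
    "DeepSeek V3.2 Exp": (28, 42),
    "DeepSeek V3.1 Terminus": (22, 58),
    "GPT-5.2": (35, 44),  # assumption: "High" stands in for the thinking variant
}


def rank_shift(standard: int, thinking: int) -> int:
    """Positive means the thinking variant ranks higher (a smaller rank number)."""
    return standard - thinking


shifts = {name: rank_shift(s, t) for name, (s, t) in pairs.items()}

# List the pairs from biggest gain to biggest loss.
for name, shift in sorted(shifts.items(), key=lambda kv: kv[1], reverse=True):
    direction = "helps" if shift > 0 else "hurts"
    print(f"{name:24s} {shift:+3d}  thinking {direction}")
```

The gains are all three positions or fewer, while the losses run from nine to thirty-six positions, which is the lopsidedness the paragraph describes.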

What this tells me is important: creative writing isn't primarily a reasoning problem. It's an aesthetic one. For models that already possess strong literary instincts, extended thinking can refine those instincts—like a careful editor reviewing a solid first draft. But for models whose creative strength is more instinctive and pattern-driven, forcing deliberation actually polishes away the rough edges that make prose feel alive. Sometimes the first response captures something that additional computation smooths into mediocrity. If you use thinking-enabled models for creative work, test both modes. The assumption that more reasoning equals better output does not hold here, and understanding when to turn thinking off may be more valuable than knowing when to turn it on.

The Rising Tide

Below the top tier, the story is proliferation and diversity—and it's arguably more important than the race for number one.

DeepSeek places ten models in the top sixty, making it the third most-represented organization after Anthropic and OpenAI. Their v3.1 and v3.2 variants span from twenty-two to fifty-eight, covering a range of creative capability tiers. As an open-weight project, DeepSeek represents something fundamentally different from the proprietary leaders: these models can be downloaded, hosted locally, and fine-tuned for specific creative tasks. If you're building an AI writing tool or integrating creative capabilities into a product pipeline, DeepSeek offers flexibility that API-only models can't match.

The broader picture is even more striking. Between DeepSeek, Baidu, Moonshot, Alibaba, Z.ai, and Tencent, Chinese AI labs now account for twenty-two of sixty ranked models—over a third of the entire leaderboard. Moonshot's Kimi K2.5 debuted with its thinking variant at twenty-one, bringing the company to three placements. Baidu holds three positions with its ERNIE 5.0 lineup. Alibaba's Qwen3 has three variants ranked. Z.ai's GLM-4.7 sits at twenty-seven. This isn't convergence—it's genuine diversity. Different training data, different cultural contexts, and different literary traditions produce models with distinct creative sensibilities. I've seen ERNIE craft metaphors that wouldn't occur to Western-trained models, and GLM handle narrative pacing in ways that feel fresh precisely because the literary DNA is different. The global creative AI ecosystem is richer for it.
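The organization counts quoted in this section can be checked with a quick tally. This is a minimal sketch: the per-organization counts were tallied by hand from the table, and the grouping of labs in `chinese_labs` follows the list in the paragraph above.

```python
# Models per organization, tallied from the leaderboard table.
models_per_org = {
    "Anthropic": 13,
    "OpenAI": 11,
    "DeepSeek": 10,
    "xAI": 7,
    "Google": 6,
    "Baidu": 3,
    "Moonshot": 3,
    "Alibaba": 3,
    "Z.ai": 2,
    "Tencent": 1,
    "Mistral": 1,
}

# Grouping taken from the paragraph above.
chinese_labs = {"DeepSeek", "Baidu", "Moonshot", "Alibaba", "Z.ai", "Tencent"}

total = sum(models_per_org.values())
chinese_total = sum(n for org, n in models_per_org.items() if org in chinese_labs)
share_pct = chinese_total / total * 100

print(f"{chinese_total} of {total} models ({share_pct:.1f}%) come from Chinese labs")
```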

OpenAI holds eleven models, though their creative story has an interesting subplot. GPT-4.5-preview at twelve sits ahead of both GPT-5.1-high at fourteen and GPT-5.1 standard at twenty-three. Sometimes a model optimized for nuance outperforms its technically superior successor on tasks that prize subtlety over raw capability. ChatGPT-4o-latest at seventeen reinforces the point: conversation-optimized models carry an inherent advantage in creative writing because storytelling is fundamentally conversational. You're not computing an answer—you're sustaining a voice.

Grok has carved a genuine creative identity with seven models ranked. Where Claude excels at emotional intelligence, Grok brings emotional honesty. The humor is sharper, the metaphors bolder, the characters less polished and more alive. When I want writing that takes risks—fiction that might make a reader uncomfortable in a productive way—Grok is where I start. It's the model least afraid of its own voice, and in creative writing, fearlessness matters. Mistral's medium-2508 at fifty-six represents Europe's presence on the board. Tencent's Hunyuan at fifty-three adds yet another voice from China. The field has never been wider.

Where This All Goes

I'll tell you what I think happens next, because the trends in this data point somewhere specific.

The gap keeps compressing. The spread between first and sixtieth place is 111 points (1490 versus 1379), roughly 7.4 percent of the top score. That is tight by historical standards, and narrowing with every update. We are approaching a threshold where the meaningful differences between models shift from raw quality to creative personality. The question stops being "which model writes best" and becomes "which model's voice fits this particular project." That's a fundamental change in how writers and creative teams should think about AI selection.
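For anyone who wants to check the spread figure, here is the arithmetic as a small sketch. The two scores come from the table. Measuring the spread against the top score is an assumption about how the percentage was derived; measured against the bottom score it comes out a little higher.

```python
top_score = 1490     # Gemini 3 Pro, rank 1
bottom_score = 1379  # Gpt 5 High, rank 60

spread_points = top_score - bottom_score

# Assumption: the percentage is taken relative to the top score.
spread_vs_top = spread_points / top_score * 100
# Alternative denominator, shown for comparison.
spread_vs_bottom = spread_points / bottom_score * 100

print(f"spread: {spread_points} points")
print(f"relative to top score:    {spread_vs_top:.1f}%")
print(f"relative to bottom score: {spread_vs_bottom:.1f}%")
```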

Specialized creative models are inevitable. The general-purpose architecture has pushed creative writing quality remarkably far, but the next real leap will come from models explicitly tuned for narrative structure, character consistency, dialogue authenticity, or poetic form. I expect at least one major lab to ship a creative-specialist model by the second half of this year—one that commits entirely to literary capability rather than trying to solve math, write code, and tell stories simultaneously. When that happens, it will reset the top of this leaderboard overnight.

Open-weight models will close the remaining gap. DeepSeek's ten-model presence is the leading indicator. As open alternatives approach parity with proprietary systems in creative benchmarks, the economics of AI-assisted writing shift dramatically. Writers, studios, and publishers gain access to top-tier creative AI without per-token pricing, changing adoption curves and the fundamental relationship between human writers and AI tools.

The real frontier is orchestration, not isolation. The most sophisticated creative work I've seen recently doesn't use a single model—it uses three or four in sequence. Gemini for initial ideation and stylistic exploration. Claude for emotional refinement and dialogue polish. DeepSeek or Qwen for alternative cultural perspectives. Grok when the draft needs edge. The future isn't about crowning one model king. It's about learning to conduct an ensemble, matching each model's creative personality to the right moment in the writing process. The writers who figure this out first will produce work that feels unlike anything a single model—or a single human—could achieve alone.
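One way to picture that ensemble is as a plain sequential pipeline, where each stage hands its draft to the next. The sketch below is only an illustration under stated assumptions: `Stage`, `run_pipeline`, and `stub` are hypothetical names, and the stubs stand in for real model calls, which would need each provider's own client library.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is any function that takes a prompt and returns text.
ModelFn = Callable[[str], str]


@dataclass
class Stage:
    name: str         # label for the stage, e.g. "ideation"
    instruction: str  # what this stage should do with the draft
    model: ModelFn    # hypothetical client call; swap in a real API wrapper


def run_pipeline(brief: str, stages: List[Stage]) -> str:
    """Pass the draft through each stage in order; each sees the previous output."""
    draft = brief
    for stage in stages:
        prompt = f"{stage.instruction}\n\n{draft}"
        draft = stage.model(prompt)
    return draft


def stub(label: str) -> ModelFn:
    """Stand-in for a real model call: tags the draft so the flow is visible."""
    def call(prompt: str) -> str:
        # Drop the instruction and keep only the draft that follows it.
        body = prompt.split("\n\n", 1)[1]
        return f"{body} [{label}]"
    return call


stages = [
    Stage("ideation", "Explore styles and premises for:", stub("gemini")),
    Stage("refinement", "Deepen the emotion and polish the dialogue in:", stub("claude")),
    Stage("perspective", "Offer an alternative cultural reading of:", stub("deepseek")),
    Stage("edge", "Sharpen the voice and take more risks in:", stub("grok")),
]

result = run_pipeline("A lighthouse keeper receives a letter.", stages)
print(result)
```

Keeping each stage as an ordinary function makes it cheap to reorder stages or drop one and compare the results, which is most of what conducting an ensemble means in practice.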

Choosing Your Creative Partner

After years of writing alongside these models, here's what I've learned about matching the right tool to the right task:

Versatility

Gemini 3 Pro adapts to any genre, any form, any tone. When the brief is undefined or the project demands range, start here.

Emotional Depth

Claude Opus 4.6 writes with restraint and genuine feeling. For dialogue, character work, and prose where what's left unsaid matters most.

Speed & Quality

Gemini 3 Flash proves fast doesn't mean worse. For iterative drafting, high-volume projects, and rapid prototyping of narrative ideas.

Personality

Grok 4.1 takes creative risks that other models won't. For fiction that needs edge, humor, and characters who feel alive rather than assembled.

Enterprise

GPT-4.5 / GPT-5.1 deliver polished, reliable output that integrates into existing workflows. When consistency and brand safety matter as much as creativity.

Open Weights

DeepSeek / Qwen: host it yourself, fine-tune for your domain. When you need creative AI at scale without per-token costs, the economics are unbeatable.

There is no single best creative AI. There are evolving voices with different strengths, and the real power lies in knowing which voice serves which moment in the story you're trying to tell.


Data Source: Rankings from Arena AI Creative Writing Leaderboard, February 6, 2026.
