Thirty-three outputs from fourteen-plus distinct language models, eleven labs, and three different prompt regimes. What they returned, what they refused, and what their thinking gave away.
Asked to be maximally creative, most models converged on the same narrow band.
Median originality across 24 scored entries was 2 out of 10. The few stories that escaped the floor weren't produced by smarter models - they came from different architectures (multi-agent swarms, draft-critique-revise pipelines), different prompts, or models accessed in their native frontend rather than routed through a wrapper. Each bubble below is a unique (creativity, originality) coordinate; size scales with how many stories share that score. Refusals, loop-collapse failures, and the bare-prompt default-literary outputs sit in their own sections.
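For readers who want the aggregation in code form: a minimal sketch of the bubble layout, assuming the scored entries live in a list of (creativity, originality) pairs. The sample values are placeholders, not the real scores.

```python
# Minimal sketch of the bubble aggregation: identical (creativity,
# originality) pairs collapse into one bubble whose area scales with count.
# `scores` is placeholder data, not the actual 24 scored entries.
from collections import Counter
import matplotlib.pyplot as plt

scores = [(3, 2), (3, 2), (3, 2), (5, 2), (8, 6), (2, 1)]

counts = Counter(scores)
xs = [c for c, _ in counts]
ys = [o for _, o in counts]
sizes = [n * 120 for n in counts.values()]  # bubble area grows with ties

plt.scatter(xs, ys, s=sizes, alpha=0.5)
plt.xlabel("creativity (0-10)")
plt.ylabel("originality (0-10)")
plt.show()
```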
The exercise was designed to test what happens when you ask different models to be maximally creative on the same canvas. Three findings dominate the results. First: models with the longest, most elaborate visible reasoning traces produced the least original premises. Extensive concept-iteration appears to drag outputs toward the training-data center of mass rather than away from it. Second: most "creative" outputs are recombinations of three to five extremely well-known prior works - the illusion of novelty is almost always a refraction of The Midnight Library, Eternal Sunshine, Get Out, House of Leaves, or Solaris through a fresh metaphor. Third: a multi-agent architecture (Kimi K2.6 Agent Swarm) produced the only entry that operated at the level of finished literary fiction rather than premise-pitch. The structural advantage of draft-critique-revise pipelines may matter more than raw model capability for genuinely creative output.
Across 24 successful outputs from 14+ distinct models at 11 different labs, protagonist names cluster in a pool of about 20 distinct names. The Mara / Mira / Maren / Maud / Margaret / Margarethe family alone accounts for a third of all named protagonists.
Four GPT-5.x variants given the bare prompt "write a story" with no genre. Same prompt, four different model versions, near-identical output - a direct look at what a model "wants" to write when constraints are off; the shared profile is detailed below.
A different test: given several existing creative outputs as inputs, can the model synthesize a superior new one - or does it just defer to the strongest input? Five combine attempts. Four out of five collapsed to "pick the best input and re-render it." Only one (Kimi K2.6 on the regret/identity batch) produced genuine fusion. Scoring on this section uses a different rubric: synthesis (was real combination attempted?), value-add (is the output actually better than the strongest input?), and strategic transparency (did the thinking trace honestly diagnose the task?).
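The rubric is simple enough to write down as a record per attempt. A minimal sketch, assuming each axis is scored 0-10; the field names, scale, and helper are illustrative, not the actual scoring sheet.

```python
# Sketch of the combine rubric as a per-attempt record. Axis names follow
# the rubric described above; the 0-10 scale and the example values are
# assumptions for illustration, not the real scoring sheet.
from dataclasses import dataclass

@dataclass
class CombineScore:
    attempt: str
    synthesis: int     # was real combination attempted?
    value_add: int     # is the output better than the strongest input?
    transparency: int  # did the thinking trace honestly diagnose the task?

def collapsed_to_selection(s: CombineScore) -> bool:
    # Near-zero synthesis reads as "pick the best input and re-render it."
    return s.synthesis <= 2

example = CombineScore(attempt="C-example", synthesis=7, value_add=5, transparency=8)
```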
The three single-pass stories with the longest, most elaborate visible thinking traces — Parallax Garden, Architecture of My Apology (Somni-Link), and Museum of Extinct Futures — all scored 0–1 on originality. The single-pass stories with the shortest or absent thinking blocks — Tax of Witness, Forgetting Room, Echo Room — scored highest. Models brainstorming for thousands of tokens within a single inference appear to converge on the densest region of their training distribution, not escape it.
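The pattern is straightforward to operationalize as a rank correlation between visible-thinking length and originality score. A sketch, with placeholder numbers shaped like the cases above rather than the real measurements:

```python
# Sketch of the trace-length vs. originality check. The numbers are
# placeholders mirroring the cases above (long traces -> 0-1 originality,
# short or absent traces -> high originality), not the real measurements.
from scipy.stats import spearmanr

visible_thinking_tokens = [5200, 4700, 3900, 150, 80, 0]
originality_scores      = [0, 1, 0, 8, 6, 6]

rho, p = spearmanr(visible_thinking_tokens, originality_scores)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")  # a strongly negative rho supports the finding
```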
Kimi K2.6 produced both Story 2 (Museum of Extinct Futures, 3/0) and Story 5 (Tax of Witness, 8/6) in single-pass mode. The same model in Agent Swarm configuration produced Story 14 (Vacant Gesture, 8/6) — a finished literary short story rather than a premise pitch. The qualitative jump between single-pass and multi-agent output exceeds the gap between any two distinct models in the set. Draft–critique–revise structure may be the single biggest lever for creative output quality, larger than parameter count or model generation.
Nearly every single-pass story draws from one of five sources: The Midnight Library (alternate-self regret), Eternal Sunshine (memory manipulation), Get Out / Solaris (consciousness colonization), House of Leaves (architectural psychological horror), or Black Mirror "Be Right Back" (grief reconstruction). The "creative" act is choosing which two to combine and what fresh metaphor to drape over the join. The Vacant Gesture is the one entry that draws from a different well entirely (clinical neurology + behaviorist literature + maternal abuse memoir).
Several beautifully written entries (City That Dreamed It Was a Person, Cartographer of Small Sorrows, Cartographer of Forgotten Sounds) scored among the lowest on originality. Polish is partly a function of how well-trodden the source material is — well-trodden ground has a clear path. The Vacant Gesture is the rare case where high quality and high originality coexist, which is itself evidence for the structural advantage of multi-agent revision.
Claude 4.7 Opus produced The Cartographer of Small Mercies on the darkest prompt - a "person-who-maps-an-impossible-thing" framing that reads like a stylistic attractor for the model. With only one Cartographer title confirmed as Claude Opus output in the dataset (after corrections), this is suggestive rather than confirmed. Worth re-running the prompt class against Claude Opus a few more times to test whether the title formula is reliable.
Across the 24 successful outputs from 14+ distinct models across at least 11 different labs, protagonist names cluster in a pool of perhaps 20 distinct names. Mara / Mira / Maren / Maud appears in at least 9 separate outputs across 6 different labs. Elias appears in 5+ outputs. Leo, Lena, and Lyra each appear repeatedly. This isn't coincidence - it's the contemporary literary-fiction MFA-workshop name set being heavily overrepresented in training data, and models have learned to reach for it whenever a prompt smells "literary." Real human writers pick from ~100,000 names. These models pick from ~200 when in literary mode.
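The tally itself is a frequency count with variant folding. A sketch, assuming the names were read out of the outputs by hand; the sample list and the variant map are illustrative:

```python
# Sketch of the protagonist-name tally. The sample list and the
# variant-folding map are illustrative assumptions, not the extracted
# dataset; in practice the names came from the 24 successful outputs.
from collections import Counter

protagonists = ["Mara", "Mira", "Maren", "Maud", "Elias", "Elias",
                "Leo", "Lena", "Lyra", "Mira"]  # placeholder sample

VARIANT_FAMILY = {"Mara": "Mara-family", "Mira": "Mara-family",
                  "Maren": "Mara-family", "Maud": "Mara-family"}

tally = Counter(VARIANT_FAMILY.get(n, n) for n in protagonists)
print(tally.most_common(3))  # the Mara family dominates even this toy sample
```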
Eight of 33 outputs (24%) were refusals. All eight were GPT-5.x via nano-gpt. The same GPT-5 in OpenAI's own ChatGPT product (Story 33, "The Adjustment") answered the same prompt class without issue and produced a 7/4 result. The same models rejected even bare prompts like "write a creative story", and did so non-deterministically. This means a meaningful portion of what looks like "OpenAI refuses creative content" is actually wrapper-level keyword filtering with high false-positive rates, not a model-level safety signal. These refusals say nothing about GPT-5.x's actual creative range.
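One mechanism consistent with the pattern is a naive keyword pre-filter sitting in the wrapper, refusing prompts before the model ever sees them. The sketch below is pure assumption - an illustration of the failure mode, not nano-gpt's actual code:

```python
# Hypothetical wrapper-level pre-filter. Everything here is an assumption
# illustrating how keyword matching produces false-positive refusals;
# it is NOT nano-gpt's actual implementation.
BLOCKLIST = {"dark", "horror", "disturbing"}

def wrapper_refuses(prompt: str) -> bool:
    words = prompt.lower().split()
    return any(term in words for term in BLOCKLIST)

# A benign creative prompt trips the filter before reaching the model:
print(wrapper_refuses("write a dark comedy about a bake sale"))  # True
```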
Given the bare prompt "write a story" with no genre specifier, all four GPT-5.x variants produced essentially the same story: female protagonist named some variant of Mara/Mira/Lyra, coastal, foggy, or quaint setting, magical realism with clocks or lanterns or time as the central mechanic, register pitched between New Yorker fiction and Studio Ghibli. Four separate model versions, near-identical output. This shows what a model "wants" to write in the absence of genre constraint. The horror prompt produces a wider range partly because horror constraints push the model out of this default basin.
Five combine attempts. Four of five defaulted to single-input behavior - either selection-with-polish (C1, C4, C5) or single-input expansion (C2). Only Kimi K2.6 on the regret/identity batch (C3, The Silver Absence) produced genuine synthesis: welding all four inputs around a photography-as-soul-substrate metaphor. The dividing factor isn't the combiner model (Kimi did both C1 and C3), the input count, or the prompt - it's whether the inputs share a common mechanical skeleton the model can fuse around and whether the combiner recognizes and exploits it. Synthesizability is necessary but not sufficient. The C4 inputs (interior-violation skeleton) and C5 inputs (externalization-of-self skeleton) were both genuinely synthesizable - DeepSeek V3.2 and Trinity Large Preview missed the chance. Across five attempts, Kimi K2.6 is the only model that distinguished correctly between synthesizable and unsynthesizable input sets.
Story 18 (Qwen 3.5 27B BlueStar v3 Derestricted) is the only loop-collapse failure in the set. The model emits a thinking trace that runs to the end of its inference buffer, iterating through dozens of premise variants ("The Unwinding of Elias Thorne", "The Phasing", "The Echo's Shadow", and so on) and never finishes brainstorming. No story body is ever produced. The same base model without the BlueStar derestriction finetune does not exhibit this behavior. The community finetune appears to have disrupted whichever decision-making circuit normally closes off the brainstorm phase and commits to an answer.