WIP!

from 2020:

“Anecdotally, we have seen a surprisingly consistent pattern in the financial data of AI companies, with gross margins often in the 50-60% range – well below the 60-80%+ benchmark for comparable SaaS businesses.”

five years later:

the latest disclosures [as of 2025] suggest Anthropic’s overall gross margin may not have improved since the end of 2023, when the figure was between 50% and 55%.

In comparison, OpenAI earlier this year projected a gross profit margin of 48% in 2025. 

the AI gross margin profile hasn't changed in five years, which makes these projections eyebrow-raising:

OpenAI: 70% gross profit margin by 2029.

Anthropic: 75% gross profit margin by 2027.

i know, “questioning margins is a boring cliche.” many will say it’s not a good metric right now, drawing a parallel to SaaS growth: obviously, you want to spend money to get customers now and build defensibility.

there are two missteps here:

  1. customer acquisition costs are not what's driving poor margins - COGS are. some have tried to reframe COGS as CAC, which has funky implications for distribution (that i'll spell out shortly).

  2. are costs actually falling? and if they are, doesn't this issue get worse?

to address the first point:

“COGS = CAC: In AI, compute is the new customer-acquisition cost. Keep margins tight by building self-selling products that grow through usage and community.”

from https://www.bvp.com/atlas/scaling-an-ai-supernova-lessons-from-anthropic-cursor-and-fal

your distribution is your pricing power. network effects, controlled supply, proprietary tech, etc. reframing COGS as CAC is essentially saying that underpricing your services is your distribution strategy - you see how circular it is, to claim that underpricing your services is key to having pricing power (presumably so you can raise prices in the future)? there's nothing there that accrues.

as i've written before, the culprit of poor margins is high variance of usage and resource requirements across different requests. this is here to stay - it's inherent to the architecture of large language models and their product surfaces.

while i haven't had the chance to see the unit economics of leading AI companies, in my own career i've seen that as much as 30-50% of usage for AI products can reside in the long tail of intent. said another way: if you give users a blank box, users can - and will - enter just about anything into it.
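
a toy sketch of why that long tail hurts (every number here is invented, not from any company's actuals): sample per-request serving costs from a heavy-tailed distribution against a flat effective price per request, and a small slice of requests ends up carrying most of COGS while gross margin lands right around that 50-55% band.

```python
import numpy as np

rng = np.random.default_rng(0)

# invented numbers: flat effective price per request, heavy-tailed serving cost
price_per_request = 0.02
n_requests = 1_000_000

# lognormal serving cost: most requests are cheap, a long tail is very expensive
costs = rng.lognormal(mean=np.log(0.003), sigma=1.5, size=n_requests)

revenue = price_per_request * n_requests
cogs = costs.sum()
gross_margin = 1 - cogs / revenue

# share of total serving cost coming from the most expensive 10% of requests
tail_share = np.sort(costs)[-n_requests // 10 :].sum() / cogs

print(f"gross margin: {gross_margin:.0%}")                       # ~54% with these parameters
print(f"top 10% of requests' share of COGS: {tail_share:.0%}")   # ~59%
```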

which brings me to the second prong of the argument: costs are plummeting anyway, poor margins are a temporary blip.

per-token costs have fallen with some model releases, yes. but "cost per task" hasn't decreased meaningfully, because each task just outputs more tokens.
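
back-of-envelope with hypothetical numbers (not real list prices): even a steep cut in per-token price doesn't help if the task now burns several times more tokens on reasoning traces, tool calls, and retries.

```python
# hypothetical numbers for illustration, not real list prices
old_price_per_mtok = 10.00   # $ per million output tokens, older model
new_price_per_mtok = 2.50    # $ per million output tokens, newer model (75% cheaper per token)

old_tokens_per_task = 2_000      # a short, single-shot answer
new_tokens_per_task = 15_000     # reasoning traces, tool calls, retries

old_cost_per_task = old_price_per_mtok * old_tokens_per_task / 1_000_000
new_cost_per_task = new_price_per_mtok * new_tokens_per_task / 1_000_000

print(f"old cost per task: ${old_cost_per_task:.4f}")   # $0.0200
print(f"new cost per task: ${new_cost_per_task:.4f}")   # $0.0375 - per-token price fell, cost per task rose
```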

the fact that costs haven’t actually declined might be a blessing in disguise. a lack of pricing power, increasing competition and declining costs is a self-propelling race to the bottom.

one argument you'll hear often is that smaller models can optimize costs as a way to improve margins.

"If a subset of the more expensive queries can be identified and serviced by a cheaper model, the margin problem is mollified. Many of the leading app companies we’re aware of are quite sophisticated at routing high value queries to margin optimized inference." -a16z

this is a curious argument. smaller models mean more models, which means more competition, which of course means declining pricing power and lower margins.

but who said anything about competition? "more models" might just imply a centralized routing model, like GPT-5, which reportedly has great margins.
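
for intuition on what "routing high value queries to margin optimized inference" buys, a toy sketch with invented unit economics: blend a cheap model and an expensive one and watch how the margin swings with the routing split.

```python
# invented unit economics: per-query serving cost on each model, and revenue per query
cheap_cost, expensive_cost = 0.002, 0.03
revenue_per_query = 0.02

def blended_margin(share_routed_cheap):
    """gross margin if `share_routed_cheap` of queries go to the cheap model."""
    cogs = share_routed_cheap * cheap_cost + (1 - share_routed_cheap) * expensive_cost
    return 1 - cogs / revenue_per_query

for share in (0.0, 0.5, 0.8):
    print(f"{share:.0%} routed cheap -> gross margin {blended_margin(share):+.0%}")
# 0% routed cheap  -> -50% (every query on the big model loses money)
# 50% routed cheap -> +20%
# 80% routed cheap -> +62%
```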

a16z, five years ago, spotlighted the very counterargument that disproves that thesis:

"Eliminate model complexity as much as possible. We’ve seen a massive difference in COGS between startups that train a unique model per customer versus those that are able to share a single model (or set of models) among all customers. The “single model” strategy is easier to maintain, faster to roll out to new customers, and supports a simpler, more efficient engineering org."

you incur serious time and maintenance costs with more models, which leads to variance, which leads to varying requests, which leads to… you get it.

so okay, variance of usage is guaranteed with open-ended workflows, which in turn guarantees structurally low margins (compared to SaaS baselines). that is not necessarily a blocker to these companies achieving free cash flow.

what elevates this from a business-model problem to a first-class product problem is that quality gets worse with more users.

this is explicit in the system outages and service interruptions i highlighted in part 1, and less explicit but still visible in perceptions of degraded quality and quantization.

“it's so absurd that you have CUDA engineers getting paid $100M/year to squeeze out 25% more performance and then you have app-layer developers who undo all of it plus much more

token-based pricing is prob what makes labs shrug but like ... if the point is speed, app matters!!”

token-based pricing schemes are deeply misaligned with delivered value to the end-user.

relatedly:

"AI can write novel proofs at the level of a world-class competitive mathematician

but it still can’t reliably book me a weekend trip to boston

so strange"

some of this, to be frank, is a skill issue. i.e. variance exists in the mapping between requests and tasks, and a more skilled user can complete a task with fewer requests than a less skilled one.

but it can't all be explained by that, or even by the "honeymoon period" wearing off and weaknesses becoming more obvious with repeated exposure.

consider the emerging argument that labs already have models that outperform the publicly available models today. people commonly cite OpenAI's unreleased model that won gold on the IMO in July, and speculate that such models are too expensive to release. that's probably true, and i submit the corollary:

intelligence is jagged and labs are incentivized to showcase the peaks. valleys only surface through usage.

these quality troughs are invisible in the lab because they are statistical: you hit them only when you sample the whole distribution at scale.
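
back-of-envelope on why (the rates are invented): a failure mode that shows up on one request in a thousand is easy to miss with a few hundred eval samples, but at production volume it becomes a daily occurrence.

```python
# invented rates: a failure mode hit on 1-in-1,000 requests
p_fail = 0.001

eval_samples = 300              # size of a hypothetical internal eval set
daily_requests = 5_000_000      # hypothetical production traffic

# chance the eval never surfaces the failure mode at all
p_missed_in_eval = (1 - p_fail) ** eval_samples
expected_daily_failures = p_fail * daily_requests

print(f"chance the eval never sees it: {p_missed_in_eval:.0%}")              # ~74%
print(f"expected bad experiences per day: {expected_daily_failures:,.0f}")   # 5,000
```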

i’ve made an implicit assumption thus far: AI services are either not valuable enough or too competitive to achieve pricing power. the latter is true for pure API stuff, but i'm suspicious of the former for AI products. 

it is my belief that most of the meaningful progress next year will happen in the agentic harnesses rather than the models.

this is interesting for two reasons. first, both claude code and codex are open source: any improvements at the harness layer will proliferate much faster than improvements at the model layer.

second, agentic harnesses are actionable, which means they're valuable. i mean, claude code already proved codegen is price inelastic: people are paying $200 a month, which is just unheard of.

so, agentic harnesses are a better surface than chat to innovate on pricing that actually attempts to capture value. and the direction codex is taking - paid credits at the message level, rather than the token level - is an important one. credit consumption varies not with tokens consumed, but with value delivered.

if you can measure “how much the user actually got” (time saved, revenue created, problem solved), you can price that instead of the raw compute.

once you price the result, you can afford to spend the exact amount of compute and data that maximizes profit on each task.
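
a stylized version of that claim (the success curve and prices are invented): once revenue is tied to the result rather than the tokens, the rational move is to pick the compute budget per task that maximizes expected profit, not the one that minimizes spend.

```python
import numpy as np

# invented model: more compute per attempt raises the chance the task succeeds,
# with diminishing returns. only successful tasks earn the result price.
result_price = 5.00                               # what the user pays for a completed task
compute_budgets = np.linspace(0.01, 2.0, 200)     # $ of compute spent per attempt

def p_success(compute):
    return 1 - np.exp(-3.0 * compute)             # saturating success curve (assumed shape)

expected_profit = result_price * p_success(compute_budgets) - compute_budgets
best = compute_budgets[np.argmax(expected_profit)]

print(f"profit-maximizing compute per task: ${best:.2f}")                    # ~$0.90 here
print(f"expected profit at that budget:     ${expected_profit.max():.2f}")   # ~$3.76
```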

so the layer that owns the meter for “result” decides how much compute and data to deploy and keeps the spread between cost and price. today that meter sits inside the model; tomorrow it could sit inside an agent that plans the whole workflow.

so the prize may go to whoever defines and measures “intelligence value,” not necessarily whoever trains the biggest model.

the foundation-model layer is a subcontractor—cheap, interchangeable, and paid by the gallon. the (potential) orchestration layer is the one that can turn the meter into a moat (workflow lock-in, usage data, integrations).

Notes

i'm writing this at a time when it's very in vogue to cry about the AI bubble and unsustainable unit economics, but all that commentary misses that there's something much more interesting going on.

https://a16z.com/the-new-business-of-ai-and-how-its-different-from-traditional-software/

This means edge cases are everywhere: as much as 40-50% of intended functionality for AI products we’ve looked at can reside in the long tail of user intent.

Put another way, users can – and will – enter just about anything into an AI app.

https://archive.ph/VUzZH#selection-1269.0-1273.278 

It isn’t clear what percentage of Anthropic revenue comes from direct sales today, but the latest disclosures suggest Anthropic’s overall gross margin may not have improved since the end of 2023, when the figure was between 50% and 55%.

In comparison, OpenAI earlier this year projected a gross profit margin of 48% in 2025. OpenAI projected steady improvements in the coming years, en route to an eventual 70% gross profit margin by 2029. It isn’t clear whether the two firms calculated their margins the same way.

https://archive.ph/GwxDp#selection-1601.0-1608.0

lovable, replit gross margins

https://a16z.com/questioning-margins-is-a-boring-cliche/ 

Furthermore, it’s important to realize that, in many domains, the cost to serve a query is highly variable. If a subset of the more expensive queries can be identified and serviced by a cheaper model, the margin problem is mollified. Many of the leading app companies we’re aware of are quite sophisticated at routing high value queries to margin optimized inference.

For many AI subscription businesses, a small minority of users are driving the bulk of the usage and therefore underlying costs. This means that rate limiting a small percentage of total users (who tend to be loud on X) can meaningfully reduce costs without impacting revenue growth significantly. Even frontier models face this issue, typically seeing limited revenue impact but meaningful cost reduction after introducing rate limits for the top 5% of users. 

————

I think what most people who claim that we've hit a scaling wall with AI miss is that the costs scale too.

Labs already have models that dramatically outperform the best models that are publicly available today.

They're just too expensive to release.

Eliminate model complexity as much as possible. We’ve seen a massive difference in COGS between startups that train a unique model per customer versus those that are able to share a single model (or set of models) among all customers. The “single model” strategy is easier to maintain, faster to roll out to new customers, and supports a simpler, more efficient engineering org.



https://techcrunch.com/2024/01/23/ai-startups-margins-low-valuations/

https://a16z.com/the-new-business-of-ai-and-how-its-different-from-traditional-software/ 

https://www.bvp.com/atlas/scaling-an-ai-supernova-lessons-from-anthropic-cursor-and-fal

https://techcrunch.com/2025/08/07/the-high-costs-and-thin-margins-threatening-ai-coding-startups/