Claude & AI · 6 min read

Why we bet the product on Claude, not the cheapest model

We didn't pick the model that was cheapest. We picked the one that didn't flinch.


Daniel

Fifteen years running growth for SaaS, ecommerce, and hardware brands. Currently shipping SaaSValidatr out of Australia.

Every founder building on top of AI has had the same meeting with themselves: which model do we pick? The answer that feels 'safe' is usually whichever model is cheapest per million tokens, because margins matter and AI costs scale with usage. We looked at that maths hard and then deliberately walked away from it.

SaaSValidatr is a scoring product. Somebody types in an idea, we grade it across five dimensions, surface risks, propose revenue models, name the competitors, write a build prompt. If the scoring is off, the product is worthless. So the question wasn't 'what's cheapest?' It was 'what gives the founder the fewest reasons to doubt the verdict?'

The test we ran on every frontier model

We hand-wrote twelve ideas: four obviously strong, four obviously weak, four deliberately ambiguous. Then we ran each through GPT-4, Claude Sonnet, Gemini, and one open-source challenger, and asked each to score market viability, revenue potential, feasibility, uniqueness, and simplicity. We graded the graders.
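If you want to run the same exercise, the harness is small. Here's a minimal sketch of the idea: hand-label each test idea with the score band it should land in, then measure what fraction of each model's scores fall inside their band. The idea names, bands, and scores below are illustrative, not our actual test set.

```python
# Hedged sketch of a "grade the graders" harness. Bands and data are
# illustrative assumptions, not SaaSValidatr's real labels or scores.

EXPECTED_BANDS = {
    "strong": (7, 10),    # obviously strong ideas should score high
    "weak": (1, 4),       # obviously weak ideas should score low
    "ambiguous": (4, 7),  # ambiguous ideas should draw cautious mid scores
}

def grade_grader(labels: dict[str, str], scores: dict[str, float]) -> float:
    """Fraction of ideas whose model score lands in its hand-labelled band."""
    hits = 0
    for idea, label in labels.items():
        lo, hi = EXPECTED_BANDS[label]
        if lo <= scores[idea] <= hi:
            hits += 1
    return hits / len(labels)

# A miniature three-idea set (the real exercise used twelve).
labels = {"idea_a": "strong", "idea_b": "weak", "idea_c": "ambiguous"}

# One model's viability scores out of 10: right on the easy ones,
# overconfident on the ambiguous one -- the failure mode we saw.
model_scores = {"idea_a": 8, "idea_b": 3, "idea_c": 9}

print(f"{grade_grader(labels, model_scores):.2f}")  # 2 of 3 in band
```

The ambiguous column is the one worth staring at: every model we tried passed the strong and weak rows, so the whole signal lived in how a model handled uncertainty.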

The cheapest model was right on the easy ones and completely wrong on the ambiguous ones — it would confidently score a mediocre idea 8/10 because the description sounded enthusiastic. That's a deal-breaker for a validation tool. If the model is seduced by buzzwords it's validating the pitch, not the idea.

Claude's scoring was the most conservative, the most specific in its risks, and — critically — the most willing to say 'this will probably fail.' That last part matters. Most founders don't want to hear no. A model that always finds a reason to say yes has optimised itself out of the job.

Why 'doesn't flinch' is a feature

When we launched the Devil's Advocate mode we stress-tested it against ideas we loved internally. GPT gave us encouraging energy with soft caveats. Claude told us exactly which assumption was load-bearing and which competitor would eat us first. It was uncomfortable. That discomfort was the sign we'd picked the right model.

"A validation tool that always says yes is just an expensive mirror."
(from our first internal review)

There's also a compliance dimension that gets under-discussed. Anthropic's API terms are explicit: inputs and outputs are not used for model training. For a product where users paste their not-yet-public business ideas into a text box, that line in the terms of service isn't a nice-to-have — it's the product. We're not going to build a wedge that depends on 'trust us, we'll ask the provider nicely.'

What we get in return for picking the better model

  • Specific competitors by name, not vague categories.
  • Risks the user hasn't considered — not the risks anyone would write on a whiteboard in five minutes.
  • A willingness to say 'this is a bad idea, here's why.' Users find that more trustworthy than universal enthusiasm.
  • Longer context windows that let us pass the whole idea analysis into downstream features — pitch deck, MVP page, competitor scan — without truncation.

Yes, we pay more per request. We bake that into the plan pricing and it's fine. What we get in return is a product where the user walks away with useful information rather than confirmation bias, and we sleep at night knowing their ideas aren't training someone else's model.

The part nobody writes down

Fifteen years doing growth for SaaS companies taught me one thing louder than anything else: the tool has to be right before the marketing can be honest. Every clever positioning line we've tested on the SaaSValidatr landing page works because the thing underneath actually works. I've seen enough founders paper over a mediocre product with great copy. That ceiling is real and it's low. Picking Claude was a bet that we'd rather invest in the floor than the ceiling.

If you're building on AI right now, I'd push you to run the same exercise. Hand-write twelve inputs. Grade the graders. Pick the model that's right when being right is awkward, not the model that's cheap when being wrong is cheap.

Score your first idea in 30 seconds — free

Free forever · no card required