Field notes
Apr 18, 2026 · 9 min read

Why we write the eval suite before we write the agent

Figure: eval-runner · agent-v1 · run history (production threshold: >90%)

Run      Week   Eval score
RUN-01   Wk 1   31%
RUN-02   Wk 1   42%
RUN-03   Wk 2   58%
RUN-04   Wk 2   67%
RUN-05   Wk 3   79%
RUN-06   Wk 3   88%
RUN-07   Wk 4   94%

7 eval runs · 4 weeks · 31% → 94% · scoreboard before player
By the studio
Agnotiq Studio

At Agnotiq, we do something that can feel backward at first: we write the eval suite before we write the agent. That choice is deliberate.

For small and mid-sized business (SMB) teams, the goal is not to build a flashy demo; it is to build an agent that earns trust, saves time, and behaves consistently when the work gets messy.

Start with the score, not the story

Most teams begin with the question, “What can the agent do?” We begin with a different question: “What should success look like when a real business user depends on this?” The eval suite turns that answer into a scoring function, which becomes the contract for the product.
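As a minimal sketch of what turning that answer into a scoring function can look like, the Python below scores an agent as the share of eval cases it passes. Every name here (EvalCase, score, the three cases, stub_agent) is hypothetical and chosen for illustration; it is not a description of our actual suite. Note that the cases exist before any real agent does, which is why a placeholder stands in for it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One scenario plus a check that decides whether the output is acceptable."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def score(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the agent passes, from 0.0 to 1.0."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


# Hypothetical cases, written before any agent exists.
CASES = [
    EvalCase("refund_window", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("handoff", "I want to talk to a person.",
             lambda out: "human" in out.lower()),
    EvalCase("no_guessing", "What is the warranty on item X-0?",
             lambda out: "not sure" in out.lower()),
]


def stub_agent(prompt: str) -> str:
    """Placeholder agent (an assumption): handles refunds and nothing else."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "Sure, here is an answer."


if __name__ == "__main__":
    print(f"{score(stub_agent, CASES):.0%}")  # prints 33%
```

A pass/fail fraction is the simplest possible contract; a real suite would likely weight cases by how much a failure costs the customer.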

That contract matters because agentic software is not judged by one perfect output. It is judged by repeatability, handoff quality, accuracy, and how often it avoids creating work for the customer. If you cannot measure those things early, you will spend the first month arguing from opinions instead of evidence.
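Repeatability is the easiest of those qualities to illustrate. The sketch below, under the assumption that a case can be re-run cheaply, sends the same prompt several times and reports the share of trials that pass. The repeatability function and the deliberately flaky agent are hypothetical stand-ins, not part of any real library.

```python
import itertools
from typing import Callable


def repeatability(agent: Callable[[str], str], prompt: str,
                  check: Callable[[str], bool], trials: int = 5) -> float:
    """Run the same prompt `trials` times and return the share that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passes / trials


def make_flaky_agent() -> Callable[[str], str]:
    """Hypothetical agent that gives a bad answer on every third call."""
    calls = itertools.count(1)

    def agent(prompt: str) -> str:
        if next(calls) % 3 == 0:
            return "I cannot help with that."
        return "Refunds are accepted within 30 days."

    return agent


if __name__ == "__main__":
    rate = repeatability(make_flaky_agent(), "What is the refund window?",
                         lambda out: "30 days" in out, trials=6)
    print(f"{rate:.0%}")  # prints 67%
```

One perfect output would hide this agent's problem; six runs of the same case expose it.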

Why the eval comes first

Writing the eval first forces the team to define the business outcome in plain language. It also prevents scope drift, because every new idea has to answer a simple question: does this improve the score in a way the customer would actually feel?

For SMB buyers, that discipline matters even more. They do not want a platform that is technically impressive but operationally fragile. They want something that helps their team move faster without adding supervision overhead.

Figure: scoreboard before player, the three-stage flow we run on every engagement

1. Eval suite
  • Defines success
  • Finds failures
  • Guides iteration
2. Build agent
  • Test against score
  • Improve the workflow
3. Business outcome
  • More consistency
  • Less supervision
  • Higher trust

The awkward first week

The first prototype usually scores badly. Ours often starts around 31%, which is where RUN-01 landed in the run history above. That is not a failure; it is the point where the team learns what the system is truly good at and where it breaks.

First prototype score: 31%

A 31% score is uncomfortable because it turns vague ambition into a visible number. It also gives the product team something far more useful than optimism: a list of specific failure modes. Maybe the agent misses context, chooses the wrong tool, over-explains, or stops too early. Each of those problems can be fixed only once it is visible in the eval.
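One way to get from a single number to that list is to label each failing case and count the labels. The sketch below assumes each eval result carries a failure_mode tag assigned by a reviewer or an automated check; the result records and the failure_report helper are hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical eval results; the failure labels mirror the examples in the text.
RESULTS = [
    {"case": "invoice_lookup", "passed": False, "failure_mode": "missed_context"},
    {"case": "refund_request", "passed": False, "failure_mode": "wrong_tool"},
    {"case": "order_status", "passed": True, "failure_mode": None},
    {"case": "address_change", "passed": False, "failure_mode": "missed_context"},
    {"case": "escalation", "passed": False, "failure_mode": "stopped_early"},
]


def failure_report(results: list[dict]) -> list[tuple[str, int]]:
    """Count failure modes among failing cases, most frequent first."""
    failures = Counter(r["failure_mode"] for r in results if not r["passed"])
    return failures.most_common()


if __name__ == "__main__":
    for mode, count in failure_report(RESULTS):
        print(f"{mode}: {count}")
```

Sorting by frequency gives the team an obvious place to start: the failure mode that costs the most points.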

How the work changes

Once the score exists, the build process becomes more disciplined. We do not ask, “Does this feel smarter?” We ask, “Did the score improve, and did the business outcome improve with it?” That shifts the team from improvisation to iteration.

This is especially useful for SMB use cases, where the margin for error is low and the tolerance for complexity is even lower. The eval suite keeps the product honest: if a feature looks impressive but lowers reliability, it does not ship.
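That rule can be written down as a gate. The sketch below assumes two conditions: the candidate must clear the production threshold of more than 90% shown in the run history at the top of this post, and it must not score below the previous run. The can_ship name and the no-regression condition are our illustration of the rule, not a fixed policy.

```python
PRODUCTION_THRESHOLD = 0.90  # from the run history: production threshold >90%


def can_ship(candidate: float, baseline: float,
             threshold: float = PRODUCTION_THRESHOLD) -> bool:
    """Ship only if the score clears the threshold and did not regress."""
    clears_threshold = candidate > threshold
    no_regression = candidate >= baseline
    return clears_threshold and no_regression


if __name__ == "__main__":
    # Scores taken from the run history table.
    print(can_ship(0.88, 0.79))  # RUN-06 vs RUN-05: False, below threshold
    print(can_ship(0.94, 0.88))  # RUN-07 vs RUN-06: True
    # Hypothetical later run: above threshold but worse than RUN-07.
    print(can_ship(0.92, 0.94))  # False, reliability went down
```

The third call is the case the paragraph describes: a change that still looks good in absolute terms but lowers the score does not ship.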

What this means for customers

Business buyers do not need to know the mechanics of the eval suite to benefit from it. They feel it in the product as fewer surprises, more consistent outputs, and a system that improves instead of drifting.

That is the real reason we start with scoring. The eval suite is not a technical ritual. It is how we make sure the agent earns its place in a real workflow.

A simple way to think about it

Think of the eval suite as the scoreboard and the agent as the player. If you build the player first, you may end up with a talented system that no one can trust. If you define the scoreboard first, every improvement has a direction.

That direction is what turns a prototype into a product. Ship the eval before the agent.
