Field notes

Apr 18, 2026 · 9 min read

Why we write the eval suite before we write the agent

eval-runner · agent-v1 · run history

Production threshold: >90%

Run     Week  Eval score  Change made
RUN-01  Wk 1  31%         First prototype
RUN-02  Wk 1  42%         Prompt refinement
RUN-03  Wk 2  58%         Tool chain fix
RUN-04  Wk 2  67%         Edge case handling
RUN-05  Wk 3  79%         Context window tuning
RUN-06  Wk 3  88%         Final prompt pass
RUN-07  Wk 4  94%         ↑ Ships to production

7 eval runs · 4 weeks · 31% → 94% · scoreboard before player

By the studio

Agnotiq Studio

At Agnotiq, we do something that can feel backward at first: we write the eval suite before we write the agent. That choice is deliberate.

For SMB teams, the goal is not to build a flashy demo; it is to build an agent that earns trust, saves time, and behaves consistently when the work gets messy.

Start with the score, not the story

Most teams begin with the question, “What can the agent do?” We begin with a different question: “What should success look like when a real business user depends on this?” The eval suite turns that answer into a scoring function, which becomes the contract for the product.

That contract matters because agentic software is not judged by one perfect output. It is judged by repeatability, handoff quality, accuracy, and how often it avoids creating work for the customer. If you cannot measure those things early, you will spend the first month arguing from opinions instead of evidence.
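As a rough illustration of what "a scoring function as the contract" can look like, here is a minimal Python sketch. The `EvalCase` and `score_suite` names, the case structure, the repeat count, and the toy agent are all assumptions made for illustration; they are not a description of the studio's actual suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical eval case: a prompt plus a check on the agent's output."""
    name: str
    prompt: str
    passes: Callable[[str], bool]


def score_suite(agent: Callable[[str], str], cases: list[EvalCase], repeats: int = 3) -> float:
    """Return the fraction of cases the agent passes on every repeat.

    Requiring all repeats to pass is one simple way to reward repeatability
    rather than a single lucky output.
    """
    if not cases:
        raise ValueError("eval suite has no cases")
    passed = 0
    for case in cases:
        if all(case.passes(agent(case.prompt)) for _ in range(repeats)):
            passed += 1
    return passed / len(cases)


# Stand-in agent for the example; a real agent would call a model.
def toy_agent(prompt: str) -> str:
    return "refund approved" if "refund" in prompt else "unsure"


cases = [
    EvalCase("refund", "customer asks for refund", lambda out: "refund" in out),
    EvalCase("invoice", "customer asks for invoice copy", lambda out: "invoice" in out),
]
suite_score = score_suite(toy_agent, cases)
print(f"{suite_score:.0%}")
```

The point of the sketch is that the cases and checks can be written before any agent exists. Whatever agent is built later only has to fit the `agent(prompt) -> str` shape assumed here.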

Why the eval comes first

Writing the eval first forces the team to define the business outcome in plain language. It also prevents scope drift, because every new idea has to answer a simple question: does this improve the score in a way the customer would actually feel?

For SMB buyers, that discipline matters even more. They do not want a platform that is technically impressive but operationally fragile. They want something that helps their team move faster without adding supervision overhead.

Scoreboard before player: the three-stage flow we run on every engagement

Eval suite: defines success, finds failures, guides iteration
Build agent: test against score, improve the workflow
Business outcome: more consistency, less supervision, higher trust

The awkward first week

The first prototype usually scores badly. Ours often starts around 31%. That is not a failure; it is the point where the team learns what the system is truly good at and where it breaks.

First prototype score: 31%

A 31% score is uncomfortable because it turns vague ambition into a visible number. It also gives the product team something far more useful than optimism: a list of specific failure modes. Maybe the agent misses context, chooses the wrong tool, over-explains, or stops too early. Each of those problems can be fixed only once it is visible in the eval.
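One way to make that list of failure modes visible is to label each failed case and count the labels. The sketch below assumes a hypothetical result format (a dict with `passed` and an optional `failure_mode` key) and made-up results; only the four failure modes themselves come from the paragraph above.

```python
from collections import Counter

# Hypothetical labels for the failure modes named in the text.
FAILURE_MODES = {"missed_context", "wrong_tool", "over_explained", "stopped_early"}


def tally_failures(results: list[dict]) -> list[tuple[str, int]]:
    """Count failure modes across failed cases, most frequent first."""
    counts = Counter()
    for result in results:
        if result["passed"]:
            continue
        mode = result.get("failure_mode", "unlabeled")
        # Anything outside the known set is grouped so it still shows up.
        counts[mode if mode in FAILURE_MODES else "unlabeled"] += 1
    return counts.most_common()


# Made-up results for illustration only.
results = [
    {"case": "a", "passed": False, "failure_mode": "wrong_tool"},
    {"case": "b", "passed": True},
    {"case": "c", "passed": False, "failure_mode": "missed_context"},
    {"case": "d", "passed": False, "failure_mode": "wrong_tool"},
]
failure_report = tally_failures(results)
print(failure_report)
```

A report like this tells the team which failure mode to work on next, instead of leaving that choice to whichever bug was seen most recently.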

How the work changes

Once the score exists, the build process becomes more disciplined. We do not ask, “Does this feel smarter?” We ask, “Did the score improve, and did the business outcome improve with it?” That shifts the team from improvisation to iteration.

This is especially useful for SMB use cases, where the margin for error is low and the tolerance for complexity is even lower. The eval suite keeps the product honest: if a feature looks impressive but lowers reliability, it does not ship.
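The "does not ship" rule can be sketched as a small gate. The example below is an assumption-heavy illustration: `RunResult`, `should_keep_change`, `ready_for_production`, and the separate `reliability` field are hypothetical, and the reliability numbers are invented. Only the 88% and 94% scores and the >90% production threshold come from the run history in this post.

```python
from dataclasses import dataclass

# Production threshold stated in the run history: score must exceed 90%.
PRODUCTION_THRESHOLD = 0.90


@dataclass
class RunResult:
    run_id: str
    score: float        # overall eval score, between 0 and 1
    reliability: float  # hypothetical: share of cases that pass on every repeat


def should_keep_change(before: RunResult, after: RunResult) -> bool:
    """Keep a change only if the score improved and reliability did not drop."""
    return after.score > before.score and after.reliability >= before.reliability


def ready_for_production(run: RunResult) -> bool:
    """A run ships only when its score is above the production threshold."""
    return run.score > PRODUCTION_THRESHOLD


# Scores from the run history; reliability values are invented for the example.
run_06 = RunResult("RUN-06", score=0.88, reliability=0.80)
run_07 = RunResult("RUN-07", score=0.94, reliability=0.85)
flashy = RunResult("flashy-feature", score=0.95, reliability=0.70)

print(should_keep_change(run_06, run_07))  # score up, reliability held
print(should_keep_change(run_07, flashy))  # score up, reliability down
print(ready_for_production(run_07))
```

Keeping the two checks separate is a design choice in this sketch: one decides whether an individual change stays in, the other decides whether the whole agent is ready to ship.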

What this means for customers

Business buyers do not need to know the mechanics of the eval suite to benefit from it. They feel it in the product as fewer surprises, more consistent outputs, and a system that improves instead of drifting.

That is the real reason we start with scoring. The eval suite is not a technical ritual. It is how we make sure the agent earns its place in a real workflow.

A simple way to think about it

Think of the eval suite as the scoreboard and the agent as the player. If you build the player first, you may end up with a talented system that no one can trust. If you define the scoreboard first, every improvement has a direction.

That direction is what turns a prototype into a product. Ship the eval before the agent.
