Building Evaluation-First Attention: how Kepler scores every change
Traditional coding assistants generate code. Kepler goes further — it evaluates every change against task-specific criteria before delivery. Here's how we built EFA.
When you're building an autonomous development engine, the hardest problem isn't generating code — it's knowing when the code is good enough. Traditional approaches rely on post-hoc testing, but by then the code already exists. We wanted to do better.
The core insight
Every task has success criteria. "Fix the login bug" means the login works. "Add a README" means the README exists and is accurate. Rather than generating code first and checking later, we decided to flip the problem: define what good looks like first, then generate toward it.
How EFA works
Evaluation-First Attention (EFA) has three stages:
- Criteria extraction — Parse the task description into measurable success criteria
- Scoring — As code is generated, continuously score it against the criteria
- Self-correction — If scores plateau below threshold, backtrack and try a different approach
// Simplified EFA loop
while (score < threshold && attempts < max_attempts) {
const plan = await planner.create(criteria);
const code = await executor.generate(plan);
const newScore = await evaluator.score(code, criteria);
if (newScore > currentScore) {
currentScore = newScore;
bestCode = code;
}
}
return bestCode;
What we learned
The hardest part isn't the evaluation — it's knowing what to evaluate. A task like "fix the login bug" might have hidden requirements: preserve user data, don't break other flows, handle edge cases. We've found that explicit criteria + implicit "don't break existing behavior" is a good starting point.
The second insight is that scoring needs to be fast. If evaluation takes as long as generation, you've halved your throughput. We invest heavily in lightweight heuristics that catch 90% of issues in milliseconds.
Results
Since launching EFA, we've seen a 40% reduction in "needs revision" PRs and a 2x improvement in first-attempt success for complex multi-file changes. The key insight: explicitly defining success before starting is far more effective than trying to fix failures after the fact.