Evaluation and Regression
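CertOps provides two core evaluation paradigms that operate in tandem: pointwise evaluation, which checks each run against fixed thresholds, and regression, which compares a run against a previously promoted baseline. Together they keep quality under control over time.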
Pointwise Evaluation
Pointwise evaluation measures a simple pass/fail threshold on a per-run basis. When you execute a system run, the Engine tests the target and evaluates the returned responses against hard, objective boundaries.
Examples:
- Deterministic Evaluation: Reject the run if the cosine-similarity drops below 0.85.
- LLM Evaluation: Reject the run if the hallucination score is higher than 0.1.
If any pointwise boundary is violated, CertOps fails the CI/CD pipeline. Because the thresholds are absolute, pointwise evaluation enforces a minimum quality bar even the first time you test a prompt, when no baseline exists yet.
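The pointwise check described above can be sketched in a few lines of Python. This is an illustration only, not CertOps code: the PointwiseGate class, the evaluate_pointwise function, and the higher_is_better flag are hypothetical names, and treating a missing metric as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointwiseGate:
    """Hypothetical gate: one metric checked against one hard threshold."""
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, score: float) -> bool:
        # Higher-is-better metrics must reach the threshold;
        # lower-is-better metrics must not exceed it.
        if self.higher_is_better:
            return score >= self.threshold
        return score <= self.threshold


def evaluate_pointwise(scores: dict[str, float], gates: list[PointwiseGate]) -> bool:
    """Return True only if every gate passes (assumption: a missing metric fails)."""
    return all(
        gate.metric in scores and gate.passes(scores[gate.metric])
        for gate in gates
    )


# The two example thresholds from the text.
GATES = [
    PointwiseGate("cosine-similarity", 0.85),
    PointwiseGate("hallucination", 0.1, higher_is_better=False),
]

print(evaluate_pointwise({"cosine-similarity": 0.91, "hallucination": 0.04}, GATES))
print(evaluate_pointwise({"cosine-similarity": 0.80, "hallucination": 0.04}, GATES))
```

The first run satisfies both thresholds and would be accepted; the second falls below the similarity threshold and would be rejected.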
Regression (Baselines)
Quality testing can't exist in a vacuum. What if your agent passes the pointwise criterion (at least 0.85) but scored 0.95 yesterday on the prod branch? You need to know whether your system degraded.
CertOps tracks baselines by Tag (e.g., prod or staging). Once a run passes pointwise criteria, it can be promoted to represent the new "Golden Baseline" for that tag.
When regression tracking is enabled, CertOps resolves the active baseline for the run's tag and applies two specialized gate types.
1. Deterministic Drift
Drift tests detect drops in performance over time. For each configured metric, the test compares the current run's aggregated average against the baseline run's aggregated average.
Example: You configure a max_drift threshold of 0.05 for your rouge-l metric.
- Baseline Average: 0.90
- Current Average: 0.87 (Drift = 0.03). -> PASS
- If the current average dropped to 0.80, the drift is 0.10, violating your 0.05 threshold. -> FAIL
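The arithmetic in this example can be reproduced with a short Python sketch. The drift_gate function is a hypothetical name, not part of CertOps. It assumes a higher-is-better metric, defines drift as the baseline average minus the current average (so an improvement yields a negative drift and passes), and assumes a drift exactly equal to max_drift still passes.

```python
def drift_gate(baseline_avg: float, current_avg: float, max_drift: float) -> tuple[float, bool]:
    """Return (drift, passed) for a higher-is-better metric.

    Drift is the drop from the baseline average to the current average.
    """
    drift = baseline_avg - current_avg
    return drift, drift <= max_drift


# rouge-l example from the text, with max_drift = 0.05.
for current in (0.87, 0.80):
    drift, passed = drift_gate(baseline_avg=0.90, current_avg=current, max_drift=0.05)
    print(f"current={current:.2f} drift={drift:.2f} -> {'PASS' if passed else 'FAIL'}")
```

Because the comparison uses floating-point subtraction, a real implementation would likely want a small tolerance when the drift sits right at the threshold.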
2. LLM Pairwise (A/B Testing)
Averaged metrics don't always tell the whole story. Pairwise testing re-runs your LLM judges to directly compare the candidate_output (your new system context) against the baseline_output (the previous gold standard) on a per-sample basis.
This produces Win / Loss / Tie rates across the evaluated samples.
Example: You configure a max_loss_rate of 0.3 (30%).
- W: 50% | L: 20% | T: 30% -> PASS (Loss rate of 20% is within the 30% limit; candidate is safe)
- W: 20% | L: 60% | T: 20% -> FAIL (Loss rate of 60% exceeds the 30% limit; candidate is consistently worse than the baseline, pipeline blocked)
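As a rough sketch of how per-sample verdicts turn into rates and a gate decision, the Python below tallies judge verdicts and compares the loss rate to max_loss_rate. The pairwise_gate function and the "win" / "loss" / "tie" verdict labels are hypothetical, the LLM judge itself is not modeled, and treating a loss rate exactly equal to the limit as a pass is an assumption.

```python
from collections import Counter


def pairwise_gate(verdicts: list[str], max_loss_rate: float) -> tuple[dict[str, float], bool]:
    """Return (rates, passed) from per-sample verdicts for the candidate.

    Each verdict is "win", "loss", or "tie", as judged against the baseline output.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    total = len(verdicts)
    rates = {label: counts[label] / total for label in ("win", "loss", "tie")}
    return rates, rates["loss"] <= max_loss_rate


# The two scenarios from the text, over ten samples each.
safe = ["win"] * 5 + ["loss"] * 2 + ["tie"] * 3
worse = ["win"] * 2 + ["loss"] * 6 + ["tie"] * 2

for name, verdicts in (("safe", safe), ("worse", worse)):
    rates, passed = pairwise_gate(verdicts, max_loss_rate=0.3)
    print(name, rates, "PASS" if passed else "FAIL")
```

Only the loss rate drives the decision here, so a candidate with many ties and few wins still passes as long as it rarely loses.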
Note: If a run is the first for its tag and no baseline can be resolved, CertOps reports an informational result and passes the Regression gates.
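The fallback in this note can be illustrated with a small Python sketch. Everything here is hypothetical: the GateResult type, the regression_gate function, the "INFO" status label, and the in-memory dictionary standing in for however CertOps stores baselines per tag. The drift check assumes a higher-is-better metric.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    status: str     # "PASS", "FAIL", or "INFO"
    blocking: bool  # True only when the pipeline should be stopped
    detail: str


def regression_gate(tag: str, current_avg: float, max_drift: float,
                    baselines: dict[str, float]) -> GateResult:
    """Apply a drift check against the baseline for `tag`, if one exists."""
    baseline_avg = baselines.get(tag)
    if baseline_avg is None:
        # No baseline yet: report the situation but do not block the pipeline.
        return GateResult("INFO", False, f"no baseline for tag '{tag}'; regression gates skipped")
    drift = baseline_avg - current_avg
    if drift > max_drift:
        return GateResult("FAIL", True, f"drift {drift:.2f} exceeds {max_drift:.2f}")
    return GateResult("PASS", False, f"drift {drift:.2f} within {max_drift:.2f}")


baselines = {"prod": 0.90}  # hypothetical store: tag -> baseline average
print(regression_gate("staging", 0.87, 0.05, baselines))
print(regression_gate("prod", 0.80, 0.05, baselines))
```

Returning a distinct informational status, rather than a plain pass, keeps it visible in reports that the regression gates did not actually compare anything.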