
Evaluation and Regression

CertOps provides two core evaluation paradigms that operate in tandem: pointwise evaluation gates each run against absolute criteria, and regression testing compares each run against a tracked baseline. Together, they keep quality control manageable over time.

Pointwise Evaluation

Pointwise evaluation applies pass/fail thresholds on a per-run basis. When you execute a system run, the Engine invokes the target and evaluates the returned responses against hard, objective boundaries.

Examples:

  • Deterministic Evaluation: Reject the run if cosine similarity drops below 0.85.
  • LLM Evaluation: Reject the run if the hallucination score exceeds 0.1.

If any pointwise boundary is violated, CertOps fails the run and blocks the CI/CD pipeline. Even if this is the first time you've ever tested the prompt, pointwise evaluation guarantees baseline competence.
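The gating logic above can be sketched as follows. This is a minimal illustration, not the CertOps API: the function name, metric keys, and thresholds are assumptions drawn from the examples above.

```python
def pointwise_gate(metrics: dict) -> bool:
    """Return True only if every hard boundary holds (illustrative sketch)."""
    checks = [
        metrics["cosine_similarity"] >= 0.85,   # deterministic boundary
        metrics["hallucination_score"] <= 0.1,  # LLM-judge boundary
    ]
    return all(checks)

# A run passes only if it meets every boundary; one violation blocks it.
print(pointwise_gate({"cosine_similarity": 0.91, "hallucination_score": 0.04}))  # True
print(pointwise_gate({"cosine_similarity": 0.91, "hallucination_score": 0.30}))  # False
```

Note that the boundaries are absolute: no comparison to any previous run is involved at this stage.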

Regression (Baselines)

Quality testing can't exist in a vacuum. What if your agent passes the baseline pointwise criteria (>0.85), but actually scored a 0.95 yesterday on the prod branch? You need to know if your system degraded.

CertOps tracks baselines by Tag (e.g., prod or staging). Once a run passes pointwise criteria, it can be promoted to represent the new "Golden Baseline" for that tag.
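The promotion flow can be pictured as a tag-keyed store, sketched below. The data model and function names are hypothetical, not the CertOps storage layer.

```python
# tag -> the current "Golden Baseline" run for that tag
baselines: dict[str, dict] = {}

def promote(tag: str, run: dict, passed_pointwise: bool) -> None:
    # Only runs that cleared the pointwise gates may become the baseline.
    if passed_pointwise:
        baselines[tag] = run

promote("prod", {"run_id": "r-102", "rouge_l": 0.90}, passed_pointwise=True)
promote("staging", {"run_id": "r-103", "rouge_l": 0.70}, passed_pointwise=False)
print(baselines)  # only the passing prod run was promoted
```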

When regression tracking is enabled, CertOps resolves the active baseline for the run's tag and invokes two specialized gate types.

1. Deterministic Drift

Drift tests detect drops in performance over time. For each metric, the gate compares the current run's aggregated average against the baseline run's aggregated average.

Example: You configure a max_drift threshold of 0.05 for your rouge-l metric.

  • Baseline Average: 0.90
  • Current Average: 0.87 (Drift = 0.03). -> PASS
  • If the current average dropped to 0.80, the drift is 0.10, violating your 0.05 threshold. -> FAIL
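The arithmetic above reduces to a single comparison. A minimal sketch, assuming a hypothetical drift_gate helper (not the CertOps API):

```python
def drift_gate(baseline_avg: float, current_avg: float, max_drift: float) -> bool:
    """Pass unless the current average falls more than max_drift below baseline."""
    drift = baseline_avg - current_avg
    return drift <= max_drift

print(drift_gate(0.90, 0.87, max_drift=0.05))  # True  (drift = 0.03)
print(drift_gate(0.90, 0.80, max_drift=0.05))  # False (drift = 0.10)
```

Improvements (negative drift) always pass; only drops beyond the threshold block the pipeline.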

2. LLM Pairwise (A/B Testing)

Averaged metrics don't always tell the whole story. Pairwise testing re-runs your LLM judges to directly compare the candidate_output (your new system's output) against the baseline_output (the previous gold standard) on a per-sample basis.

This results in a Win / Loss / Tie rate.

Example: You configure a max_loss_rate of 0.3 (30%).

  • W: 50% | L: 20% | T: 30% -> PASS (Candidate is safe)
  • W: 20% | L: 60% | T: 20% -> FAIL (Candidate is consistently worse than the baseline, pipeline blocked)
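The gate reduces per-sample verdicts to a loss rate and compares it against max_loss_rate. A sketch under assumed names (the verdict encoding and function are illustrative):

```python
def pairwise_gate(verdicts: list[str], max_loss_rate: float) -> bool:
    """verdicts: 'win' | 'loss' | 'tie' per sample, candidate vs. baseline."""
    loss_rate = verdicts.count("loss") / len(verdicts)
    return loss_rate <= max_loss_rate

safe = ["win"] * 5 + ["loss"] * 2 + ["tie"] * 3   # W 50% / L 20% / T 30%
worse = ["win"] * 2 + ["loss"] * 6 + ["tie"] * 2  # W 20% / L 60% / T 20%
print(pairwise_gate(safe, max_loss_rate=0.3))   # True
print(pairwise_gate(worse, max_loss_rate=0.3))  # False
```

Note that only losses count against the gate; a high tie rate alone never blocks the pipeline.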

Note: If a run is the first for its tag and no baseline can be resolved, CertOps returns an informational result and the regression gates pass gracefully.