Evaluation & Regression Config
Within your target definition in the certops.yaml manifest, you must configure how CertOps grades the responses it receives from your AI component.
CertOps utilizes a two-tiered gating system:
- Pointwise Evaluation: Grading individual responses in isolation against expected criteria.
- Regression (Drift): Comparing current performance against a historical known-good baseline.
1. Pointwise Evaluation Block
The evaluation block defines the absolute criteria your target must meet for a single run.
```yaml
evaluation:
  # Map dataset columns -> metric prompt variables
  metrics_mapping:
    input: "user_query"
    reference: "expected_document"
  # Deterministic local tier
  deterministic:
    - metric: "cosine-similarity"
      threshold: 0.80
      operator: "gte"
      blocking: true
  # LLM Judge per-sample tier
  llm:
    - metric: "hallucination"
      threshold: 1.0
      operator: "gte"
      blocking: true
    - metric: "answer-relevance"
      threshold: 0.8
      operator: "gte"
      blocking: false
```
The blocking Flag (Hard vs Soft Gates)
- blocking: true: This metric is a Hard Gate. If the average score across your dataset falls below the threshold, the Component Verdict (and, by extension, the Suite) is REJECTED, and the CLI exits with a non-zero exit code.
- blocking: false: This metric is a Soft Gate. If it fails, CertOps logs a warning in the dashboard, but the overall run can still be marked as APPROVED.
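The hard/soft gate semantics described above can be sketched as a simple reduction over per-metric results. This is a minimal illustration of the logic, not CertOps internals; the function name and result shape are assumptions:

```python
def component_verdict(metric_results):
    """Reduce per-metric averages to an overall verdict.

    metric_results: list of dicts with keys
      'metric', 'avg_score', 'threshold', 'operator', 'blocking'.
    """
    ops = {"gte": lambda s, t: s >= t, "lte": lambda s, t: s <= t}
    verdict, warnings = "APPROVED", []
    for r in metric_results:
        if ops[r["operator"]](r["avg_score"], r["threshold"]):
            continue  # gate passed
        if r["blocking"]:
            verdict = "REJECTED"          # hard gate: reject the whole run
        else:
            warnings.append(r["metric"])  # soft gate: warn, keep going
    return verdict, warnings

# A failing soft gate leaves the run APPROVED; only hard gates reject.
print(component_verdict([
    {"metric": "cosine-similarity", "avg_score": 0.85,
     "threshold": 0.80, "operator": "gte", "blocking": True},
    {"metric": "answer-relevance", "avg_score": 0.70,
     "threshold": 0.80, "operator": "gte", "blocking": False},
]))
# → ('APPROVED', ['answer-relevance'])
```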
The metrics_mapping Object
Because datasets are decoupled from metrics, you must explicitly tell CertOps which dataset columns should be piped into the Jinja2 variables required by your chosen metrics.
If your chosen metric (hallucination) requires an {{ input }} variable, but your CSV has a column named user_query, you map it here (input: "user_query").
(Note: The {{ output }} variable is automatically injected by CertOps using the parsed response_path from your target).
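Conceptually, the mapping is just a column rename applied to each dataset row before the metric prompt is rendered, plus the automatic output injection. This is a hypothetical sketch of that step; the function name and row shapes are assumptions, not the CertOps implementation:

```python
def apply_metrics_mapping(row, mapping, model_output):
    """Build the variable dict a metric's prompt template would receive.

    row:          one dataset row, e.g. {"user_query": ..., "expected_document": ...}
    mapping:      metric variable -> dataset column (from metrics_mapping)
    model_output: parsed response; stands in for the auto-injected {{ output }}
    """
    variables = {var: row[column] for var, column in mapping.items()}
    variables["output"] = model_output
    return variables

vars_ = apply_metrics_mapping(
    row={"user_query": "What is the refund policy?",
         "expected_document": "refund_policy.md"},
    mapping={"input": "user_query", "reference": "expected_document"},
    model_output="Refunds are accepted within 30 days.",
)
print(sorted(vars_))  # → ['input', 'output', 'reference']
```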
2. Regression Block
The regression block mirrors the evaluation block, but instead of grading against an absolute threshold, it enforces comparisons against a Baseline (usually representing the version currently live in Production).
```yaml
regression:
  baseline: "prod"
  # Average Drift Tier
  deterministic:
    - metric: "cosine-similarity"
      max_drift: 0.05 # Avg run score cannot drop more than 5% vs baseline
    - metric: "hallucination"
      max_drift: 0.1
  # Pairwise LLM vs baseline output tier
  metrics_mapping:
    input: "question"
  llm:
    - metric: "pairwise-general-quality"
      max_loss_rate: 0.3 # Candidate can't lose > 30% of individual matches
      blocking: true
```
- max_drift: In the deterministic/pointwise tier, this prevents silent degradation. If your baseline hallucination score was 0.95, a max_drift of 0.1 means your new candidate build cannot score lower than 0.85.
- max_loss_rate: In the pairwise LLM tier, the judge_model_config explicitly compares the old prod output and the new candidate output side-by-side. If the new candidate loses more than 30% of those head-to-head comparisons, the run is rejected.
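In code, the two regression checks reduce to simple comparisons. This is a hypothetical sketch of the arithmetic only, not CertOps' implementation; the function names and outcome labels are assumptions:

```python
def drift_ok(baseline_score, candidate_score, max_drift):
    """Deterministic tier: candidate may not drop more than max_drift below baseline."""
    return (baseline_score - candidate_score) <= max_drift

def loss_rate_ok(pairwise_outcomes, max_loss_rate):
    """Pairwise tier: fraction of head-to-head losses must stay within max_loss_rate."""
    losses = sum(1 for o in pairwise_outcomes if o == "baseline_wins")
    return losses / len(pairwise_outcomes) <= max_loss_rate

# Baseline 0.95, candidate 0.85, max_drift 0.1 -> exactly at the limit, passes.
print(drift_ok(0.95, 0.85, 0.1))  # → True
# Candidate loses 2 of 10 head-to-head matches (20%) with max_loss_rate 0.3 -> passes.
print(loss_rate_ok(["candidate_wins"] * 8 + ["baseline_wins"] * 2, 0.3))  # → True
```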