Evaluation & Regression Config

Within your target definition in the certops.yaml manifest, you configure how CertOps grades the responses it receives from your AI component.

CertOps utilizes a two-tiered gating system:

  1. Pointwise Evaluation: Grading individual responses in isolation against expected criteria.
  2. Regression (Drift): Comparing current performance against a historical known-good baseline.

1. Pointwise Evaluation Block

The evaluation block defines the absolute criteria your target must meet for a single run.

    evaluation:
      # Map dataset columns -> metric prompt variables
      metrics_mapping:
        input: "user_query"
        reference: "expected_document"

      # Deterministic local tier
      deterministic:
        - metric: "cosine-similarity"
          threshold: 0.80
          operator: "gte"
          blocking: true

      # LLM judge per-sample tier
      llm:
        - metric: "hallucination"
          threshold: 1.0
          operator: "gte"
          blocking: true
        - metric: "answer-relevance"
          threshold: 0.8
          operator: "gte"
          blocking: false

The blocking Flag (Hard vs Soft Gates)

  • blocking: true: This metric is a Hard Gate. If the dataset-average score fails the threshold check (per the configured operator), the Component Verdict (and, by extension, the Suite) is REJECTED, and the CLI exits with a non-zero exit code.
  • blocking: false: This metric is a Soft Gate. If it fails, CertOps logs a warning in the dashboard, but the overall run can still be marked APPROVED.
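CertOps's internals are not documented here, but the hard/soft gate behavior described above can be sketched in Python. All names (GateResult, component_verdict) and the sample scores are illustrative assumptions, not the actual CertOps API:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    metric: str
    avg_score: float   # average across the dataset
    threshold: float
    operator: str      # "gte" or "lte"
    blocking: bool     # True = hard gate, False = soft gate

def passes(r: GateResult) -> bool:
    # Apply the configured comparison to the dataset-average score
    return r.avg_score >= r.threshold if r.operator == "gte" else r.avg_score <= r.threshold

def component_verdict(results: list[GateResult]) -> str:
    # Any failing hard gate rejects the component; failing soft
    # gates only produce warnings and do not change the verdict.
    verdict = "APPROVED"
    for r in results:
        if not passes(r):
            if r.blocking:
                verdict = "REJECTED"
            else:
                print(f"WARNING: soft gate failed: {r.metric} "
                      f"({r.avg_score} vs {r.threshold})")
    return verdict

print(component_verdict([
    GateResult("cosine-similarity", 0.83, 0.80, "gte", True),   # hard gate, passes
    GateResult("answer-relevance", 0.74, 0.80, "gte", False),   # soft gate, fails -> warning only
]))  # APPROVED
```

The key point: a failing soft gate is surfaced but never flips the verdict, while a single failing hard gate rejects the whole component.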

The metrics_mapping Object

Because datasets are decoupled from metrics, you must explicitly tell CertOps which dataset columns should be piped into the Jinja2 variables required by your chosen metrics.

If your chosen metric (e.g., hallucination) requires an {{ input }} variable but your CSV has a column named user_query, you map it here (input: "user_query"). (Note: the {{ output }} variable is injected automatically by CertOps using the parsed response_path from your target.)
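Conceptually, the mapping just renames dataset columns into the variables the metric prompt expects before rendering. A minimal sketch, assuming a hypothetical map_row helper and the column names from the example above:

```python
def map_row(row: dict, metrics_mapping: dict, parsed_output: str) -> dict:
    # metrics_mapping is {prompt_variable: dataset_column}, as in the YAML above
    variables = {var: row[col] for var, col in metrics_mapping.items()}
    # {{ output }} is injected automatically from the target's response_path
    variables["output"] = parsed_output
    return variables

row = {"user_query": "What is the SLA?", "expected_document": "sla.md"}
mapping = {"input": "user_query", "reference": "expected_document"}
tpl_vars = map_row(row, mapping, parsed_output="The SLA is 99.9% uptime.")

print(tpl_vars["input"])      # What is the SLA?
print(tpl_vars["reference"])  # sla.md
```

The resulting dict is what gets fed into the metric's Jinja2 prompt template as {{ input }}, {{ reference }}, and {{ output }}.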

2. Regression Block

The regression block mirrors the evaluation block, but instead of grading against absolute thresholds, it compares current scores against a Baseline (usually representing the version currently live in Production).

    regression:
      baseline: "prod"

      # Average drift tier
      deterministic:
        - metric: "cosine-similarity"
          max_drift: 0.05   # Avg run score cannot drop more than 5% vs baseline
        - metric: "hallucination"
          max_drift: 0.1

      # Pairwise LLM vs baseline output tier
      metrics_mapping:
        input: "question"
      llm:
        - metric: "pairwise-general-quality"
          max_loss_rate: 0.3   # Candidate can't lose > 30% of individual matches
          blocking: true

  • max_drift: In the deterministic/pointwise tier, this prevents silent degradation. If your baseline hallucination score was 0.95, a max_drift of 0.1 means your new candidate build cannot score lower than 0.85.
  • max_loss_rate: In the pairwise LLM tier, the judge configured via judge_model_config compares the old prod output and the new candidate output side by side for each sample. If the new candidate loses more than 30% of those head-to-head comparisons, the run is rejected.
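Both regression checks reduce to simple arithmetic. A sketch of the two rules, using the numbers from the bullets above (function names are illustrative, not CertOps APIs):

```python
def drift_check(baseline_avg: float, candidate_avg: float, max_drift: float) -> bool:
    # Candidate may not drop more than max_drift below the baseline average
    return (baseline_avg - candidate_avg) <= max_drift

def loss_rate_check(pairwise_verdicts: list[str], max_loss_rate: float) -> bool:
    # pairwise_verdicts: one judge outcome per sample, e.g. "win" / "tie" / "loss"
    losses = pairwise_verdicts.count("loss")
    return losses / len(pairwise_verdicts) <= max_loss_rate

# Baseline hallucination averaged 0.95; candidate scores 0.87 with max_drift 0.1 -> passes
print(drift_check(0.95, 0.87, 0.1))  # True

# Candidate loses 2 of 10 head-to-head comparisons with max_loss_rate 0.3 -> passes
print(loss_rate_check(["win"] * 6 + ["tie"] * 2 + ["loss"] * 2, 0.3))  # True
```

Note that ties count against neither side here; only outright losses consume the loss-rate budget.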