Evaluation & Regression Config

Within your target definition in the certops.yaml manifest, you configure how CertOps grades the responses it receives from your AI component.

CertOps utilizes a two-tiered gating system:

  1. Pointwise Evaluation: Grading individual responses in isolation against expected criteria.
  2. Regression (Drift): Comparing current performance against a historical known-good baseline.

1. Pointwise Evaluation Block

The evaluation block defines the absolute criteria your target must meet for a single run.

    evaluation:
      # Map dataset columns -> metric prompt variables
      metrics_mapping:
        input: "user_query"
        reference: "expected_document"

      # Deterministic local tier
      deterministic:
        - metric: "cosine-similarity"
          threshold: 0.80
          operator: "gte"
          blocking: true

      # LLM judge per-sample tier
      llm:
        - metric: "hallucination"
          threshold: 1.0
          operator: "gte"
          blocking: true
        - metric: "answer-relevance"
          threshold: 0.8
          operator: "gte"
          blocking: false

The blocking Flag (Hard vs Soft Gates)

  • blocking: true: This metric is a Hard Gate. If the dataset-average score fails the threshold check (per the configured operator), the Component Verdict (and, by extension, the Suite) is REJECTED, and the CLI exits with a non-zero exit code.
  • blocking: false: This metric is a Soft Gate. If it fails, CertOps logs a warning in the dashboard, but the overall run can still be marked APPROVED.
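CertOps's internals are not documented here, but the hard/soft gate behavior described above can be sketched in Python. All names (GateResult, component_verdict) and the sample scores are illustrative assumptions, not the actual CertOps API:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    metric: str
    avg_score: float   # average across the dataset
    threshold: float
    operator: str      # "gte" or "lte"
    blocking: bool     # True = hard gate, False = soft gate

def passes(r: GateResult) -> bool:
    # Apply the configured comparison to the dataset-average score
    return r.avg_score >= r.threshold if r.operator == "gte" else r.avg_score <= r.threshold

def component_verdict(results: list[GateResult]) -> str:
    # Any failing hard gate rejects the component; failing soft
    # gates only produce warnings and do not change the verdict.
    verdict = "APPROVED"
    for r in results:
        if not passes(r):
            if r.blocking:
                verdict = "REJECTED"
            else:
                print(f"WARNING: soft gate failed: {r.metric} "
                      f"({r.avg_score} vs {r.threshold})")
    return verdict

print(component_verdict([
    GateResult("cosine-similarity", 0.83, 0.80, "gte", True),   # hard gate, passes
    GateResult("answer-relevance", 0.74, 0.80, "gte", False),   # soft gate, fails -> warning only
]))  # APPROVED
```

The key point: a failing soft gate is surfaced but never flips the verdict, while a single failing hard gate rejects the whole component.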

The metrics_mapping Object

Because datasets are decoupled from metrics, you must explicitly tell CertOps which dataset columns should be piped into the Jinja2 variables required by your chosen metrics.

If your chosen metric (e.g., hallucination) requires an {{ input }} variable but your CSV has a column named user_query, you map it here (input: "user_query"). (Note: the {{ output }} variable is injected automatically by CertOps using the parsed response_path from your target.)
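Conceptually, the mapping just renames dataset columns into the variables the metric prompt expects before rendering. A minimal sketch, assuming a hypothetical map_row helper and the column names from the example above:

```python
def map_row(row: dict, metrics_mapping: dict, parsed_output: str) -> dict:
    # metrics_mapping is {prompt_variable: dataset_column}, as in the YAML above
    variables = {var: row[col] for var, col in metrics_mapping.items()}
    # {{ output }} is injected automatically from the target's response_path
    variables["output"] = parsed_output
    return variables

row = {"user_query": "What is the SLA?", "expected_document": "sla.md"}
mapping = {"input": "user_query", "reference": "expected_document"}
tpl_vars = map_row(row, mapping, parsed_output="The SLA is 99.9% uptime.")

print(tpl_vars["input"])      # What is the SLA?
print(tpl_vars["reference"])  # sla.md
```

The resulting dict is what gets fed into the metric's Jinja2 prompt template as {{ input }}, {{ reference }}, and {{ output }}.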

2. Regression Block

The regression block mirrors the evaluation block, but instead of grading against absolute thresholds, it compares current scores against a Baseline (usually representing the version currently live in Production).

    regression:
      baseline: "prod"

      # Average drift tier
      deterministic:
        - metric: "cosine-similarity"
          max_drift: 0.05   # Avg run score cannot drop more than 5% vs baseline
        - metric: "hallucination"
          max_drift: 0.1

      # Pairwise LLM vs baseline output tier
      metrics_mapping:
        input: "question"
      llm:
        - metric: "pairwise-general-quality"
          max_loss_rate: 0.3   # Candidate can't lose > 30% of individual matches
          blocking: true

  • max_drift: In the deterministic/pointwise tier, this prevents silent degradation. If your baseline hallucination score was 0.95, a max_drift of 0.1 means your new candidate build cannot score lower than 0.85.
  • max_loss_rate: In the pairwise LLM tier, the judge configured via judge_model_config compares the old prod output and the new candidate output side by side for each sample. If the new candidate loses more than 30% of those head-to-head comparisons, the run is rejected.
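Both regression checks reduce to simple arithmetic. A sketch of the two rules, using the numbers from the bullets above (function names are illustrative, not CertOps APIs):

```python
def drift_check(baseline_avg: float, candidate_avg: float, max_drift: float) -> bool:
    # Candidate may not drop more than max_drift below the baseline average
    return (baseline_avg - candidate_avg) <= max_drift

def loss_rate_check(pairwise_verdicts: list[str], max_loss_rate: float) -> bool:
    # pairwise_verdicts: one judge outcome per sample, e.g. "win" / "tie" / "loss"
    losses = pairwise_verdicts.count("loss")
    return losses / len(pairwise_verdicts) <= max_loss_rate

# Baseline hallucination averaged 0.95; candidate scores 0.87 with max_drift 0.1 -> passes
print(drift_check(0.95, 0.87, 0.1))  # True

# Candidate loses 2 of 10 head-to-head comparisons with max_loss_rate 0.3 -> passes
print(loss_rate_check(["win"] * 6 + ["tie"] * 2 + ["loss"] * 2, 0.3))  # True
```

Note that ties count against neither side here; only outright losses consume the loss-rate budget.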