Metrics Overview

Metrics are the core grading criteria CertOps uses to evaluate the quality of your AI system's outputs.

When you configure a test in your certops.yaml, you map dataset columns to specific metrics. CertOps then runs these metrics against the output produced by your Target.
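As a hedged illustration of that mapping, here is a hypothetical certops.yaml sketch. The key names (tests, dataset, metrics, and the column names) are illustrative assumptions for this example, not the documented CertOps schema — consult the configuration reference for the real field names.

```yaml
# Hypothetical sketch only: key and column names are illustrative,
# not the documented CertOps schema.
tests:
  - name: faq-quality
    dataset: data/faq_eval.csv
    metrics:
      - name: rouge_l              # deterministic tier, scored locally
        reference_column: expected_answer
      - name: coherence            # pointwise tier (LLM-as-Judge)
        input_column: question
```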

CertOps provides over 30 built-in metrics categorized into three distinct tiers based on how they operate and what they measure.

The Three Tiers of Metrics

1. Deterministic Metrics

Deterministic metrics are computed locally using rule-based checks or fast non-LLM scorers, such as lexical overlap measures (like ROUGE-L) or embedding-based Cosine Similarity.

  • Cost: Free (Zero LLM Tokens).
  • Use Case: Checking strict structural rules (like JSON validity), exact lexical matching, or basic semantic similarity against a known reference.
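To make the tier concrete, here is a minimal sketch of two deterministic checks: a JSON-validity gate and a bag-of-words cosine similarity (a crude stand-in for embedding-based similarity, which would use dense vectors instead of word counts). The function names are illustrative, not the CertOps API.

```python
import json
import math
from collections import Counter

def json_validity(output: str) -> bool:
    """Structural check: does the model output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def cosine_similarity(candidate: str, reference: str) -> float:
    """Bag-of-words cosine similarity between candidate and reference.

    A real deterministic metric would typically compute this over
    embedding vectors; word counts keep the sketch self-contained.
    """
    a = Counter(candidate.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because everything here is local arithmetic, these checks cost zero LLM tokens and run in microseconds per row.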

2. Pointwise Metrics (Evaluation)

Pointwise metrics use an advanced "LLM-as-Judge" to evaluate a single response in isolation.

  • Cost: Requires an LLM call.
  • Use Case: Evaluating subjective or complex criteria that traditional code cannot handle, such as Hallucinations, Tone Consistency, or Coherence.
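The pointwise pattern boils down to two steps: format a grading prompt for the judge model, then parse a score out of its reply. The sketch below shows both steps with a hypothetical 1–5 rubric; the prompt wording, function names, and score scale are assumptions for illustration, not what CertOps sends to its judge.

```python
import re
from typing import Optional

def build_pointwise_prompt(criterion: str, user_input: str, response: str) -> str:
    """Format a single-response grading prompt for an LLM judge.

    Hypothetical rubric: rate one criterion on a 1-5 scale.
    """
    return (
        f"You are an impartial judge. Rate the RESPONSE on '{criterion}' "
        "from 1 (poor) to 5 (excellent). Reply with the score first.\n"
        f"INPUT: {user_input}\n"
        f"RESPONSE: {response}"
    )

def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the first 1-5 digit from the judge's reply; None if absent."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None
```

The prompt string would be sent to the judge LLM (hence the per-row token cost of this tier); the parser guards against malformed replies rather than assuming the judge always follows instructions.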

3. Pairwise Metrics (Regression)

Pairwise metrics use an LLM-as-Judge to compare two responses side-by-side (usually a new candidate response vs. a known baseline response).

  • Cost: Requires an LLM call.
  • Use Case: Determining whether a new prompt or model swap actually improved the system. Useful for measuring shifts in Formality, Decisiveness, or General Quality.
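The pairwise flow mirrors the pointwise one, except the judge sees both responses at once and returns a verdict rather than a score. The sketch below is an illustrative assumption of that shape (prompt wording, function names, and the A/B/TIE verdict format are not the CertOps internals).

```python
from typing import Optional

def build_pairwise_prompt(criterion: str, user_input: str,
                          baseline: str, candidate: str) -> str:
    """Format a side-by-side comparison prompt for an LLM judge."""
    return (
        f"You are an impartial judge. Compare the two responses below on "
        f"'{criterion}'.\n"
        f"INPUT: {user_input}\n"
        f"RESPONSE A (baseline): {baseline}\n"
        f"RESPONSE B (candidate): {candidate}\n"
        "Answer with exactly one of: A, B, TIE."
    )

def parse_verdict(judge_reply: str) -> Optional[str]:
    """Extract the judge's verdict; None if the reply is malformed."""
    stripped = judge_reply.strip()
    if not stripped:
        return None
    token = stripped.split()[0].strip(".,").upper()
    return token if token in {"A", "B", "TIE"} else None
```

One known caveat with this pattern: LLM judges exhibit position bias, so harnesses commonly run each comparison twice with A and B swapped and keep only consistent verdicts.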