Custom Metrics

While CertOps provides over 30 built-in "System" metrics, every AI use case is unique. You will frequently encounter scenarios where you need to evaluate highly specific organizational policies, complex routing rules, or deeply nuanced brand personas.

CertOps allows you to easily create your own Custom LLM-as-Judge Metrics via the Dashboard UI or the REST API. You can create both Pointwise (Evaluation) and Pairwise (Regression) custom metrics.
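As an illustration of what a metric definition carries, a Pointwise metric created over the REST API might send a body like the sketch below. Note that these field names are assumptions for illustration, not the authoritative CertOps API schema; consult the API reference for the exact payload.

```json
{
  "name": "Brand Tone Check",
  "description": "Checks replies against our brand persona.",
  "type": "pointwise",
  "min_score": 1,
  "max_score": 5,
  "pass_threshold": 4,
  "prompt_template": "Grade the reply for {{ user_persona }} ..."
}
```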

Creating a Custom Metric

To create a new custom metric in the Dashboard, navigate to the Metrics screen and click + New Metric.

You will need to configure three primary sections:

1. Basic Information

  • Metric Name: A human-readable identifier (e.g., Brand Tone Check). The system will automatically generate a kebab-case ID (e.g., brand-tone-check) that you will use in your certops.yaml.
  • Description: A brief explanation of what the metric evaluates, helping your team understand its purpose.

2. Score Configuration

Unlike testing frameworks that enforce a rigid Pass/Fail binary, CertOps lets you define the exact quantitative scale your metric operates on:

  • Minimum Score: The absolute lowest possible score (e.g., 0, 1, or -1).
  • Maximum Score: The absolute highest possible score (e.g., 1, 5, or 100).
  • Pass Threshold: (Optional) The cut-off score required for the evaluation to be considered "Successful" or "Passing" within a CI/CD Quality Gate.
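The interaction of these three fields can be sketched in a few lines. This is a minimal illustration of how a quality gate might apply the threshold, not CertOps internals; the dict shape below is an assumption mirroring the fields above.

```python
def passes_gate(score: float, metric_config: dict) -> bool:
    """Return True if the judge's score clears the configured pass threshold."""
    lo = metric_config["min_score"]
    hi = metric_config["max_score"]
    if not lo <= score <= hi:
        raise ValueError(f"Score {score} outside configured range [{lo}, {hi}]")
    threshold = metric_config.get("pass_threshold")
    # With no threshold configured, the metric is informational only.
    if threshold is None:
        return True
    return score >= threshold

# Example: a 1-5 scale where 4 or above passes the gate.
brand_tone = {"min_score": 1, "max_score": 5, "pass_threshold": 4}
```

A score equal to the threshold passes; whether CertOps treats the threshold as inclusive is worth confirming for your deployment.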

3. Prompt Template (Jinja2)

This is the core instruction set sent to the LLM Judge. It contains your scoring rubric, edge cases, and the criteria the LLM must follow to grade the output consistently.

CertOps uses Jinja2 templating to inject the evaluation data dynamically. You are not restricted to a fixed set of variables.

You can define any arbitrary variable names you need by wrapping them in double curly braces (e.g., {{user_persona}}, {{strict_rules}}, {{expected_length}}).
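For example, a brand-tone template might look like the sketch below. The variable names ({{user_persona}}, {{strict_rules}}, {{model_output}}) are arbitrary choices for illustration; you would map each one to a dataset column in your manifest.

```jinja
You are grading an assistant's reply for adherence to our brand persona.

Persona: {{ user_persona }}
Rules the reply must follow: {{ strict_rules }}

Assistant reply to grade:
{{ model_output }}

Score the reply from 1 (off-brand) to 5 (perfectly on-brand).
Provide ONLY a JSON object: {"score": <number>, "reasoning": "<explanation>"}
```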

When you configure your test in the certops.yaml manifest, you will map these custom Jinja variables directly to the column names in your Dataset. This allows your metrics to be highly specialized without forcing your datasets into a rigid schema.
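A hedged sketch of what that mapping might look like in certops.yaml is shown below. The key names here (tests, metrics, variables, and the column names) are assumptions for illustration, not the authoritative manifest schema.

```yaml
# Hypothetical manifest sketch -- key names are illustrative assumptions.
tests:
  - dataset: brand_replies.csv
    metrics:
      - id: brand-tone-check          # kebab-case ID generated by the Dashboard
        variables:
          user_persona: persona_column   # Jinja variable -> dataset column
          strict_rules: rules_column
          model_output: reply_column
```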

Tip! When writing your prompt template, always instruct the LLM to output a strictly structured response containing the numerical score. For example: Provide ONLY a JSON object: {"score": <number>, "reasoning": "<explanation>"}.
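One reason to demand strictly structured output is that the score can then be extracted with a plain JSON parse rather than fragile text matching. A minimal sketch (the helper name and return shape are assumptions, not a CertOps API):

```python
import json

def parse_judge_response(raw: str) -> tuple[float, str]:
    """Extract the score and reasoning from a strictly structured judge reply."""
    payload = json.loads(raw)
    return float(payload["score"]), payload["reasoning"]

# Example judge reply following the JSON-only instruction above.
reply = '{"score": 4, "reasoning": "Tone matches the persona, minor slang."}'
score, reasoning = parse_judge_response(reply)
```

If the judge wraps the JSON in extra prose despite the instruction, `json.loads` raises a `JSONDecodeError`, which makes malformed judge output easy to detect and retry.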