Pairwise (Regression) Metrics

Pairwise Metrics (also known as regression metrics) use an LLM-as-Judge to evaluate two distinct responses side-by-side.

Typically, this compares a new response produced with your latest code changes (the candidate) against a known-good past response (the baseline).

Instead of assigning an absolute score (like "5/5 Coherence"), pairwise metrics look for shifts in behavior between the two responses and output `-1`, `0`, or `1`.

| Output Score | Meaning |
| --- | --- |
| `-1` | The baseline (old version) is better or exhibits more of the trait. |
| `0` | Tie. The old and new versions are effectively the same regarding this trait. |
| `1` | The candidate (new version) is better or exhibits more of the trait. |
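
Because each comparison yields only -1, 0, or 1, a batch of results is easiest to read as counts plus an average. The following is a minimal sketch of one way to do that. The function name `summarize_pairwise` and the `net_shift` field are hypothetical and not part of any tool described here.

```python
from collections import Counter

VALID_SCORES = (-1, 0, 1)

def summarize_pairwise(scores):
    """Summarize a list of pairwise scores (hypothetical helper)."""
    invalid = [s for s in scores if s not in VALID_SCORES]
    if invalid:
        raise ValueError(f"unexpected pairwise scores: {invalid}")
    counts = Counter(scores)
    total = len(scores)
    return {
        "baseline_wins": counts[-1],   # score -1: baseline better / more of the trait
        "ties": counts[0],             # score 0: effectively the same
        "candidate_wins": counts[1],   # score 1: candidate better / more of the trait
        # Mean score in [-1, 1]; positive means the candidate tends to win.
        "net_shift": sum(scores) / total if total else 0.0,
    }

summary = summarize_pairwise([1, 0, -1, 1, 1])
print(summary)
```

A positive `net_shift` only means the candidate shows more of the trait. Whether that is desirable depends on the metric: more verbosity or more apologies may be a regression rather than an improvement.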

Available Pairwise Metrics

`pairwise-general-quality`

Description: A holistic assessment. Which response is better overall based on accuracy, helpfulness, and clarity?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-decisiveness`

Description: Which response is more decisive and less cautious (less hedging, fewer "maybes")?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-refusal`

Description: Identifies whether one response refused the user's request (e.g., "I cannot answer that") while the other answered it.
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-verbosity`

Description: Which response is longer or more verbose?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-instruction-following`

Description: Which response followed the system instructions and constraints more rigorously?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-formality`

Description: Which response is more formal, academic, or professional in tone?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-ambiguity-handling`

Description: When faced with a vague input, which response correctly asked for clarification instead of wrongly guessing the intent?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-apologetic`

Description: Identifies which response is more apologetic or subservient.
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-brand-voice`

Description: Which response sounds more aligned with a professional, innovative, and customer-centric brand voice?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-empathy`

Description: Which response demonstrates superior emotional intelligence and empathy towards the user's situation?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-simplicity`

Description: Which response explains complex concepts more simply, avoiding unnecessary technical jargon?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

`pairwise-actionability`

Description: Which response gives the user clearer next steps or calls to action?
Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}
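
Every metric above requires the same three variables. As an illustration of how `{{input}}`, `{{output_baseline}}`, and `{{output_candidate}}` might be filled into a judge prompt, here is a minimal sketch. The template wording and the `render_template` helper are assumptions made for this example, not the actual prompts or API behind these metrics.

```python
import re

REQUIRED_VARIABLES = ("input", "output_baseline", "output_candidate")

def render_template(template, values):
    """Fill {{name}} placeholders in a template (hypothetical helper)."""
    missing = [name for name in REQUIRED_VARIABLES if name not in values]
    if missing:
        raise KeyError(f"missing required variables: {missing}")

    def substitute(match):
        name = match.group(1)
        # Placeholders with no supplied value are left untouched.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# Hypothetical judge template; the real prompt text is not shown in this document.
template = (
    "User input: {{input}}\n"
    "Baseline response: {{output_baseline}}\n"
    "Candidate response: {{output_candidate}}\n"
    "Answer with -1, 0, or 1."
)

prompt = render_template(template, {
    "input": "How do I reset my password?",
    "output_baseline": "You might try the settings page, maybe.",
    "output_candidate": "Open Settings, select Security, then click Reset Password.",
})
print(prompt)
```

Checking for all three variables up front avoids sending the judge a prompt with an unfilled placeholder, which would make the comparison meaningless.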
