Pairwise (Regression) Metrics
Pairwise Metrics (also known as regression metrics) use an LLM-as-Judge to evaluate two distinct responses side-by-side.
Typically, the comparison is between a new response generated with your latest code changes (the candidate) and a known-good past response (the baseline).
Instead of assigning an absolute score (like "5/5 Coherence"), pairwise metrics look for shifts in behavior between the two responses and output -1, 0, or 1.
| Output Score | Meaning |
|---|---|
| -1 | The Baseline (Old version) is better or exhibits more of the trait. |
| 0 | Tie. The old and new versions are effectively the same regarding this trait. |
| 1 | The Candidate (New version) is better or exhibits more of the trait. |
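Because each comparison yields only -1, 0, or 1, results are easiest to read when summarized across a set of test cases. The following is a minimal sketch of one way to do that; the function name `summarize_pairwise` and the `net_shift` field are illustrative assumptions, not part of any documented API.

```python
from collections import Counter


def summarize_pairwise(scores):
    """Summarize pairwise scores, where each score is -1, 0, or 1.

    Hypothetical helper: the name and the returned fields are
    assumptions made for illustration only.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    for score in scores:
        if score not in (-1, 0, 1):
            raise ValueError(f"unexpected pairwise score: {score!r}")

    counts = Counter(scores)
    return {
        "baseline_wins": counts[-1],   # score -1: baseline better / more of the trait
        "ties": counts[0],             # score 0: effectively the same
        "candidate_wins": counts[1],   # score 1: candidate better / more of the trait
        # Mean score in [-1, 1]; positive values lean toward the candidate.
        "net_shift": sum(scores) / len(scores),
    }


summary = summarize_pairwise([1, 0, -1, 1, 1])
print(summary)
```

For trait-style metrics such as verbosity or formality, a positive `net_shift` only means the candidate shows more of the trait, which is not necessarily an improvement.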
Available Pairwise Metrics
pairwise-general-quality
- Description: A holistic assessment. Which response is better overall based on accuracy, helpfulness, and clarity?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-decisiveness
- Description: Which response is more decisive and less cautious (less hedging, fewer "maybes")?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-refusal
- Description: Identifies whether one response refused the user's request (e.g., "I cannot answer that") while the other answered it.
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-verbosity
- Description: Which response is longer or more verbose?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-instruction-following
- Description: Which response followed the system instructions and constraints more rigorously?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-formality
- Description: Which response is more formal, academic, or professional in tone?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-ambiguity-handling
- Description: When faced with a vague input, which response correctly asked for clarification instead of wrongly guessing the intent?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-apologetic
- Description: Identifies which response is more apologetic or subservient.
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-brand-voice
- Description: Which response sounds more aligned with a professional, innovative, and customer-centric brand voice?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-empathy
- Description: Which response demonstrates superior emotional intelligence and empathy towards the user's situation?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-simplicity
- Description: Which response explains complex concepts more simply, avoiding unnecessary technical jargon?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
pairwise-actionability
- Description: Which response gives the user clearer next steps or calls to action?
- Variables Required:
{{input}},{{output_baseline}},{{output_candidate}}
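Every metric listed above requires the same three variables. As a rough illustration of how those `{{...}}` placeholders might be filled before a judge prompt is sent, here is a small sketch. The template text, the `render_prompt` helper, and the sample values are all hypothetical; only the three variable names come from the list above.

```python
import re

# The three variables every pairwise metric in this section requires.
REQUIRED_VARIABLES = ("input", "output_baseline", "output_candidate")

# Hypothetical judge prompt, written only to demonstrate the placeholders.
EXAMPLE_TEMPLATE = (
    "User input:\n{{input}}\n\n"
    "Baseline response:\n{{output_baseline}}\n\n"
    "Candidate response:\n{{output_candidate}}\n\n"
    "Answer with -1, 0, or 1."
)


def render_prompt(template, variables):
    """Replace each {{name}} placeholder with its value from `variables`."""
    missing = [name for name in REQUIRED_VARIABLES if name not in variables]
    if missing:
        raise KeyError(f"missing required variables: {missing}")
    # A placeholder with no matching key raises KeyError instead of
    # being silently left in the prompt.
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


prompt = render_prompt(
    EXAMPLE_TEMPLATE,
    {
        "input": "How do I reset my password?",
        "output_baseline": "Go to Settings, then Security, then Reset Password.",
        "output_candidate": "You may be able to reset it from the Settings page.",
    },
)
print(prompt)
```

Checking for missing variables up front keeps a half-filled prompt from reaching the judge, where it could produce a misleading score.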