Pairwise (Regression) Metrics

Pairwise Metrics (also known as regression metrics) use an LLM-as-Judge to evaluate two distinct responses side-by-side.

Typically, this compares a new response generated with your latest code changes (the candidate) against a known-good past response (the baseline).

Instead of an absolute score (like "5/5 Coherence"), pairwise metrics look for shifts in behavior between the two responses and output -1, 0, or 1.

| Output Score | Meaning |
| --- | --- |
| -1 | The Baseline (old version) is better or exhibits more of the trait. |
| 0 | Tie. The old and new versions are effectively the same regarding this trait. |
| 1 | The Candidate (new version) is better or exhibits more of the trait. |
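
Because each comparison yields only -1, 0, or 1, results across a test set are easy to aggregate. The following is a minimal sketch of one way to do that, not part of any documented API: the function name `summarize_pairwise` and the `net_shift` field are hypothetical names chosen for illustration.

```python
from collections import Counter

def summarize_pairwise(scores):
    """Summarize pairwise scores (-1, 0, or 1) collected over a test set.

    Hypothetical helper: the name and the returned keys are assumptions,
    not something defined by the metrics themselves.
    """
    invalid = [s for s in scores if s not in (-1, 0, 1)]
    if invalid:
        raise ValueError(f"unexpected pairwise scores: {invalid}")

    counts = Counter(scores)
    total = len(scores)
    return {
        "baseline_wins": counts[-1],   # baseline better / more of the trait
        "ties": counts[0],             # no meaningful difference
        "candidate_wins": counts[1],   # candidate better / more of the trait
        # Mean score: negative leans baseline, positive leans candidate.
        "net_shift": sum(scores) / total if total else 0.0,
    }

if __name__ == "__main__":
    print(summarize_pairwise([1, 0, -1, 1, 1]))
```

A `net_shift` near zero with many non-zero scores means wins and losses cancel out, so it is worth reading the win, tie, and loss counts alongside the mean.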

Available Pairwise Metrics

pairwise-general-quality

  • Description: A holistic assessment. Which response is better overall based on accuracy, helpfulness, and clarity?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-decisiveness

  • Description: Which response is more decisive and less cautious (less hedging, fewer "maybes")?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-refusal

  • Description: Identifies whether one response refused the user's request (e.g., "I cannot answer that") while the other answered it.
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-verbosity

  • Description: Which response is longer or more verbose?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}
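
For a trait metric like this one, a score of 1 means the candidate exhibits more of the trait, not that it is better. The sketch below illustrates that sign convention with a plain word-count comparison. It is a deterministic heuristic swapped in for illustration, not how pairwise-verbosity is implemented (the metric uses an LLM judge); the function name and the `tolerance` parameter are hypothetical.

```python
def verbosity_heuristic(output_baseline, output_candidate, tolerance=0.1):
    """Word-count stand-in for a verbosity comparison (NOT the LLM judge).

    Returns 1 if the candidate is longer, -1 if the baseline is longer,
    and 0 if the lengths differ by no more than `tolerance` (an assumed
    relative threshold) of the longer response.
    """
    base_len = len(output_baseline.split())
    cand_len = len(output_candidate.split())
    longest = max(base_len, cand_len)

    # Treat two empty responses, or a small relative difference, as a tie.
    if longest == 0 or abs(cand_len - base_len) / longest <= tolerance:
        return 0
    return 1 if cand_len > base_len else -1

if __name__ == "__main__":
    print(verbosity_heuristic("Yes.", "Yes, that is correct and here is why it works."))
```

A cheap check like this can serve as a sanity reference when reviewing judge outputs, since length is one of the few traits in this list that can be measured directly.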

pairwise-instruction-following

  • Description: Which response followed the system instructions and constraints more rigorously?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-formality

  • Description: Which response is more formal, academic, or professional in tone?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-ambiguity-handling

  • Description: When faced with a vague input, which response correctly asked for clarification instead of wrongly guessing the intent?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-apologetic

  • Description: Identifies which response is more apologetic or subservient.
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-brand-voice

  • Description: Which response sounds more aligned with a professional, innovative, and customer-centric brand voice?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-empathy

  • Description: Which response demonstrates superior emotional intelligence and empathy towards the user's situation?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-simplicity

  • Description: Which response explains complex concepts more simply, avoiding unnecessary technical jargon?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}

pairwise-actionability

  • Description: Which response provides clearer next steps or calls to action for the user?
  • Variables Required: {{input}}, {{output_baseline}}, {{output_candidate}}
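
Every metric above requires the same three variables. As a rough sketch of how such placeholders could be filled before a prompt is sent to the judge, the example below substitutes `{{input}}`, `{{output_baseline}}`, and `{{output_candidate}}` into a template. The template text, `EXAMPLE_TEMPLATE`, and `render_template` are all hypothetical; the actual judge prompts and rendering mechanism are not shown in this document.

```python
import re

REQUIRED_VARIABLES = ("input", "output_baseline", "output_candidate")

# Hypothetical judge prompt, written only for this illustration.
EXAMPLE_TEMPLATE = (
    "User input:\n{{input}}\n\n"
    "Baseline response:\n{{output_baseline}}\n\n"
    "Candidate response:\n{{output_candidate}}\n\n"
    "Answer with -1, 0, or 1."
)

def render_template(template, values):
    """Fill {{name}} placeholders, failing early if a required one is missing."""
    missing = [name for name in REQUIRED_VARIABLES if name not in values]
    if missing:
        raise KeyError(f"missing required variables: {missing}")

    def substitute(match):
        name = match.group(1)
        # Leave placeholders we have no value for untouched.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

if __name__ == "__main__":
    prompt = render_template(
        EXAMPLE_TEMPLATE,
        {
            "input": "How do I reset my password?",
            "output_baseline": "Go to Settings, then Security, then Reset.",
            "output_candidate": "You can reset it under Settings > Security.",
        },
    )
    print(prompt)
```

Validating the required variables up front makes a missing baseline or candidate fail loudly instead of sending the judge a prompt with an unfilled placeholder.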