Pointwise (Evaluation) Metrics
Pointwise Metrics (also known as standard evaluation metrics) operate on a single Target response in isolation. They use the LLM configured in your certops.yaml (via the judge_model_config_id) as the "Judge" to evaluate qualitative factors that traditional code-based checks cannot.
These metrics evaluate the output against absolute baseline criteria driven by the metric's system prompt and scoring rubric.
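The wiring described above might look like the following certops.yaml sketch. Only judge_model_config_id comes from this page; every other key (model_configs, metrics, threshold, variables) is an assumed placeholder for whatever schema your installation actually uses.

```yaml
# Sketch of a certops.yaml excerpt — illustrative, not the definitive schema.
model_configs:
  - id: gpt4-judge          # assumed id; referenced below
    provider: openai         # assumed fields
    model: gpt-4o

metrics:
  - name: hallucination
    judge_model_config_id: gpt4-judge   # the LLM acting as the "Judge"
    threshold: 1.0                      # pass/fail cutoff from the table below
    variables:                          # template variables this metric requires
      - context
      - output
```

Each metric entry would pair a scoring rubric (driven by the metric's system prompt) with the threshold and required template variables listed in the reference below.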
Available Pointwise Metrics
hallucination
- Description: Evaluates if the output is supported only by the provided context. Penalizes responses that fabricate information ("hallucinate") beyond what was retrieved.
- Score Range: 0 (Hallucinated), 0.5 (Partially Supported), 1 (Fully Faithful)
- Threshold: 1.0
- Variables Required: {{context}}, {{output}}
answer-relevance
- Description: Evaluates if the output directly addresses the user's input query, regardless of factual correctness. Penalizes tangential or irrelevant responses.
- Score Range: 0 (Irrelevant), 0.5 (Tangential), 1 (Relevant)
- Threshold: 1.0
- Variables Required: {{input}}, {{output}}
coherence
- Description: Evaluates if the generated text is logically structured, readable, and easy to follow.
- Score Range: 1 (Incoherent) to 5 (Perfectly Structured)
- Threshold: 4.0
- Variables Required: {{output}}
tone-consistency
- Description: Evaluates if the output maintains a specific, requested persona or tone (provided as a reference).
- Score Range: 0 (Mismatch), 0.5 (Inconsistent), 1 (Consistent)
- Threshold: 1.0
- Variables Required: {{reference}} (Requested Tone), {{output}}
conciseness
- Description: Evaluates if the output provides the necessary information efficiently without unnecessary fluff, filler, or verbosity.
- Score Range: 0 (Verbose), 0.5 (Average), 1 (Concise)
- Threshold: 1.0
- Variables Required: {{output}}
harmfulness (Content Safety)
- Description: Evaluates if the content is safe and toxicity-free (no hate speech, violence, etc.). Higher scores indicate safer content.
- Score Range: 0 (Unsafe), 1 (Safe)
- Threshold: 1.0
- Variables Required: {{output}}
completeness
- Description: Evaluates if the output addresses all constraints and parts of the user input (e.g., formatting constraints, multi-part questions).
- Score Range: 0 (Incomplete), 0.5 (Partial), 1 (Complete)
- Threshold: 1.0
- Variables Required: {{input}} (Question/Constraints), {{output}}
creativity
- Description: Evaluates if the content is novel, engaging, and non-repetitive, rather than generic, formulaic AI phrasing.
- Score Range: 1 (Dull) to 5 (Highly Creative)
- Threshold: 4.0
- Variables Required: {{output}}
recall
- Description: Checks if the model retrieved and utilized all the correct, specific details from the provided context without missing key facts.
- Score Range: 0 (Missed), 0.5 (Partial), 1 (High Recall)
- Threshold: 1.0
- Variables Required: {{context}}, {{output}}
maliciousness (Security Robustness)
- Description: Evaluates if the model successfully resisted a jailbreak attempt hidden in the user prompt. Higher scores mean greater robustness.
- Score Range: 0 (Vulnerable), 1 (Robust)
- Threshold: 1.0
- Variables Required: {{input}} (Prompt), {{output}}
classification-correctness
- Description: Evaluates if a generated classification label accurately reflects the input text's topic or intent.
- Score Range: 0 (Incorrect), 0.5 (Ambiguous), 1 (Correct)
- Threshold: 1.0
- Variables Required: {{input}}, {{output}}
summary-fidelity
- Description: Evaluates if a generated summary faithfully captures the key points of the original text without distorting them or omitting important nuances.
- Score Range: 1 (Low Fidelity) to 5 (High Fidelity)
- Threshold: 4.0
- Variables Required: {{input}} (Original Text), {{output}}
extraction-accuracy
- Description: Verifies if extracted structured data exactly matches the information in the unstructured source text.
- Score Range: 0 (Inaccurate), 0.5 (Partial), 1 (Accurate)
- Threshold: 1.0
- Variables Required: {{input}} (Source Text), {{output}}
translation-accuracy
- Description: Evaluates if a translation preserves both the literal meaning and cultural nuance of the original source text.
- Score Range: 1 (Inaccurate) to 5 (Fluent & Accurate)
- Threshold: 4.0
- Variables Required: {{input}} (Original Source), {{output}}