Evaluation Framework
The Evaluation Framework provides comprehensive assessment of AI outputs across multiple evaluation dimensions. Instead of relying on a single metric, it runs multiple independent evaluations and uses statistical consensus to reduce variance and improve reliability.
Available Evaluators
TruthVouch includes 9 built-in evaluators covering diverse quality dimensions:
- Factual Accuracy — Checks if claims are factually correct against your knowledge base
- Semantic Similarity — Measures whether the output captures the intended meaning
- Response Completeness — Ensures the response fully addresses the user’s request
- Hallucination Score — Detects fabricated information not grounded in source material
- Citation Verification — Confirms citations accurately reference sources
- Prompt Injection Detection — Identifies adversarial prompts attempting to override system behavior
- Toxicity & Safety — Flags harmful, offensive, or unsafe content
- Bias Detection — Identifies stereotyping, unfair representation, or discriminatory language
- Custom Evaluators — Define your own evaluation criteria and scoring rubrics
ChainPoll Consensus
ChainPoll runs multiple LLM-based evaluations on the same output and combines results using statistical consensus:
```
Input → [Evaluator A] → Score: 0.92
        [Evaluator B] → Score: 0.88
        [Evaluator C] → Score: 0.90
                 ↓
        Consensus: 0.90 (mean)
        Confidence: 0.85 (low variance)
```

This approach reduces individual evaluator variance and improves reliability. The consensus score is the mean across all runs, while confidence reflects the variance:
- High variance (low confidence) — Evaluators disagree; result is uncertain
- Low variance (high confidence) — Evaluators agree; result is reliable
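The consensus and confidence computation above can be sketched as follows. The variance-to-confidence mapping (the `max_variance` normalizer) is an assumption for illustration, not TruthVouch's documented formula:

```python
from statistics import mean, pvariance

def chainpoll_consensus(scores, max_variance=0.05):
    """Combine independent evaluator scores into a consensus.

    Consensus is the mean score; confidence maps variance onto [0, 1],
    where zero variance (perfect agreement) gives confidence 1.0.
    The max_variance normalizer here is an illustrative assumption.
    """
    consensus = mean(scores)
    variance = pvariance(scores)
    confidence = max(0.0, 1.0 - variance / max_variance)
    return consensus, variance, confidence

# Three evaluator runs from the diagram above
consensus, variance, confidence = chainpoll_consensus([0.92, 0.88, 0.90])
```

With tightly clustered scores like these, variance is tiny and confidence approaches 1.0; widely scattered scores drive confidence toward 0.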
Configuration
When running an evaluation, specify:
- Number of samples — How many times to run each evaluator (default: 3)
- Consensus threshold — Minimum agreement level required to pass (default: 0.75)
- Evaluator selection — Which evaluators to run (run all, or select specific ones)
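As a rough sketch, these options might be expressed as a run configuration like the following. The field names mirror the options above but are illustrative, not the documented TruthVouch API:

```python
# Illustrative evaluation-run configuration (field names are assumptions)
evaluation_config = {
    "samples": 3,                  # how many times to run each evaluator
    "consensus_threshold": 0.75,   # minimum agreement required to pass
    "evaluators": "all",           # or a list like ["Factual Accuracy"]
}

def run_passes(consensus_score, config):
    """A run passes only if its consensus score meets the configured threshold."""
    return consensus_score >= config["consensus_threshold"]
```

With the defaults above, a consensus of 0.80 passes while 0.70 does not.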
Custom Evaluator Builder
Create custom evaluators tailored to your domain:
- Go to Governance Hub → Evaluation Framework
- Click Create Custom Evaluator
- Define:
- Name — Descriptive evaluator name (e.g., “Medical Terminology Accuracy”)
- Description — What this evaluator measures
- Rubric — Scoring criteria (1-5 point scale, with descriptors for each level)
- Examples — Sample inputs and expected scores for calibration
- Test the evaluator against sample outputs
- Deploy to make available for all evaluations
Custom evaluators use the same LLM-based scoring approach as built-in evaluators, and the examples you provide calibrate their judgments.
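A custom evaluator definition might look like the following sketch. The schema and field contents are hypothetical, mirroring the builder form above rather than any documented storage format:

```python
# Hypothetical custom evaluator definition (schema is illustrative)
custom_evaluator = {
    "name": "Medical Terminology Accuracy",
    "description": "Checks that clinical terms are used correctly in context.",
    "rubric": {
        1: "Terms consistently misused or fabricated",
        2: "Frequent terminology errors",
        3: "Mostly correct with occasional misuse",
        4: "Correct usage with minor imprecision",
        5: "Precise, contextually appropriate terminology",
    },
    "examples": [
        {"input": "The patient presented with myocardial infarction.",
         "expected_score": 5},
        {"input": "The patient's heart attack of the lungs worsened.",
         "expected_score": 1},
    ],
}
```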
Agentic Evaluation Metrics
For agentic AI systems (agents that take actions, call tools, etc.), specialized metrics assess:
- Tool Selection Accuracy — Did the agent choose the right tool for the task?
- Action Completion Rate — What percentage of tool calls succeeded?
- Tool Error Recovery — When a tool fails, does the agent retry or recover gracefully?
- Plan Coherence — Does the agent’s action sequence make logical sense?
- Resource Efficiency — Did the agent accomplish the task with minimal tool calls?
These metrics are automatically available when evaluating agent outputs.
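As a concrete example, Action Completion Rate can be computed from a tool-call log. The record shape (`tool` and `success` keys) is an assumption for illustration:

```python
def action_completion_rate(tool_calls):
    """Fraction of tool calls that succeeded.

    Each tool call is a dict with a boolean "success" key; this record
    shape is an illustrative assumption, not a documented log format.
    """
    if not tool_calls:
        return 0.0
    return sum(1 for call in tool_calls if call["success"]) / len(tool_calls)

# A hypothetical agent trace: three successes out of four calls
calls = [
    {"tool": "search", "success": True},
    {"tool": "calculator", "success": True},
    {"tool": "search", "success": False},
    {"tool": "search", "success": True},
]
```

Here the completion rate is 0.75; a failed call followed by a successful retry would also feed the Tool Error Recovery metric.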
Uncertainty Scoring
Evaluation results include uncertainty estimates computed via multi-sample variance analysis:
- Confidence Score (0-1) — How confident the evaluation result is
- Variance — Statistical spread across samples
- Sample Count — Number of evaluations run
Example:
```json
{
  "evaluator": "Factual Accuracy",
  "score": 0.87,
  "confidence": 0.92,
  "variance": 0.015,
  "samples": 5,
  "interpretation": "High confidence: evaluators consistently agree"
}
```

Use confidence scores to:
- Flag uncertain results for manual review (confidence < 0.7)
- Require additional sampling on borderline scores (0.4-0.6)
- Automate decisions on high-confidence results (> 0.85)
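The three rules above can be combined into a simple routing function. This is a sketch; the tier names and the conservative fallback are illustrative choices, not TruthVouch behavior:

```python
def route_result(score, confidence):
    """Route an evaluation result using the thresholds listed above."""
    if confidence < 0.7:
        return "manual_review"      # evaluators disagree; a human should look
    if 0.4 <= score <= 0.6:
        return "resample"           # borderline score; run more samples
    if confidence > 0.85:
        return "auto_decide"        # safe to automate
    # Middle zone (0.7 <= confidence <= 0.85, non-borderline score):
    # falling back to manual review is a conservative assumption.
    return "manual_review"
```

For example, a score of 0.87 at confidence 0.92 (as in the JSON result above) would be auto-decided.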
Evaluation Framework UI
The Evaluation Framework lives in the Governance Hub and provides:
Configuration Screen
- Select evaluators to run
- Set consensus thresholds
- Configure sample counts
- Manage custom evaluators
- View evaluator performance history
Run Evaluations
```
Input Batch → Configure Evaluators → Run → Monitor Progress
                                              ↓
                                      Results Dashboard
```

Results Dashboard
View evaluation results across dimensions:
- Evaluator Scores — Table showing each evaluator’s score and confidence
- Consensus Score — Overall evaluation result
- Trend Analysis — How scores change over time
- Failed Evaluations — Which specific checks failed and why
- Distribution — Histogram of scores across your evaluation history
Custom Evaluator Management
- List all custom evaluators with performance metrics
- Edit rubrics and examples
- Test evaluator against sample data
- Archive or delete unused evaluators
- Review evaluator training examples
Integration Examples
Batch Evaluation
Evaluate multiple outputs at once:
```python
from truthvouch import TruthVouchClient

client = TruthVouchClient(api_key="your-api-key")

outputs = [
    "The capital of France is Paris.",
    "Machine learning is a type of artificial intelligence.",
    "The moon orbits the earth in 28 days.",
]

results = []
for output in outputs:
    result = client.evaluate_output(
        text=output,
        model="gpt-4",
    )
    results.append({
        "output": output,
        "blocked": result.blocked,
        "flagged": result.flagged,
    })
```

Streaming Evaluation
For streaming responses, evaluate in chunks or at completion:
```python
# Evaluate the complete response after streaming
full_response = ""
async for chunk in llm.stream_response(prompt):
    full_response += chunk

# Final evaluation
result = await client.evaluate_output(
    text=full_response,
    model="gpt-4",
)

if result.blocked:
    # Response violates policy
    return {"error": result.block_reasons}
```

Best Practices
Choosing Evaluators
- General content — Use Hallucination, Toxicity & Safety, Bias Detection
- Domain-specific — Add custom evaluators for your industry
- Agent outputs — Use agentic metrics for agent evaluation
- Strict compliance — Run all evaluators; set high consensus thresholds
Setting Thresholds
- Critical content (medical, legal, financial) — 0.9+ consensus, all evaluators
- General use — 0.75 consensus, standard evaluators
- Uncertain cases — Flag for human review if confidence < 0.7
Sampling Strategy
- Fast evaluation — 1-2 samples, trade off reliability for speed
- Balanced — 3-5 samples (default), good reliability with acceptable latency
- High assurance — 10+ samples, maximum reliability for critical decisions
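A minimal helper mapping these strategies to sample counts; the specific values are illustrative, following the guidance above:

```python
def sample_count(strategy):
    """Map a sampling strategy to an illustrative sample count."""
    counts = {
        "fast": 2,             # trade reliability for speed
        "balanced": 3,         # the default: good reliability, acceptable latency
        "high_assurance": 10,  # maximum reliability for critical decisions
    }
    return counts[strategy]
```

Higher counts tighten the variance estimate behind the confidence score, at the cost of proportionally more evaluator calls per output.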