Evaluation Framework
The Evaluation Framework provides comprehensive assessment of AI outputs across multiple evaluation dimensions. Instead of relying on a single metric, it runs multiple independent evaluations and uses statistical consensus to reduce variance and improve reliability.
Available Evaluators
TruthVouch includes 9 built-in evaluators covering diverse quality dimensions:
- Factual Accuracy — Checks if claims are factually correct against your knowledge base
- Semantic Similarity — Measures whether the output captures the intended meaning
- Response Completeness — Ensures the response fully addresses the user’s request
- Hallucination Score — Detects fabricated information not grounded in source material
- Citation Verification — Confirms citations accurately reference sources
- Prompt Injection Detection — Identifies adversarial prompts attempting to override system behavior
- Toxicity & Safety — Flags harmful, offensive, or unsafe content
- Bias Detection — Identifies stereotyping, unfair representation, or discriminatory language
- Custom Evaluators — Define your own evaluation criteria and scoring rubrics
ChainPoll Consensus
ChainPoll runs multiple LLM-based evaluations on the same output and combines results using statistical consensus:
```
Input → [Evaluator A] → Score: 0.92
        [Evaluator B] → Score: 0.88
        [Evaluator C] → Score: 0.90
                 ↓
        Consensus: 0.90 (mean)
        Confidence: 0.85 (low variance)
```

This approach reduces individual evaluator variance and improves reliability. The consensus score is the mean across all runs, while confidence reflects the variance:
- High variance (low confidence) — Evaluators disagree; result is uncertain
- Low variance (high confidence) — Evaluators agree; result is reliable
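The consensus and confidence computation above can be sketched as follows. The variance-to-confidence mapping (the `max_variance` normalizer) is an assumption for illustration, not TruthVouch's documented formula:

```python
from statistics import mean, pvariance

def chainpoll_consensus(scores, max_variance=0.05):
    """Combine independent evaluator scores into a consensus.

    Consensus is the mean score; confidence maps variance onto [0, 1],
    where zero variance (perfect agreement) gives confidence 1.0.
    The max_variance normalizer here is an illustrative assumption.
    """
    consensus = mean(scores)
    variance = pvariance(scores)
    confidence = max(0.0, 1.0 - variance / max_variance)
    return consensus, variance, confidence

# Three evaluator runs from the diagram above
consensus, variance, confidence = chainpoll_consensus([0.92, 0.88, 0.90])
```

With tightly clustered scores like these, variance is tiny and confidence approaches 1.0; widely scattered scores drive confidence toward 0.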
Configuration
When running an evaluation, specify:
- Number of samples — How many times to run each evaluator (default: 3)
- Consensus threshold — Minimum agreement level required to pass (default: 0.75)
- Evaluator selection — Which evaluators to run (run all, or select specific ones)
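As a rough sketch, these options might be expressed as a run configuration like the following. The field names mirror the options above but are illustrative, not the documented TruthVouch API:

```python
# Illustrative evaluation-run configuration (field names are assumptions)
evaluation_config = {
    "samples": 3,                  # how many times to run each evaluator
    "consensus_threshold": 0.75,   # minimum agreement required to pass
    "evaluators": "all",           # or a list like ["Factual Accuracy"]
}

def run_passes(consensus_score, config):
    """A run passes only if its consensus score meets the configured threshold."""
    return consensus_score >= config["consensus_threshold"]
```

With the defaults above, a consensus of 0.80 passes while 0.70 does not.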
Custom Evaluator Builder
Create custom evaluators tailored to your domain:
- Go to Governance Hub → Evaluation Framework
- Click Create Custom Evaluator
- Define:
- Name — Descriptive evaluator name (e.g., “Medical Terminology Accuracy”)
- Description — What this evaluator measures
- Rubric — Scoring criteria (1-5 point scale, with descriptors for each level)
- Examples — Sample inputs and expected scores for calibration
- Test the evaluator against sample outputs
- Deploy to make available for all evaluations
Custom evaluators use the same LLM-based scoring approach as built-in evaluators, and the examples you provide calibrate their judgments.
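A custom evaluator definition might look like the following sketch. The schema and field contents are hypothetical, mirroring the builder form above rather than any documented storage format:

```python
# Hypothetical custom evaluator definition (schema is illustrative)
custom_evaluator = {
    "name": "Medical Terminology Accuracy",
    "description": "Checks that clinical terms are used correctly in context.",
    "rubric": {
        1: "Terms consistently misused or fabricated",
        2: "Frequent terminology errors",
        3: "Mostly correct with occasional misuse",
        4: "Correct usage with minor imprecision",
        5: "Precise, contextually appropriate terminology",
    },
    "examples": [
        {"input": "The patient presented with myocardial infarction.",
         "expected_score": 5},
        {"input": "The patient's heart attack of the lungs worsened.",
         "expected_score": 1},
    ],
}
```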
Agentic Evaluation Metrics
For agentic AI systems (agents that take actions, call tools, etc.), specialized metrics assess:
- Tool Selection Accuracy — Did the agent choose the right tool for the task?
- Action Completion Rate — What percentage of tool calls succeeded?
- Tool Error Recovery — When a tool fails, does the agent retry or recover gracefully?
- Plan Coherence — Does the agent’s action sequence make logical sense?
- Resource Efficiency — Did the agent accomplish the task with minimal tool calls?
These metrics are automatically available when evaluating agent outputs.
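As a concrete example, Action Completion Rate can be computed from a tool-call log. The record shape (`tool` and `success` keys) is an assumption for illustration:

```python
def action_completion_rate(tool_calls):
    """Fraction of tool calls that succeeded.

    Each tool call is a dict with a boolean "success" key; this record
    shape is an illustrative assumption, not a documented log format.
    """
    if not tool_calls:
        return 0.0
    return sum(1 for call in tool_calls if call["success"]) / len(tool_calls)

# A hypothetical agent trace: three successes out of four calls
calls = [
    {"tool": "search", "success": True},
    {"tool": "calculator", "success": True},
    {"tool": "search", "success": False},
    {"tool": "search", "success": True},
]
```

Here the completion rate is 0.75; a failed call followed by a successful retry would also feed the Tool Error Recovery metric.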
Uncertainty Scoring
Evaluation results include uncertainty estimates computed via multi-sample variance analysis:
- Confidence Score (0-1) — How confident the evaluation result is
- Variance — Statistical spread across samples
- Sample Count — Number of evaluations run
Example:
```json
{
  "evaluator": "Factual Accuracy",
  "score": 0.87,
  "confidence": 0.92,
  "variance": 0.015,
  "samples": 5,
  "interpretation": "High confidence: evaluators consistently agree"
}
```

Use confidence scores to:
- Flag uncertain results for manual review (confidence < 0.7)
- Require additional sampling on borderline scores (0.4-0.6)
- Automate decisions on high-confidence results (> 0.85)
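The three rules above can be combined into a simple routing function. This is a sketch; the tier names and the conservative fallback are illustrative choices, not TruthVouch behavior:

```python
def route_result(score, confidence):
    """Route an evaluation result using the thresholds listed above."""
    if confidence < 0.7:
        return "manual_review"      # evaluators disagree; a human should look
    if 0.4 <= score <= 0.6:
        return "resample"           # borderline score; run more samples
    if confidence > 0.85:
        return "auto_decide"        # safe to automate
    # Middle zone (0.7 <= confidence <= 0.85, non-borderline score):
    # falling back to manual review is a conservative assumption.
    return "manual_review"
```

For example, a score of 0.87 at confidence 0.92 (as in the JSON result above) would be auto-decided.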
Evaluation Framework UI
The Evaluation Framework lives in the Governance Hub and provides:
Configuration Screen
- Select evaluators to run
- Set consensus thresholds
- Configure sample counts
- Manage custom evaluators
- View evaluator performance history
Run Evaluations
```
Input Batch → Configure Evaluators → Run → Monitor Progress
                                              ↓
                                      Results Dashboard
```

Results Dashboard
View evaluation results across dimensions:
- Evaluator Scores — Table showing each evaluator’s score and confidence
- Consensus Score — Overall evaluation result
- Trend Analysis — How scores change over time
- Failed Evaluations — Which specific checks failed and why
- Distribution — Histogram of scores across your evaluation history
Custom Evaluator Management
- List all custom evaluators with performance metrics
- Edit rubrics and examples
- Test evaluator against sample data
- Archive or delete unused evaluators
- Review evaluator training examples
Integration Examples
Batch Evaluation
Evaluate multiple outputs at once:
```python
from truthvouch import TruthVouchClient

client = TruthVouchClient(api_key="your-api-key")

outputs = [
    "The capital of France is Paris.",
    "Machine learning is a type of artificial intelligence.",
    "The moon orbits the earth in 28 days.",
]

results = []
for output in outputs:
    result = client.evaluate_output(
        text=output,
        model="gpt-4",
    )
    results.append({
        "output": output,
        "blocked": result.blocked,
        "flagged": result.flagged,
    })
```

Streaming Evaluation
For streaming responses, evaluate in chunks or at completion:
```python
# Evaluate the complete response after streaming
full_response = ""
async for chunk in llm.stream_response(prompt):
    full_response += chunk

# Final evaluation
result = await client.evaluate_output(
    text=full_response,
    model="gpt-4",
)

if result.blocked:
    # Response violates policy
    return {"error": result.block_reasons}
```

Best Practices
Choosing Evaluators
- General content — Use Hallucination, Toxicity & Safety, Bias Detection
- Domain-specific — Add custom evaluators for your industry
- Agent outputs — Use agentic metrics for agent evaluation
- Strict compliance — Run all evaluators; set high consensus thresholds
Setting Thresholds
- Critical content (medical, legal, financial) — 0.9+ consensus, all evaluators
- General use — 0.75 consensus, standard evaluators
- Uncertain cases — Flag for human review if confidence < 0.7
Sampling Strategy
- Fast evaluation — 1-2 samples, trade off reliability for speed
- Balanced — 3-5 samples (default), good reliability with acceptable latency
- High assurance — 10+ samples, maximum reliability for critical decisions
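A minimal helper mapping these strategies to sample counts; the specific values are illustrative, following the guidance above:

```python
def sample_count(strategy):
    """Map a sampling strategy to an illustrative sample count."""
    counts = {
        "fast": 2,             # trade reliability for speed
        "balanced": 3,         # the default: good reliability, acceptable latency
        "high_assurance": 10,  # maximum reliability for critical decisions
    }
    return counts[strategy]
```

Higher counts tighten the variance estimate behind the confidence score, at the cost of proportionally more evaluator calls per output.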