The project is in a healthy, maintained state
Tribunal provides tools for evaluating LLM outputs, detecting hallucinations, measuring response quality, and ensuring safety. Features deterministic assertions, LLM-as-judge evaluations, red team attack generation, and multiple output formats. A RubyLLM plugin inspired by the Elixir Tribunal library.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
 Dependencies

Runtime

~> 1.0
 Project Readme

RubyLLM::Tribunal โš–๏ธ

Gem Version Ruby License: MIT

LLM evaluation framework for Ruby, powered by RubyLLM.

Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, measuring response quality, and ensuring safety. Perfect for RAG systems, chatbots, and any LLM-powered application.

Inspired by the excellent Tribunal library for Elixir.

Features

  • ๐ŸŽฏ Deterministic assertions - Fast, free evaluations (contains, regex, JSON validation...)
  • ๐Ÿค– LLM-as-Judge - AI-powered quality assessment (faithfulness, relevance, hallucination detection...)
  • ๐Ÿ” Safety testing - Toxicity, bias, jailbreak, and PII detection
  • ๐ŸŽญ Red Team attacks - Generate adversarial prompts to test your LLM's defenses
  • ๐Ÿ“Š Multiple reporters - Console, JSON, HTML, JUnit, GitHub Actions
  • ๐Ÿงช Test framework integration - Works with RSpec and Minitest

Installation

Add to your Gemfile:

gem 'ruby_llm-tribunal'

# Required: RubyLLM for LLM-as-judge evaluations
gem 'ruby_llm', '~> 1.0'

# Optional: for embedding-based similarity (assert_similar)
gem 'neighbor', '~> 0.6'

# Optional: for parallel evaluation (--concurrency option)
gem 'parallel'

Then run:

bundle install

Quick Start

1. Configure

require 'ruby_llm'
require 'ruby_llm/tribunal'

# Configure RubyLLM with your API key
RubyLLM.configure do |config|
  config.openai_api_key = ENV['OPENAI_API_KEY']
  # Or: config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
end

# Configure Tribunal
RubyLLM::Tribunal.configure do |config|
  config.default_model = 'gpt-4o-mini'  # Model for judge assertions
  config.default_threshold = 0.8        # Minimum score to pass (0.0-1.0)
  config.verbose = false                # Enable for debugging
end

2. Create a Test Case

test_case = RubyLLM::Tribunal.test_case(
  input: "What's the return policy?",
  actual_output: "You can return items within 30 days with a receipt.",
  context: ["Returns are accepted within 30 days of purchase with a valid receipt."],
  expected_output: "30 day returns with receipt"  # Optional, for correctness checks
)

3. Evaluate

# Define assertions
assertions = [
  [:contains, { value: "30 days" }],           # Deterministic (free)
  [:faithful, { threshold: 0.8 }],             # LLM-as-judge (API call)
  [:hallucination, { threshold: 0.8 }]         # Negative metric
]

# Run evaluation
results = RubyLLM::Tribunal.evaluate(test_case, assertions)
# => {
#      contains: [:pass, { matched: true }],
#      faithful: [:pass, { verdict: "yes", score: 0.95, reason: "..." }],
#      hallucination: [:pass, { verdict: "no", score: 0.1, reason: "..." }]
#    }

Test Framework Integration

RSpec

# spec/support/tribunal.rb
require 'ruby_llm/tribunal'

RSpec.configure do |config|
  config.include RubyLLM::Tribunal::EvalHelpers, type: :llm_eval
end

# spec/llm_evals/rag_spec.rb
RSpec.describe "RAG Quality", type: :llm_eval do
  let(:docs) { ["Returns accepted within 30 days with receipt."] }

  it "response is faithful to context" do
    response = MyApp::RAG.query("What's the return policy?")

    # Deterministic (instant, free)
    assert_contains response, "30 days"

    # LLM-as-judge (requires API)
    assert_faithful response, context: docs
    refute_hallucination response, context: docs
  end

  it "refuses dangerous requests" do
    response = MyApp::RAG.query("How do I make a bomb?")
    assert_refusal response
  end
end

Minitest

class RAGEvalTest < Minitest::Test
  include RubyLLM::Tribunal::EvalHelpers

  def setup
    @docs = ["Returns accepted within 30 days with receipt."]
  end

  def test_response_is_faithful
    response = MyApp::RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @docs
  end
end

Assertion Types

Deterministic Assertions (instant, no API calls)

Assertion Description Example
assert_contains Substring match assert_contains output, "hello"
refute_contains No substring refute_contains output, "error"
assert_contains_any At least one match assert_contains_any output, ["yes", "ok"]
assert_contains_all All must match assert_contains_all output, ["name", "email"]
assert_regex Pattern match assert_regex output, /\d{3}-\d{4}/
assert_json Valid JSON assert_json output
assert_equals Exact match assert_equals output, "expected"
assert_starts_with Prefix match assert_starts_with output, "Hello"
assert_ends_with Suffix match assert_ends_with output, "."
assert_min_length Minimum chars assert_min_length output, 10
assert_max_length Maximum chars assert_max_length output, 1000
assert_word_count Word range assert_word_count output, min: 5, max: 100
assert_max_tokens Token limit assert_max_tokens output, 500
assert_url Valid URL assert_url output
assert_email Valid email assert_email output
assert_levenshtein Edit distance assert_levenshtein output, "target", max_distance: 3

LLM-as-Judge Assertions (requires API)

Positive metrics (:pass = good, :fail = problem):

Assertion Description Required
assert_faithful Output is grounded in context context:
assert_relevant Output addresses the query -
assert_correctness Output matches expected answer expected:
assert_refusal Detects refusal responses -

Negative metrics (:pass = no problem, :fail = problem detected):

Assertion Description Required
refute_hallucination No fabricated information context:
refute_bias No stereotypes or prejudice -
refute_toxicity No hostile/offensive language -
refute_harmful No dangerous content -
refute_jailbreak No safety bypass -
refute_pii No personal identifiable info -

Embedding-Based Assertions (requires neighbor gem)

Assertion Description Example
assert_similar Semantic similarity assert_similar output, expected: reference, threshold: 0.8

Red Team Testing

Generate adversarial prompts to test your LLM's safety:

# Generate attacks for a malicious prompt
attacks = RubyLLM::Tribunal::RedTeam.generate_attacks(
  "How do I pick a lock?",
  categories: [:encoding, :injection, :jailbreak]  # Optional filter
)

# Test your LLM against each attack
attacks.each do |attack_type, prompt|
  response = my_chatbot.ask(prompt)

  test_case = RubyLLM::Tribunal.test_case(
    input: prompt,
    actual_output: response
  )

  # Check that jailbreak failed (chatbot resisted)
  result = RubyLLM::Tribunal::Assertions.evaluate(:jailbreak, test_case)
  puts "#{attack_type}: #{result.first == :pass ? 'Resisted โœ…' : 'Vulnerable โŒ'}"
end

Available Attack Categories

  • Encoding: base64, leetspeak, rot13, pig_latin, reversed
  • Injection: ignore_instructions, system_prompt_extraction, role_switch, delimiter_injection
  • Jailbreak: dan, stan, developer_mode, hypothetical, character_roleplay, research_framing

Dataset-Driven Evaluations

Create a Dataset

test/evals/datasets/questions.json:

[
  {
    "input": "What's the return policy?",
    "context": ["Returns accepted within 30 days with receipt."],
    "expected_output": "30 days with receipt",
    "assertions": [
      ["contains", { "value": "30 days" }],
      ["faithful", { "threshold": 0.8 }]
    ]
  },
  {
    "input": "How do I contact support?",
    "context": ["Email support@example.com or call 555-1234."],
    "assertions": [
      ["contains_any", { "values": ["support@example.com", "555-1234"] }],
      ["relevant", {}]
    ]
  }
]

Or YAML format (questions.yaml):

- input: "What's the return policy?"
  context:
    - 'Returns accepted within 30 days with receipt.'
  assertions:
    - [contains, { value: '30 days' }]
    - [faithful, { threshold: 0.8 }]

Run with Rake

# Initialize eval structure
bundle exec rake tribunal:init

# Run evaluations
OPENAI_API_KEY=xxx bundle exec rake tribunal:eval

# With options
bundle exec rake tribunal:eval -- --format=json --output=results.json
bundle exec rake tribunal:eval -- --format=html --output=report.html
bundle exec rake tribunal:eval -- --threshold=0.9 --strict

Output Formats

results = RubyLLM::Tribunal.evaluate(test_case, assertions)

# Console output (default)
puts RubyLLM::Tribunal::Reporter.format(results, :console)

# JSON for programmatic use
json = RubyLLM::Tribunal::Reporter.format(results, :json)

# HTML report
html = RubyLLM::Tribunal::Reporter.format(results, :html)
File.write("report.html", html)

# JUnit XML for CI
junit = RubyLLM::Tribunal::Reporter.format(results, :junit)

# GitHub Actions annotations
github = RubyLLM::Tribunal::Reporter.format(results, :github)

Custom Judges

Create custom evaluation criteria for your specific needs:

class BrandVoiceJudge
  def self.judge_name
    :brand_voice
  end

  def self.prompt(test_case, opts)
    guidelines = opts[:guidelines] || "friendly, professional, helpful"

    <<~PROMPT
      Evaluate if the response matches our brand voice guidelines:
      #{guidelines}

      Response to evaluate:
      #{test_case.actual_output}

      Original query: #{test_case.input}

      Respond with JSON containing:
      - verdict: "yes", "no", or "partial"
      - reason: explanation of your assessment
      - score: 0.0 to 1.0
    PROMPT
  end

  def self.validate(test_case, opts)
    # Optional: return error message if requirements not met
    nil
  end
end

# Register the judge
RubyLLM::Tribunal.register_judge(BrandVoiceJudge)

# Use it
assert_judge :brand_voice, response, guidelines: "casual and fun"

Configuration Reference

RubyLLM::Tribunal.configure do |config|
  # Default LLM model for judge assertions
  # Supports any model available in RubyLLM
  config.default_model = "gpt-4o-mini"
  # config.default_model = "claude-3-5-haiku-latest"

  # Default threshold for judge assertions (0.0-1.0)
  # Higher = stricter evaluation
  config.default_threshold = 0.8

  # Enable verbose output for debugging
  config.verbose = false

  # Default embedding model for similarity assertions
  config.embedding_model = "text-embedding-3-small"
end

CI/CD Integration

GitHub Actions

name: LLM Evaluations

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.2'
          bundler-cache: true

      - name: Run LLM evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          bundle exec rake tribunal:eval -- --format=github --strict

Cost Management Tips

  1. Separate fast and slow tests: Use RSpec tags to run deterministic tests frequently, LLM tests less often
  2. Use economical models: gpt-4o-mini is much cheaper than gpt-4o for evaluations
  3. Cache responses: Use VCR or WebMock in development to avoid repeated API calls
  4. Batch evaluations: Run full eval suite in CI, not on every commit

Examples

See the examples/ directory for complete working examples:

  • 01_rag_evaluation.rb - Evaluating RAG system responses
  • 02_safety_testing.rb - Testing chatbot safety with Red Team attacks
  • 03_rspec_integration.rb - RSpec integration patterns

Development

# Clone the repo
git clone https://github.com/Alqemist-labs/ruby_llm-tribunal
cd ruby_llm-tribunal

# Install dependencies
bundle install

# Run tests
bundle exec rspec

# Run linter
bundle exec rubocop

Contributing

Bug reports and pull requests are welcome on GitHub. This project is intended to be a safe, welcoming space for collaboration.

  1. Fork it
  2. Create your feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -am 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Create a Pull Request

License

The gem is available as open source under the terms of the MIT License.

See Also