ruby_llm-contract

Contracts + Evals for ruby_llm.

Your eval passed. Prod broke anyway? This gem wraps RubyLLM::Chat with input/output contracts, business-rule validation, retry with model escalation on validation failure, pre-flight cost ceilings, and a regression-eval framework — so a flaky cheap-model call escalates to a stronger model instead of shipping garbage to your user.

ruby_llm handles the HTTP side (rate limits, timeouts, streaming, tool calls, embeddings). This gem handles what the model returned at runtime: schema validation, business rules, model escalation on failed validation, regression datasets that gate prompt/model changes in CI.

Install

gem "ruby_llm-contract"

RubyLLM.configure do |c|
  c.openai_api_key = ENV["OPENAI_API_KEY"]
  c.default_model  = "gpt-4.1-mini"   # used when a Step has no explicit model
end

# Required: boots the gem so `Step.run` knows how to talk to your LLM.
# Empty block is fine. Pass options here if you need them (e.g. `c.logger`).
RubyLLM::Contract.configure { }

Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc). Requires ruby_llm ~> 1.12 and Ruby ≥ 3.2.

Example

A Rails app takes article text extracted from a user-submitted URL and wants to show a summary card: a short TL;DR, 3–5 key takeaways, and a tone label. The output has to fit the UI (TL;DR under 200 chars) and the schema has to be strict enough to render without conditionals.

# app/contracts/summarize_article.rb
class SummarizeArticle < RubyLLM::Contract::Step::Base
  prompt <<~PROMPT
    Summarize this article for a UI card. Return a short TL;DR,
    3 to 5 key takeaways, and a tone label.

    {input}
  PROMPT

  output_schema do
    string :tldr
    array  :takeaways, of: :string, min_items: 3, max_items: 5
    string :tone, enum: %w[neutral positive negative analytical]
  end

  validate("TL;DR fits the card")  { |o, _| o[:tldr].length <= 200 }
  validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }

  # Cheapest first; last step adds a reasoning model with more thinking.
  retry_policy do
    escalate "gpt-4.1-nano",
             "gpt-4.1-mini",
             { model: "gpt-5", reasoning_effort: "high" }
  end
end

result = SummarizeArticle.run(article_text)
result.status           # => :ok  (or :validation_failed if all steps fail)
result.parsed_output    # => { tldr: "...", takeaways: [...], tone: "..." }
result.trace[:model]    # => "gpt-4.1-mini"  (winning step)
result.trace[:cost]     # => 0.000520        (total across all attempts)

result.trace[:attempts]
# => [
#      {
#        attempt: 1,
#        model: "gpt-4.1-nano",
#        status: :validation_failed,
#        usage: { input_tokens: 256, output_tokens: 84 },
#        latency_ms: 45,
#        cost: 0.000100
#      },
#      {
#        attempt: 2,
#        model: "gpt-4.1-mini",
#        status: :ok,
#        usage: { input_tokens: 256, output_tokens: 92 },
#        latency_ms: 92,
#        cost: 0.000420
#      }
#    ]

If the response is malformed, the TL;DR overflows the card, or the takeaway count is off, the gem moves to the next step. This is model escalation, not a fallback list — each step is an independent config (model, reasoning_effort), so the retry policy spends more compute only when the cheaper one couldn't satisfy the contract.

Add a CI gate in 6 lines

The contract above already runs in production. The same Step doubles as the unit your regression eval runs against:

SummarizeArticle.define_eval("regression") do
  # `expected:` is a partial hash match — only listed keys check parsed_output.
  add_case "neutral release",
           input: "Ruby 3.4 shipped frozen string literals...",
           expected: { tone: "analytical" }
  add_case "outage post",
           input: "Service was down for 4 hours...",
           expected: { tone: "negative" }
end

# in CI (RSpec):
expect(SummarizeArticle).to pass_eval("regression").without_regressions

A bad prompt edit or model swap that drops accuracy on the frozen dataset → red CI, blocked merge. The first CI run records a baseline; subsequent runs compare against it. Every production miss should become the next add_case. See Prevent silent prompt regressions for the full flywheel.

Do I need this?

Use this if LLM output affects production behaviour, money, user trust, or downstream code. You probably don't need it if you have one low-risk prompt, manually inspect every result, or only generate best-effort prose.

Already using structured outputs from your provider? This gem adds business-rule validation, retry with model escalation, evals, regression gating, and test stubs on top of them — the layer that stops schema-valid-but-wrong output from reaching users. See Why contracts? for the four production failure modes the gem exists for.

Most useful next

Everything below is optional — the example above is a complete step. Reach for these when one step isn't enough.

CI regression gates — block CI when accuracy drops on a model update or prompt tweak.
Find the cheapest viable fallback list — empirically pick the cheapest model chain that still passes your evals.
A/B test prompts — measure whether a new prompt is safe to ship before merging.
Budget caps — refuse the request pre-flight when an estimate exceeds the limit.
Reasoning effort / thinking config — Anthropic / OpenAI thinking configuration on the Step class.

Also supports multi-step pipelines with fail-fast and per-step models.

Relation to `RubyLLM::Agent`

Step::Base and RubyLLM::Agent (since RubyLLM 1.12) are siblings targeting the same niche: reusable, class-based prompts. Both call into RubyLLM::Chat directly — Step does not wrap Agent. Step adds the contract layer: validate (business invariants), retry_policy escalate(...) (model escalation on validation failure), max_cost pre-flight refusal, regression-eval framework, pipeline composition. Full feature mapping →

Relation to `ruby_llm-tribunal`

Different layers, complementary. ruby_llm-tribunal is a test framework that grades outputs after they've reached your code, typically in a spec. ruby_llm-contract is runtime — schema + validate rules gate the call before the output reaches your code, retry/escalate attempts to recover from failed outputs, max_cost refuses pre-flight. Our define_eval is regression (does this prompt/model still pass on a frozen dataset?), not grading.

One-liner: Tribunal answers "is this output good?" (fail → red test in CI). Contract answers "what do we do when it isn't?" (fail → retry/escalate, or fail closed). Visual flows + coexistence patterns →

Docs

New here? Read in order: this README → Why contracts? → Getting Started.

Guide	What it does for your app
Why contracts?	Recognise the four production failures the gem exists for
Relation to RubyLLM::Agent	Sibling abstractions; what each adds; runtime call path; coexistence patterns
Relation to ruby_llm-tribunal	Different layers (test framework vs runtime contract); visual flows; integration recipes
Getting Started	Walk the full feature set on one concrete step
Rails integration	Directory, initializer, jobs, logging, specs, CI gate — 7 FAQs for Rails devs
Adopt in an existing Rails app	Replace raw `LlmClient.call` with a contract, Before/After
Prevent silent prompt regressions	Evals, baselines, CI gates that block quality drift
Control retry cost and fallback behaviour	Find the cheapest viable fallback list empirically
Write validate rules that catch real bugs	Patterns for cross-input checks and content-quality rules
Stub LLM calls in tests	Deterministic specs, RSpec + Minitest matchers
Chain LLM calls into a pipeline	Multi-step with fail-fast and per-step models
Schema DSL reference	Every constraint, nested objects, pattern table
Prompt DSL reference	`system` / `rule` / `section` / `example` / `user` nodes

Status & versioning

Pre-1.0 (currently 0.8.0). Semver tracked; breaking changes flagged in CHANGELOG. Pin ~> 0.8.0 until 1.0 ships.

FAQ

Thread-safe / Sidekiq? Yes. Each Step.run builds an isolated RubyLLM::Chat; class-level state (output_schema, validate, retry_policy) is set up once at class load and read-only afterwards. Safe to run from concurrent jobs/threads.

How do I stub Step.run in specs? Include RubyLLM::Contract::RSpec::Helpers and use stub_step(MyStep, response: { ... }). The block form scopes the stub to one it. See testing guide.

Where in a Rails app? Default app/contracts/. The Railtie reloads app/contracts/eval/ and app/steps/eval/ in development; any autoloaded directory also works. See Rails integration.

License

MIT