# RubyLLM::Evals

Test, compare, and improve your LLM prompts within your Rails application.
## Installation

> [!NOTE]
> This engine relies on ActiveJob, ActiveStorage, and RubyLLM. Make sure you have them installed and configured.

Add this line to your application's Gemfile:

```ruby
gem "ruby_llm-evals"
```

And then execute:

```shell
$ bundle
```

To copy and migrate RubyLLM::Evals's migrations, run:

```shell
$ rails ruby_llm_evals:install:migrations db:migrate
```
And then mount the engine in your config/routes.rb:

```ruby
Rails.application.routes.draw do
  # ...
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

Now you can browse to /evals and create, test, compare, and improve your LLM prompts. Read on to see what a typical workflow looks like, and how you can leverage your app's data to add samples to your prompts.
## Authentication and authorization

RubyLLM::Evals leaves authentication and authorization to you. If no authentication is enforced, /evals will be available to everyone.

To enforce authentication, you can use route constraints or set up an HTTP Basic auth middleware.
For example, if you're using Devise, you can do this:

```ruby
# config/routes.rb
authenticate :user do
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

See more examples here.
However, if you're using Rails' default authentication generator, or an authentication solution that doesn't provide constraints, you need to roll your own:

```ruby
# config/routes.rb
constraints ->(request) { Constraints::Auth.authenticated?(request) } do
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

```ruby
# lib/constraints/auth.rb
class Constraints::Auth
  def self.authenticated?(request)
    cookies = ActionDispatch::Cookies::CookieJar.build(request, request.cookies)
    Session.find_by id: cookies.signed[:session_id]
  end
end
```

You can also set up an HTTP Basic auth middleware in the engine:
```ruby
# config/initializers/ruby_llm-evals.rb
RubyLLM::Evals::Engine.middleware.use(Rack::Auth::Basic) do |username, password|
  ActiveSupport::SecurityUtils.secure_compare(Rails.application.credentials.ruby_llm_evals_username, username) &
    ActiveSupport::SecurityUtils.secure_compare(Rails.application.credentials.ruby_llm_evals_password, password)
end
```

## Usage

### Workflow
A typical workflow looks like this:
#### Create a prompt

A prompt represents an LLM prompt template with:

- Provider: see available providers.
- Model: see available models. If you select a local provider (e.g. Ollama), you can enter the model name in a text field.
- Instructions: optional, the system prompt.
- Message: the message template.
- Temperature: optional, controls randomness (0.0 to 1.0). Lower values make output more focused and deterministic.
- Params: optional, additional provider-specific parameters as JSON (e.g., `{"max_tokens": 1000}`).
- Tools: optional, an array of tool class names that the LLM can use (e.g., `["Weather", "Calculator"]`). See how tools are defined in RubyLLM.
- Schema: optional, a Ruby class name (e.g., `User`) to structure the LLM's response, or "other" to provide a custom JSON schema in the Schema Other field. See RubyLLM structured output.
- Thinking effort: optional, controls the reasoning effort level for models that support thinking (e.g., `low`, `medium`, `high`). See how thinking works in RubyLLM.
- Thinking budget: optional, sets a maximum token budget for thinking/reasoning.
Both the instructions and the message template can contain Liquid tags that are rendered at runtime. To add variables, enclose them in double braces, e.g. {{name}}.
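As a sketch of how those placeholders behave, the stand-in below substitutes {{name}}-style variables with a regex. The engine uses Liquid for the real rendering, so filters, conditionals, and other Liquid features aren't covered by this illustration:

```ruby
# Minimal illustration of {{variable}} substitution. The engine renders
# templates with Liquid; this regex stand-in only shows the idea.
def render_template(template, variables)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { variables.fetch(Regexp.last_match(1), "") }
end

message = "Hello {{ name }}, please summarize: {{text}}"
render_template(message, "name" => "Patricio", "text" => "this article")
# => "Hello Patricio, please summarize: this article"
```

Missing variables render as empty strings here; real Liquid behavior depends on your template options.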
> [!NOTE]
> To use a provider, you must have it configured in config/initializers/ruby_llm.rb as explained here.
#### Add samples

When creating or editing a prompt you can add samples, where you define:

- Variables: JSON containing the values to use when executing the prompt, e.g. `{ "name": "Patricio" }`.
- Eval type: the evaluation criteria: exact match, contains, regex, or human review.
- Expected output: optional if the eval type is human review.
- Files: optional attachments.
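The three automatic eval types can be pictured with a comparison like the following. This is an illustrative sketch, not the engine's actual implementation, which may normalize whitespace or handle edge cases differently:

```ruby
# Illustrative sketch of the automatic eval types; not the engine's
# actual comparison logic.
def eval_passes?(eval_type, actual, expected)
  case eval_type
  when :exact_match then actual == expected
  when :contains    then actual.include?(expected)
  when :regex       then actual.match?(Regexp.new(expected))
  else raise ArgumentError, "#{eval_type} requires human review"
  end
end

eval_passes?(:exact_match, "positive", "positive")                # => true
eval_passes?(:contains, "The sentiment is positive.", "positive") # => true
eval_passes?(:regex, "Score: 87/100", 'Score: \d+')               # => true
```

Human review has no automatic check, which is why its expected output is optional.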
#### Run evaluations

Once you have a prompt with its samples, you can run the evaluations. This enqueues a job that creates a run and executes each sample with the current prompt configuration.

The run saves the current prompt configuration for later analysis: the provider/model, instructions, messages, variables, etc.
#### Analyze the results

You can view the accuracy, cost, and duration of the entire run and of each individual prompt execution.

If you chose the human review eval type, this is when you review whether each eval passed.
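As a rough sketch of how run-level metrics can be derived from per-sample results (the field names here are illustrative, not the engine's schema):

```ruby
# Hypothetical per-sample results; field names are illustrative only.
results = [
  { passed: true,  cost: 0.0004, duration: 1.2 },
  { passed: true,  cost: 0.0005, duration: 0.9 },
  { passed: false, cost: 0.0004, duration: 1.1 }
]

accuracy   = results.count { |r| r[:passed] }.fdiv(results.size)
total_cost = results.sum { |r| r[:cost] }
total_time = results.sum { |r| r[:duration] }

accuracy.round(2)   # => 0.67
total_cost.round(4) # => 0.0013
total_time.round(1) # => 3.2
```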
#### Pinned runs

When you find a run with particularly good results, you can pin it (only one per prompt). This helps you keep track of the best-performing prompt configurations as you iterate, and it also determines how the prompt is configured when you execute it (see below).
### Beyond a typical workflow

#### Using your data to create prompts/samples

Suppose you want to categorize images. You can create a prompt (e.g. image-categorization) and then add your data to the eval set:

```ruby
prompt = RubyLLM::Evals::Prompt.find_by slug: "image-categorization"

Image.where(category: nil).take(50).each do |image|
  sample = prompt.samples.create eval_type: :human_judge
  sample.files.attach image.attachment.blob
end
```

Then you can iterate on the prompt to find the best configuration possible.
#### Using the prompt

Once you've tested and refined your prompt, you can use it in your application code.

Execute prompts by their slug to get a response object with content and metadata. If a pinned run exists for the prompt, it uses the pinned run's configuration (model, provider, temperature, etc.) instead of the prompt's current settings:
```ruby
# Simple execution without variables
response = RubyLLM::Evals::Prompt.execute("image-categorization")
response.content # => "landscape"

# With variables
response = RubyLLM::Evals::Prompt.execute(
  "text-summarization",
  variables: { "text" => "Long article content here..." }
)
response.content # => "Brief summary of the article"

# With file attachments
response = RubyLLM::Evals::Prompt.execute(
  "image-categorization",
  files: [image.attachment.blob]
)
response.content # => "person"

# Access token counts and metadata
response = RubyLLM::Evals::Prompt.execute(
  "sentiment-analysis",
  variables: { "text" => "I love this product!" }
)
response.content # => "positive"
response.input_tokens # => 25
response.output_tokens # => 3
```

You can also execute a prompt directly on a Prompt instance:
```ruby
prompt = RubyLLM::Evals::Prompt.find_by(slug: "sentiment-analysis")
response = prompt.execute(variables: { "text" => "I love this product!" })
response.content # => "positive"
```

#### Building a chat without executing
For more control over the execution flow, you can build a configured RubyLLM::Chat object without immediately calling the LLM. This is useful for:

- Inspecting the configured chat before execution
- Modifying the chat further before completing
- Testing prompt configurations
```ruby
prompt = RubyLLM::Evals::Prompt.find_by(slug: "sentiment-analysis")

# Build the chat with all prompt configuration applied
chat = prompt.to_chat(variables: { "text" => "I love this product!" })

# The chat is configured but not executed yet
chat.messages.count # => 1 (user message)

# Now execute when ready
response = chat.complete
response.content # => "positive"
```

The to_chat method applies all prompt configuration.
## Contributing

You can open an issue or a PR on GitHub.

## License

The gem is available as open source under the terms of the MIT License.


