# RubyLLM::Evals

Test, compare, and improve your LLM prompts within your Rails application.
## Installation

> [!NOTE]
> This engine relies on ActiveJob, ActiveStorage, and RubyLLM. Make sure you have them installed and configured.

Add this line to your application's Gemfile:

```ruby
gem "ruby_llm-evals"
```

And then execute:

```shell
$ bundle
```

To copy and migrate RubyLLM::Evals's migrations, run:

```shell
$ rails ruby_llm_evals:install:migrations db:migrate
```
And then mount the engine in your config/routes.rb:

```ruby
Rails.application.routes.draw do
  # ...
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

Now you can browse to /evals and create, test, compare, and improve your LLM prompts. Read on to see what a typical workflow looks like, and how you can leverage your app's data to add samples to your prompts.
## Authentication and authorization

RubyLLM::Evals leaves authentication and authorization to you. If no authentication is enforced, /evals will be available to everyone.

To enforce authentication, you can use route constraints or set up an HTTP Basic auth middleware.
For example, if you're using Devise, you can do this:

```ruby
# config/routes.rb
authenticate :user do
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

See more examples here.
However, if you're using Rails' default authentication generator, or an authentication solution that doesn't provide constraints, you need to roll your own:

```ruby
# config/routes.rb
constraints ->(request) { Constraints::Auth.authenticated?(request) } do
  mount RubyLLM::Evals::Engine, at: "/evals"
end
```

```ruby
# lib/constraints/auth.rb
class Constraints::Auth
  def self.authenticated?(request)
    cookies = ActionDispatch::Cookies::CookieJar.build(request, request.cookies)
    Session.find_by id: cookies.signed[:session_id]
  end
end
```

You can also set up an HTTP Basic auth middleware in the engine:
```ruby
# config/initializers/ruby_llm-evals.rb
RubyLLM::Evals::Engine.middleware.use(Rack::Auth::Basic) do |username, password|
  ActiveSupport::SecurityUtils.secure_compare(Rails.application.credentials.ruby_llm_evals_username, username) &
    ActiveSupport::SecurityUtils.secure_compare(Rails.application.credentials.ruby_llm_evals_password, password)
end
```

## Usage

### Workflow
A typical workflow looks like this:
#### Create a prompt

A prompt represents an LLM prompt template with:

- Provider: see available providers.
- Model: see available models. If you select a local provider (e.g. Ollama), you can enter the model name in a text field.
- Instructions: optional, the system prompt.
- Message: the message template.
- Temperature: optional, controls randomness (0.0 to 1.0). Lower values make output more focused and deterministic.
- Params: optional, additional provider-specific parameters as JSON (e.g., `{"max_tokens": 1000}`).
- Tools: optional, an array of tool class names that the LLM can use (e.g., `["Weather", "Calculator"]`). See how tools are defined in RubyLLM.
- Schema: optional, a Ruby class name (e.g., `User`) to structure the LLM's response, or "other" to provide a custom JSON schema in the Schema Other field. See RubyLLM structured output.
- Thinking effort: optional, controls the reasoning effort level for models that support thinking (e.g., `low`, `medium`, `high`). See how thinking works in RubyLLM.
- Thinking budget: optional, sets a maximum token budget for thinking/reasoning.
Both the instructions and the message template can contain Liquid tags that are rendered at runtime. To add variables, enclose them in double braces, e.g. {{name}}.
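As a sketch of how those placeholders behave, the stand-in below substitutes {{name}}-style variables with a regex. The engine uses Liquid for the real rendering, so filters, conditionals, and other Liquid features aren't covered by this illustration:

```ruby
# Minimal illustration of {{variable}} substitution. The engine renders
# templates with Liquid; this regex stand-in only shows the idea.
def render_template(template, variables)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { variables.fetch(Regexp.last_match(1), "") }
end

message = "Hello {{ name }}, please summarize: {{text}}"
render_template(message, "name" => "Patricio", "text" => "this article")
# => "Hello Patricio, please summarize: this article"
```

Missing variables render as empty strings here; real Liquid behavior depends on your template options.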
> [!NOTE]
> To use a provider, you must have it configured in config/initializers/ruby_llm.rb as explained here.
#### Add samples

When creating or editing a prompt you can add samples, where you define:

- Variables: JSON containing the values to use when executing the prompt, e.g. `{ "name": "Patricio" }`.
- Eval type: the evaluation criteria: exact match, contains, regex, or human review.
- Expected output: optional if the eval type is human review.
- Files: optional attachments.
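The three automatic eval types can be pictured with a comparison like the following. This is an illustrative sketch, not the engine's actual implementation, which may normalize whitespace or handle edge cases differently:

```ruby
# Illustrative sketch of the automatic eval types; not the engine's
# actual comparison logic.
def eval_passes?(eval_type, actual, expected)
  case eval_type
  when :exact_match then actual == expected
  when :contains    then actual.include?(expected)
  when :regex       then actual.match?(Regexp.new(expected))
  else raise ArgumentError, "#{eval_type} requires human review"
  end
end

eval_passes?(:exact_match, "positive", "positive")                # => true
eval_passes?(:contains, "The sentiment is positive.", "positive") # => true
eval_passes?(:regex, "Score: 87/100", 'Score: \d+')               # => true
```

Human review has no automatic check, which is why its expected output is optional.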
#### Run evaluations

Once you have a prompt with its samples, you can run the evaluations. This enqueues a job that creates a run and executes each sample with the current prompt configuration.

The run saves the current prompt configuration for later analysis: the provider/model, instructions, messages, variables, etc.
#### Analyze the results

You can view the accuracy, cost, and duration of the entire run and of each individual prompt execution.

If you chose the human review eval type, this is when you review whether each eval passed.
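As a rough sketch of how run-level metrics can be derived from per-sample results (the field names here are illustrative, not the engine's schema):

```ruby
# Hypothetical per-sample results; field names are illustrative only.
results = [
  { passed: true,  cost: 0.0004, duration: 1.2 },
  { passed: true,  cost: 0.0005, duration: 0.9 },
  { passed: false, cost: 0.0004, duration: 1.1 }
]

accuracy   = results.count { |r| r[:passed] }.fdiv(results.size)
total_cost = results.sum { |r| r[:cost] }
total_time = results.sum { |r| r[:duration] }

accuracy.round(2)   # => 0.67
total_cost.round(4) # => 0.0013
total_time.round(1) # => 3.2
```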
#### Pinned runs

When you find a run with particularly good results, you can pin it (only one per prompt). This helps you keep track of the best-performing prompt configurations as you iterate, and it also determines how the prompt is configured when you execute it (see below).
### Beyond a typical workflow

#### Using your data to create prompts/samples

Suppose you want to categorize images. You can create a prompt (e.g. image-categorization) and then add your data to the eval set:

```ruby
prompt = RubyLLM::Evals::Prompt.find_by slug: "image-categorization"

Image.where(category: nil).take(50).each do |image|
  sample = prompt.samples.create eval_type: :human_judge
  sample.files.attach image.attachment.blob
end
```

Then you can iterate on the prompt to find the best configuration possible.
#### Using the prompt

Once you've tested and refined your prompt, you can use it in your application code.

Execute prompts by their slug to get a response object with content and metadata. If a pinned run exists for the prompt, it uses the pinned run's configuration (model, provider, temperature, etc.) instead of the prompt's current settings:
```ruby
# Simple execution without variables
response = RubyLLM::Evals::Prompt.execute("image-categorization")
response.content # => "landscape"

# With variables
response = RubyLLM::Evals::Prompt.execute(
  "text-summarization",
  variables: { "text" => "Long article content here..." }
)
response.content # => "Brief summary of the article"

# With file attachments
response = RubyLLM::Evals::Prompt.execute(
  "image-categorization",
  files: [image.attachment.blob]
)
response.content # => "person"

# Access token counts and metadata
response = RubyLLM::Evals::Prompt.execute(
  "sentiment-analysis",
  variables: { "text" => "I love this product!" }
)
response.content # => "positive"
response.input_tokens # => 25
response.output_tokens # => 3
```

You can also execute a prompt directly on a Prompt instance:
```ruby
prompt = RubyLLM::Evals::Prompt.find_by(slug: "sentiment-analysis")
response = prompt.execute(variables: { "text" => "I love this product!" })
response.content # => "positive"
```

#### Building a chat without executing
For more control over the execution flow, you can build a configured RubyLLM::Chat object without immediately calling the LLM. This is useful for:

- Inspecting the configured chat before execution
- Modifying the chat further before completing
- Testing prompt configurations
```ruby
prompt = RubyLLM::Evals::Prompt.find_by(slug: "sentiment-analysis")

# Build the chat with all prompt configuration applied
chat = prompt.to_chat(variables: { "text" => "I love this product!" })

# The chat is configured but not executed yet
chat.messages.count # => 1 (user message)

# Now execute when ready
response = chat.complete
response.content # => "positive"
```

The to_chat method applies all prompt configuration.
## Contributing

You can open an issue or a PR on GitHub.

## License

The gem is available as open source under the terms of the MIT License.


