A failure-aware, contract-driven Ruby client for the Ollama API. Provides deterministic /generate with strict JSON schema validation, automatic model pulling, exponential backoff on timeouts, and observer-style streaming hooks. Designed for Rails background jobs and agent planners — not a chatbot UI.

Ollama::Client

The production-safe Ruby AI SDK for Ollama.

Not a chatbot UI. Not a 1:1 API wrapper. A failure-aware, contract-driven client that covers all 12 Ollama API endpoints with production guarantees.

Correctness. Determinism. Failure-aware design. Nothing else.

ollama-client is purpose-built for Rails AI features, background jobs, CLIs, agents, autonomous systems, workflow engines, RAG pipelines, structured-output consumers, MCP servers, and evaluation systems. If you use Ollama from Ruby, this is the foundation layer.

Why This Gem Exists

Other Ollama clients give you raw HTTP access. This one gives you production guarantees:

| What goes wrong | What other gems do | What ollama-client does |
| --- | --- | --- |
| Model isn't downloaded | Raise error | Auto-pull → retry |
| Ollama server is down | Hang for 60s | Fast-fail instantly |
| LLM returns broken JSON | Crash your parser | Repair prompt → retry |
| Request times out | Raise immediately | Exponential backoff |
| Schema violation | You find out in prod | SchemaViolationError before it reaches your code |

Installation

gem "ollama-client"

Quick Start

Works out of the box — all defaults are production-safe:

require "ollama_client"

client = Ollama::Client.new
# Defaults: model: "llama3.2:3b", timeout: 30, retries: 2, strict_json: true

Ollama Cloud Multi-Key Failover

For hosted models on https://ollama.com, configure either one API key or a comma-separated key pool. OLLAMA_API_KEYS takes precedence over OLLAMA_API_KEY; when a cloud request receives HTTP 429, ollama-client transparently retries the same request with the next configured key. If every key is rate-limited, the client waits with exponential backoff (2 ** attempt) and retries the pool until config.retries is exhausted, then raises Ollama::RateLimitExhaustedError.

OLLAMA_BASE_URL=https://ollama.com
OLLAMA_API_KEYS=key_abc123,key_xyz789
ENABLE_MULTI_KEY_CONCURRENCY=false # set true to round-robin initial keys across concurrent threads

config = Ollama::Config.new
config.base_url = "https://ollama.com"
config.api_keys = ENV["OLLAMA_API_KEYS"] # accepts comma-separated strings or arrays
config.enable_multi_key_concurrency = true

client = Ollama::Client.new(config: config)
client.chat(messages: [{ role: "user", content: "Hello" }], model: "gpt-oss:120b-cloud")

For Sidekiq or other highly concurrent agent loops, keep configuration immutable after boot and instantiate clients with per-client Ollama::Config objects rather than mutating OllamaClient.configure at runtime.
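
The failover order described in this section (next key on HTTP 429, then backoff once the whole pool is throttled) can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation: `with_key_failover`, `RateLimited`, and `RateLimitExhausted` are hypothetical names, and counting attempts from 0 is an assumption about how `2 ** attempt` is applied.

```ruby
# Hypothetical sketch: none of these names come from ollama-client itself.
class RateLimited < StandardError; end
class RateLimitExhausted < StandardError; end

# Tries each key in order and moves to the next key on a rate-limit error.
# When every key is throttled, waits 2 ** attempt seconds and tries the
# pool again, for up to `retries` extra rounds (attempt starts at 0 here,
# which is an assumption).
def with_key_failover(keys, retries:, sleeper: ->(seconds) { sleep(seconds) })
  (0..retries).each do |attempt|
    keys.each do |key|
      begin
        return yield(key)
      rescue RateLimited
        next # this key is throttled, try the next one
      end
    end
    sleeper.call(2**attempt) if attempt < retries
  end
  raise RateLimitExhausted, "all #{keys.size} keys rate-limited after #{retries} retries"
end

# Usage: the first key is always throttled, the second one succeeds.
waits = []
result = with_key_failover(%w[key_abc123 key_xyz789], retries: 2, sleeper: ->(s) { waits << s }) do |key|
  raise RateLimited if key == "key_abc123"
  "ok via #{key}"
end
puts result
```

Injecting the `sleeper` keeps the sketch testable without real waiting; a production client would leave the default in place.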

Chat (Multi-turn Conversations)

The primary endpoint for agentic usage:

response = client.chat(
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is Ruby?" }
  ]
)

response.message.content  # => "Ruby is a dynamic, open source..."
response.message.role     # => "assistant"
response.done?            # => true
response.done_reason      # => "stop"
response.total_duration   # => 1234567 (nanoseconds)

Tool Calling

messages = [{ role: "user", content: "What is the weather in London?" }]

tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"]
      }
    }
  }
]

response = client.chat(messages: messages, tools: tools)
response.message.tool_calls.first.name       # => "get_weather"
response.message.tool_calls.first.arguments  # => { "city" => "London" }
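
Once the model returns tool calls, the application still has to execute them. The sketch below shows one possible way to dispatch them to Ruby callables. `ToolCall`, `TOOL_HANDLERS`, and `run_tool_calls` are hypothetical names; the struct only mimics the `name` and `arguments` readers shown above, and the `role: "tool"` reply shape is an assumption about what gets appended to the conversation.

```ruby
# Hypothetical stand-in for one entry of response.message.tool_calls.
ToolCall = Struct.new(:name, :arguments)

# Registry mapping tool names to plain Ruby callables (stubbed here).
TOOL_HANDLERS = {
  "get_weather" => ->(args) { "Weather in #{args.fetch('city')}: unknown (stub)" }
}.freeze

# Runs each tool call and returns messages that could be appended to the
# conversation before the next chat request.
def run_tool_calls(tool_calls, handlers)
  tool_calls.map do |call|
    handler = handlers.fetch(call.name) { raise ArgumentError, "unknown tool: #{call.name}" }
    { role: "tool", content: handler.call(call.arguments).to_s }
  end
end

calls = [ToolCall.new("get_weather", { "city" => "London" })]
tool_messages = run_tool_calls(calls, TOOL_HANDLERS)
puts tool_messages.first[:content]
```

Failing loudly on an unknown tool name, rather than skipping it, matches the typed-error stance of the rest of this README.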

Structured Output (JSON Schema)

messages = [{ role: "user", content: "What is the capital of France? Answer in JSON." }]
schema = { type: "object", properties: { answer: { type: "string" } } }

response = client.chat(messages: messages, format: schema)
JSON.parse(response.message.content)  # => { "answer" => "Paris" }

Thinking Mode

Note: Requires a thinking-capable model (e.g. deepseek-r1, qwen3:0.6b).

messages = [{ role: "user", content: "What is the square root of 144?" }]

response = client.chat(messages: messages, model: "qwen3:0.6b", think: true)
response.message.thinking  # => "Let me reason through this..."
response.message.content   # => "The answer is 12."

Chat Options

Per-request overrides are passed alongside messages:

messages = [{ role: "user", content: "Hello" }]

client.chat(
  messages: messages,
  model: "qwen2.5-coder:7b",     # Override default model
  options: { temperature: 0.8 }, # Runtime options
  keep_alive: "10m",             # Keep model loaded
  logprobs: true,              # Return log probabilities
  top_logprobs: 5
)

Generate (Prompt → Completion)

client.generate(prompt: "Explain Ruby blocks in one sentence.")
# => "Ruby blocks are anonymous closures passed to methods..."

Structured JSON (Agents / Planners)

schema = {
  "type" => "object",
  "required" => ["action", "confidence"],
  "properties" => {
    "action" => { "type" => "string", "enum" => ["search", "calculate", "finish"] },
    "confidence" => { "type" => "number" }
  }
}

result = client.generate(prompt: "User wants weather in Paris.", schema: schema)
result["action"]     # => "search"
result["confidence"] # => 0.95

If the LLM returns invalid JSON, the client automatically retries with a repair prompt. You get valid output or a typed exception — never a silent failure.
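
The repair-and-retry behaviour can be pictured as a small loop. The following is a simplified sketch, not the gem's code: `generate_with_repair`, `violations`, `InvalidJSON`, and `SchemaViolation` are hypothetical names, the repair prompt wording is invented, and the validator checks only `required` and `enum`.

```ruby
require "json"

class InvalidJSON < StandardError; end      # hypothetical stand-in
class SchemaViolation < StandardError; end  # hypothetical stand-in

# Checks only "required" keys and "enum" values; a real validator does more.
def violations(data, schema)
  errors = []
  schema.fetch("required", []).each do |key|
    errors << "missing #{key}" unless data.key?(key)
  end
  schema.fetch("properties", {}).each do |key, rules|
    allowed = rules["enum"]
    errors << "#{key} not in enum" if allowed && data.key?(key) && !allowed.include?(data[key])
  end
  errors
end

# `llm` is any callable that takes a prompt and returns raw text.
def generate_with_repair(llm, prompt, schema, retries: 2)
  current_prompt = prompt
  last_error = nil
  (retries + 1).times do
    raw = llm.call(current_prompt)
    begin
      data = JSON.parse(raw)
    rescue JSON::ParserError => e
      last_error = InvalidJSON.new(e.message)
      current_prompt = "#{prompt}\n\nYour previous reply was not valid JSON. Reply with JSON only."
      next
    end
    problems = data.is_a?(Hash) ? violations(data, schema) : ["not an object"]
    return data if problems.empty?
    last_error = SchemaViolation.new(problems.join(", "))
    current_prompt = "#{prompt}\n\nYour previous reply violated the schema: #{problems.join(', ')}."
  end
  raise last_error # typed failure once retries are exhausted
end

schema = {
  "required" => ["action", "confidence"],
  "properties" => { "action" => { "enum" => ["search", "calculate", "finish"] } }
}
replies = ["not json", '{"action":"search","confidence":0.95}']
fake_llm = ->(_prompt) { replies.shift } # scripted model: bad reply, then good reply
result = generate_with_repair(fake_llm, "User wants weather in Paris.", schema)
puts result["action"]
```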

Structured Thinking (Zero-Magic CoT extraction)

You can ask reasoning models to output their thoughts separately from the final answer. ollama-client enforces this via strict JSON schema prompting.

Note: Requires a thinking model. Supported defaults: /deepseek/i, /qwen/i, /r1/i.

schema = {
  "type" => "object",
  "required" => ["decision"],
  "properties" => {
    "decision" => { "type" => "string" }
  }
}

result = client.generate(
  model: "deepseek-r1",
  prompt: "Should we BUY or WAIT?",
  schema: schema,
  think: true,
  return_reasoning: true
)

result["reasoning"]          # => "...step by step analysis..."
result["final"]["decision"]  # => "WAIT"

Generate Options

client.generate(
  prompt: "Write a poem",
  model: "qwen3:0.6b",               # Explicitly use a thinking model
  system: "You are a poet",          # System prompt
  think: true,                       # Thinking output
  keep_alive: "5m",                  # Keep model loaded
  options: { temperature: 0.8 }      # Runtime options
)

Streaming (Observer Hooks)

No raw SSE. No state corruption risk. Works with both chat and generate:

# Stream generate tokens
client.generate(
  prompt: "Write a haiku about code.",
  hooks: {
    on_token:    ->(token) { print token },
    on_error:    ->(err)   { warn err.message },
    on_complete: ->        { puts "\nDone" }
  }
)

# Stream chat tokens with log probabilities
client.chat(
  messages: [{ role: "user", content: "Tell me a story" }],
  logprobs: true,
  hooks: {
    # If your block takes 2 args, it receives the logprobs array for that token
    on_token: ->(token, logprobs) {
      print token
      # logprobs is an Array of Hashes, e.g. [{"token"=>"Once", "logprob"=>-0.12}, ...]
    },
    on_complete: -> { puts }
  }
)
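
The comment about two-argument blocks suggests the hooks are dispatched by arity. One way such dispatch could work is sketched below. `emit_stream` and the chunk hash layout are hypothetical and stand in for whatever the gem does internally.

```ruby
# Hypothetical sketch of observer-style dispatch; not the gem's internals.
def emit_stream(chunks, hooks)
  on_token = hooks[:on_token]
  chunks.each do |chunk|
    next unless on_token
    # Lambdas are strict about argument count, so check arity first.
    if on_token.arity == 2
      on_token.call(chunk[:token], chunk[:logprobs])
    else
      on_token.call(chunk[:token])
    end
  end
  hooks[:on_complete]&.call
rescue StandardError => e
  raise unless hooks[:on_error] # no error hook registered: re-raise
  hooks[:on_error].call(e)
end

chunks = [
  { token: "Once", logprobs: [{ "token" => "Once", "logprob" => -0.12 }] },
  { token: " upon", logprobs: [{ "token" => " upon", "logprob" => -0.30 }] }
]

# Two-argument hook receives the token and its logprobs.
seen = []
emit_stream(chunks, { on_token: ->(token, logprobs) { seen << [token, logprobs.first["logprob"]] } })

# One-argument hook receives only the token.
text = +""
emit_stream(chunks, { on_token: ->(token) { text << token }, on_complete: -> { text << "!" } })
puts text
```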

Embeddings (RAG)

client.embeddings.embed(model: "nomic-embed-text:latest", input: "What is Ruby?")
# => [0.12, -0.05, 0.88, ...]

# Batch embeddings
client.embeddings.embed(model: "nomic-embed-text:latest", input: ["text1", "text2"])

# With options
client.embeddings.embed(
  model: "nomic-embed-text:latest",
  input: "text",
  truncate: true,        # Truncate long inputs
  dimensions: 256,       # Embedding dimensions
  keep_alive: "5m"       # Keep model loaded
)
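
In a RAG pipeline the returned vectors are typically compared by cosine similarity. The sketch below shows that step in isolation, using made-up three-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_matches` are hypothetical helpers and not part of the gem.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Returns the k document texts whose vectors are closest to the query vector.
def top_matches(query_vec, docs, k: 1)
  docs.max_by(k) { |_text, vec| cosine_similarity(query_vec, vec) }.map(&:first)
end

# Toy vectors standing in for embeddings returned by the embed call.
docs = {
  "Ruby is a programming language" => [0.9, 0.1, 0.0],
  "Paris is the capital of France" => [0.0, 0.2, 0.9]
}
query = [0.8, 0.2, 0.1]
best = top_matches(query, docs)
puts best.first
```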

Model Management

client.list_models              # Returns models with details & automatic capabilities map
# => [{ "name" => "llama3.1", "capabilities" => { "tools" => true, "thinking" => false, ... }, ... }]
client.list_model_names         # Just names: ["qwen2.5-coder:7b", "llama3.2:3b", ...]
client.list_running             # Currently loaded models (aliased as `ps`)
client.show_model(model: "qwen2.5-coder:7b")           # Model details, capabilities
client.show_model(model: "qwen2.5-coder:7b", verbose: true)  # Include model_info
client.pull("llama3.2:3b")                      # Download a model
client.delete_model(model: "old-model")      # Remove a model
client.copy_model(source: "qwen2.5-coder:7b", destination: "qwen2.5-coder:7b-backup")
client.create_model(model: "my-model", from: "qwen2.5-coder:7b", system: "You are Alpaca")
client.push_model(model: "user/my-model")    # Push to registry
client.version                               # => "0.12.6"

Runtime Options

Pass via options: on chat or generate:

messages = [{ role: "user", content: "Tell me a joke" }]

options = Ollama::Options.new(
  temperature: 0.7,
  num_predict: 256,
  stop: ["END"],
  presence_penalty: 0.5,
  frequency_penalty: -0.3
)

client.chat(messages: messages, options: options.to_h)

All supported options

| Option | Type | Description |
| --- | --- | --- |
| temperature | Float (0–2) | Sampling temperature |
| top_p | Float (0–1) | Nucleus sampling |
| top_k | Integer | Top-K sampling |
| num_ctx | Integer | Context window size |
| num_predict | Integer | Max tokens to generate |
| repeat_penalty | Float (0–2) | Repeat penalty |
| seed | Integer | Random seed |
| stop | Array | Stop sequences |
| tfs_z | Float | Tail-free sampling |
| mirostat | 0/1/2 | Mirostat sampling mode |
| mirostat_tau | Float | Mirostat target entropy |
| mirostat_eta | Float | Mirostat learning rate |
| typical_p | Float (0–1) | Typical-p sampling |
| presence_penalty | Float (-2–2) | Presence penalty |
| frequency_penalty | Float (-2–2) | Frequency penalty |
| num_gpu | Integer | GPU layers |
| num_thread | Integer | CPU threads |
| num_keep | Integer | Tokens to keep for context |

CLI

A strict, JSON-first CLI ships with the gem:

# Generate text
ollama-client generate --prompt "Explain Ruby blocks"

# Structured output with schema
echo '{"type":"object","properties":{"category":{"type":"string"}}}' > schema.json
ollama-client generate --prompt "Classify this" --schema schema.json --json

# Stream tokens
ollama-client generate --prompt "Write a poem" --stream

# Embeddings
ollama-client embed --input "What is Ruby?" --model nomic-embed-text:latest

# List models
ollama-client models

# Pull a model
ollama-client pull llama3.2:3b

All errors output as structured JSON to stderr. No hidden behavior.

Examples

The examples/ directory contains working scripts for common patterns:

Cloud Model Accessibility Probe

If you use Ollama Cloud, this script lists all cloud models and probes each one to determine whether your account can run inference against it:

export OLLAMA_API_KEY="your-ollama-cloud-api-key"
bundle exec ruby examples/cloud_models.rb

Output is a sorted JSON array:

[
  { "name": "gpt-oss:20b", "accessible": true, "reason": null },
  { "name": "deepseek-v4-pro", "accessible": false, "reason": "plan_restricted" }
]

See examples/README.md for the full list of examples and reason codes.

Console (Debug Mode)

bin/console
verbose!  # Enable HTTP request/response logging
quiet!    # Disable it

client = Ollama::Client.new
client.version  # Prints full HTTP request/response to STDERR

Failure Behaviors

| Scenario | What happens |
| --- | --- |
| Model missing (404) | Auto-pull → retry your request |
| Server unreachable | Instant Ollama::Error, no waiting |
| Timeout | Exponential backoff (2^attempt seconds) |
| Invalid JSON | Repair prompt → retry → InvalidJSONError if exhausted |
| Schema violation | Repair prompt → retry → SchemaViolationError if exhausted |
| Streaming error | StreamError raised with Ollama's error message |
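
The timeout row can be illustrated with a small retry helper. This is a sketch under assumptions rather than the gem's code: `with_backoff` is a hypothetical name, the standard library's `Timeout::Error` stands in for the gem's own timeout error, and counting attempts from 0 is an assumption.

```ruby
require "timeout"

# Retries the block on Timeout::Error, sleeping 2 ** attempt seconds
# between tries. Re-raises once `retries` extra attempts are used up.
def with_backoff(retries:, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield(attempt)
  rescue Timeout::Error
    raise if attempt >= retries

    sleeper.call(2**attempt)
    attempt += 1
    retry
  end
end

# Usage: the first two attempts time out, the third succeeds.
delays = []
value = with_backoff(retries: 2, sleeper: ->(s) { delays << s }) do |attempt|
  raise Timeout::Error, "simulated timeout" if attempt < 2

  "succeeded on attempt #{attempt}"
end
puts value
```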

v1.0 Stability Contract

The public API is locked. See API_CONTRACT.md for the full specification.

  1. All method signatures are stable until v2.0
  2. Error class hierarchy is stable until v2.0
  3. Recovery behaviors (auto-pull, backoff, repair) are guaranteed
  4. No silent coercion of malformed JSON — ever
  5. Typed errors over generic exceptions — always

Testing

# Unit + lint
bundle exec rake

# Integration (requires running Ollama)
OLLAMA_INTEGRATION=1 bundle exec rspec spec/integration/

License

MIT. See LICENSE.txt.

OpenAI-Compatible Facade (Optional Extension)

OpenAI compatibility is intentionally isolated from the core runtime. Load it explicitly when needed:

require "ollama_client"
require "ollama/openai"

client = Ollama::Client.new

client.openai.models.list
client.openai.chat.completions.create(
  model: "qwen2.5-coder:7b",
  messages: [{ role: "user", content: "hello" }]
)
client.openai.completions.create(model: "llama3.2:3b", prompt: "Write one line")
client.openai.embeddings.create(model: "nomic-embed-text", input: "ruby")

Raw Endpoint Escape Hatch (New)

Access unsupported and future endpoints without waiting for wrapper updates:

client = Ollama::Client.new

client.raw.post("/api/chat", payload: {
  model: "llama3.2:3b",
  messages: [{ role: "user", content: "hello" }],
  stream: false
})

Transport Adapter (Foundation)

The client now resolves HTTP through a transport adapter boundary. Default remains Net::HTTP, and the API is forward-compatible with future adapters.

config = Ollama::Config.new
config.transport_adapter = :net_http
client = Ollama::Client.new(config: config)

Transport internals now normalize responses through a transport response object (status, headers, body, duration_ms) to support future adapters and observability. A stream transport contract (transport.stream) is also defined as the next expansion point.

Mock transport (testing)

For deterministic tests without a live Ollama server:

config = Ollama::Config.new
config.transport_adapter = :mock
client = Ollama::Client.new(config: config)

transport = client.instance_variable_get(:@transport)
transport.enqueue(status: 200, body: '{"version":"0.0.0-test"}')
client.version # => "0.0.0-test"
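
For readers who want the idea behind a queue-backed mock, the sketch below shows a standalone version. `QueueMockTransport`, `TransportResponse`, and the `request` signature are hypothetical; only the `enqueue(status:, body:)` call and the four response fields (status, headers, body, duration_ms) are taken from this README.

```ruby
require "json"

# Hypothetical response object carrying the four fields named above.
TransportResponse = Struct.new(:status, :headers, :body, :duration_ms, keyword_init: true)

# Hands back queued responses in FIFO order and records every request.
class QueueMockTransport
  attr_reader :requests

  def initialize
    @queue = []
    @requests = []
  end

  def enqueue(status:, body:, headers: {})
    @queue << TransportResponse.new(status: status, headers: headers, body: body, duration_ms: 0)
    self
  end

  # The request signature is an assumption made for this sketch.
  def request(method, path, payload: nil)
    @requests << { method: method, path: path, payload: payload }
    @queue.shift || raise("no mock response queued for #{method.to_s.upcase} #{path}")
  end
end

transport = QueueMockTransport.new
transport.enqueue(status: 200, body: '{"version":"0.0.0-test"}')
response = transport.request(:get, "/api/version")
version = JSON.parse(response.body).fetch("version")
puts version
```

Raising when the queue is empty makes a test fail on any request it did not explicitly script.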

Error taxonomy foundation

The runtime now includes explicit typed transport and runtime errors, such as UnauthorizedError, ModelUnavailableError, ConnectionFailedError, and MalformedResponseError, for safer retry and policy layering.
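
Typed errors make it possible to layer a retry policy on top of the client. The sketch below illustrates that idea only. The classes live in a hypothetical `SketchErrors` module, the shared base class is an assumption about the hierarchy, and the policy mapping is illustrative rather than the gem's documented behaviour.

```ruby
# Hypothetical mirror of the error names listed above; the real classes
# live in the gem and may have a different hierarchy.
module SketchErrors
  class Error < StandardError; end
  class UnauthorizedError < Error; end
  class ModelUnavailableError < Error; end
  class ConnectionFailedError < Error; end
  class MalformedResponseError < Error; end
end

# Illustrative policy table: which recovery action each error class gets.
RETRY_POLICY = {
  SketchErrors::UnauthorizedError => :fail_fast,
  SketchErrors::ModelUnavailableError => :pull_then_retry,
  SketchErrors::ConnectionFailedError => :fail_fast,
  SketchErrors::MalformedResponseError => :retry
}.freeze

# Returns the action for an error, or :raise for anything unrecognised.
def policy_for(error)
  RETRY_POLICY.each { |klass, action| return action if error.is_a?(klass) }
  :raise
end

puts policy_for(SketchErrors::ModelUnavailableError.new)
```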