# SemanticCache
Semantic caching for LLM API calls. Save 70%+ on costs.
Cache LLM responses using semantic similarity matching. Similar questions return cached answers instantly, cutting API costs dramatically.
```ruby
cache = SemanticCache.new

# First call — hits the API
response = cache.fetch("What's the capital of France?") do
  openai.chat(messages: [{ role: "user", content: "What's the capital of France?" }])
end

# Second call — semantically similar, returns the cached response instantly
response = cache.fetch("What is France's capital city?") do
  openai.chat(messages: [{ role: "user", content: "What is France's capital city?" }])
end
# => CACHE HIT! No API call.
```

## Installation
Add to your Gemfile:
gem "semantic-cache"Then:
```bash
bundle install
```

Or install directly:
```bash
gem install semantic-cache
```

## Quick Start
require "semantic_cache"
# Configure (or set OPENAI_API_KEY env var)
SemanticCache.configure do |c|
c.openai_api_key = "sk-..."
c.similarity_threshold = 0.85 # How similar queries must be to match (0.0-1.0)
end
cache = SemanticCache.new
response = cache.fetch("What is Ruby?", model: "gpt-4o") do
openai.chat(parameters: {
model: "gpt-4o",
messages: [{ role: "user", content: "What is Ruby?" }]
})
end
# Check stats
puts cache.current_stats
# => { hits: 0, misses: 1, hit_rate: 0.0, savings: "$0.00", ... }How It Works
- Your query is converted to an embedding vector via the configured embedding adapter (OpenAI or RubyLLM)
- The cache searches stored entries for high cosine similarity to the query (see the sketch after this list)
- If a match exceeds the threshold (default 0.85), the cached response is returned
- If no match, the block executes, and the result is cached for future queries
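
To make the matching step concrete, here is a standalone sketch of cosine similarity between two embedding vectors. It is illustrative only; the gem's internal implementation may differ.

```ruby
# Illustrative sketch (not the gem's internal code): cosine similarity,
# dot(a, b) / (|a| * |b|). 1.0 means same direction; 0.0 means unrelated.
def cosine_similarity(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0 (would clear the 0.85 threshold)
cosine_similarity([1.0, 0.0], [0.0, 1.0]) # => 0.0 (cache miss)
```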
## Configuration
```ruby
SemanticCache.configure do |c|
  # Similarity threshold (0.0 to 1.0). Higher = stricter matching.
  c.similarity_threshold = 0.85

  # Embedding adapter: :openai (default) or :ruby_llm
  c.embedding_adapter = :openai

  # Embedding model (used by the selected adapter)
  c.embedding_model = "text-embedding-3-small"

  # OpenAI API key (required for the :openai adapter)
  c.openai_api_key = ENV["OPENAI_API_KEY"]

  # Default TTL for cached entries (nil = no expiry)
  c.default_ttl = 3600 # 1 hour

  # Cache store: :memory or :redis
  c.store = :memory
  c.store_options = {} # passed to Redis.new if store is :redis

  # Cost tracking
  c.track_costs = true

  # Timeout for embedding API calls in seconds (default: 30, nil = no timeout)
  c.embedding_timeout = 30

  # Maximum number of entries in the cache (default: nil = unlimited).
  # When exceeded, the oldest entry is evicted automatically.
  c.max_cache_size = 10_000
end
```

## Embedding Adapters
SemanticCache supports multiple embedding providers. Choose the adapter that fits your stack.
### OpenAI (default)
Uses the `ruby-openai` gem. Requires an OpenAI API key.
```ruby
SemanticCache.configure do |c|
  c.embedding_adapter = :openai
  c.embedding_model = "text-embedding-3-small"
  c.openai_api_key = ENV["OPENAI_API_KEY"]
end
```

### RubyLLM
Uses the `ruby_llm` gem. Supports all embedding providers that RubyLLM supports: OpenAI, Gemini, Mistral, Ollama, Bedrock, and more — with a single adapter and no OpenAI dependency if you don't need it.
Add the gem:
```ruby
# Gemfile
gem "ruby_llm"
```

Configure SemanticCache to use the RubyLLM adapter:
```ruby
SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model = "text-embedding-3-small" # or any model your RubyLLM provider supports
end
```

Then configure your embedding provider via RubyLLM. Here are examples for popular providers:
```ruby
# config/initializers/ruby_llm.rb (Rails) or anywhere before using SemanticCache

# OpenAI
RubyLLM.configure do |config|
  config.openai_api_key = ENV["OPENAI_API_KEY"]
end

# Google Gemini
RubyLLM.configure do |config|
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
end

# Mistral
RubyLLM.configure do |config|
  config.mistral_api_key = ENV["MISTRAL_API_KEY"]
end

# Ollama (local)
RubyLLM.configure do |config|
  config.ollama_api_base = "http://localhost:11434/v1"
end

# AWS Bedrock (uses the AWS credential chain by default)
RubyLLM.configure do |config|
  config.bedrock_region = "us-east-1"
  # Optional: explicit credentials
  # config.bedrock_api_key = ENV["AWS_ACCESS_KEY_ID"]
  # config.bedrock_secret_key = ENV["AWS_SECRET_ACCESS_KEY"]
end
```

Supported embedding models (varies by provider):
- OpenAI: `text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002`
- Gemini: `text-embedding-004`
- Mistral: `mistral-embed`
- Ollama: any model with embedding support (e.g., `nomic-embed-text`, `mxbai-embed-large`)
- Bedrock: `amazon.titan-embed-text-v1`, `cohere.embed-english-v3`, etc.
If the `ruby_llm` gem is not installed, using `embedding_adapter = :ruby_llm` raises a `SemanticCache::ConfigurationError` with instructions to add the gem.
## Cache Stores
### In-Memory (default)
Thread-safe, no dependencies. Good for development and single-process apps.
```ruby
cache = SemanticCache.new(store: :memory)
```

### Redis
For production, multi-process, and distributed apps. Requires the `redis` gem.
gem "redis"cache = SemanticCache.new(
store: :redis,
store_options: { url: "redis://localhost:6379/0" }
)Custom Store
Any object that responds to `write`, `entries`, `delete`, `invalidate_by_tags`, `clear`, and `size`:
```ruby
cache = SemanticCache.new(store: MyCustomStore.new)
```
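
As a starting point, a minimal in-memory implementation might look like the sketch below. The method signatures are assumptions inferred from the method names above, not documented API; check the gem's built-in stores for the exact contract.

```ruby
# A minimal sketch of a custom store. Signatures and the entry shape
# (a Hash with a :tags key) are assumptions, not the gem's documented API.
class MyCustomStore
  def initialize
    @data = {}
  end

  # Assumed: persist a cache entry (embedding, response, tags, ...) under a key.
  def write(key, entry)
    @data[key] = entry
  end

  # Assumed: return all stored entries so the cache can scan for similar ones.
  def entries
    @data.values
  end

  def delete(key)
    @data.delete(key)
  end

  # Assumed: remove every entry carrying any of the given tags.
  def invalidate_by_tags(tags)
    @data.delete_if { |_key, entry| (Array(entry[:tags]) & tags).any? }
  end

  def clear
    @data.clear
  end

  def size
    @data.size
  end
end
```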
### Cache Size Limits

Both stores support a `max_size` option. When the cache is full, the oldest entry (by creation time) is evicted automatically:
```ruby
# Via constructor
cache = SemanticCache.new(max_size: 5_000)

# Via global configuration
SemanticCache.configure do |c|
  c.max_cache_size = 10_000
end
```

When `max_size` is `nil` (the default), the cache grows without limit.
## TTL & Tag-Based Invalidation
```ruby
# TTL — auto-expires after 1 hour
cache.fetch("Latest news?", ttl: 3600) do
  fetch_news
end

# Tags — group related entries for bulk invalidation
cache.fetch("Ruby version?", tags: [:ruby, :versions]) do
  "3.3.0"
end

cache.fetch("Best framework?", tags: [:ruby, :frameworks]) do
  "Rails"
end

# Invalidate all entries tagged :versions
cache.invalidate(tags: [:versions])
```

## Multi-Model Support
Convenience methods for different LLM providers:
```ruby
cache.fetch_openai("query", model: "gpt-4o") do
  openai.chat(...)
end

cache.fetch_anthropic("query", model: "claude-sonnet-4-20250514") do
  anthropic.messages(...)
end

cache.fetch_gemini("query", model: "gemini-pro") do
  gemini.generate(...)
end
```

## Client Wrapper
Wrap an existing OpenAI client to cache all chat calls automatically:
require "openai"
client = OpenAI::Client.new(access_token: "sk-...")
cached_client = SemanticCache.wrap(client)
# All chat calls are now cached
response = cached_client.chat(parameters: {
model: "gpt-4o",
messages: [{ role: "user", content: "What is Ruby?" }]
})
# Access cache stats
cached_client.semantic_cache.current_stats
# Other methods are delegated to the original client
cached_client.models # => calls client.models directlyCost Tracking & Stats
```ruby
cache = SemanticCache.new

# After some usage...
cache.current_stats
# => {
#   hits: 156,
#   misses: 44,
#   total_queries: 200,
#   hit_rate: 78.0,
#   savings: "$23.45",
#   ...
# }

puts cache.detailed_stats
# Total queries: 200
# Cache hits: 156
# Cache misses: 44
# Hit rate: 78.0%
# Total savings: $23.45

puts cache.savings_report
# Total saved: $23.45 (156 cached calls)
```

Custom model costs:
```ruby
SemanticCache.configure do |c|
  c.model_costs["my-custom-model"] = { input: 0.01, output: 0.03 }
end
```

## Rails Integration
```ruby
# Gemfile
gem "semantic-cache"
```

```ruby
# config/initializers/semantic_cache.rb
require "semantic_cache/rails"

SemanticCache.configure do |c|
  c.openai_api_key = Rails.application.credentials.openai_api_key
  c.store = :redis
  c.store_options = { url: ENV["REDIS_URL"] }
end
```

### Using the Concern
```ruby
class ChatController < ApplicationController
  include SemanticCache::Cacheable

  cache_ai_calls only: [:create], ttl: 1.hour

  def create
    response = SemanticCache.current.fetch(params[:message], model: "gpt-4o") do
      openai_client.chat(parameters: {
        model: "gpt-4o",
        messages: [{ role: "user", content: params[:message] }]
      })
    end
    render json: { response: response }
  end
end
```

### Per-User Namespacing
```ruby
class ApplicationController < ActionController::Base
  around_action :with_semantic_cache

  private

  def with_semantic_cache
    SemanticCache.with_cache(namespace: "user_#{current_user.id}") do
      yield
    end
  end
end
```

## Input Validation
Queries are validated before any API call is made. Passing `nil`, `""`, or whitespace-only strings raises an `ArgumentError` immediately:
```ruby
cache.fetch(nil) { ... }  # => ArgumentError: query cannot be nil
cache.fetch("") { ... }   # => ArgumentError: query cannot be blank
cache.fetch(" ") { ... }  # => ArgumentError: query cannot be blank
```

## Embedding Timeout
Embedding API calls are wrapped in a configurable timeout to prevent hanging threads:
```ruby
SemanticCache.configure do |c|
  c.embedding_timeout = 10 # seconds (default: 30)
end
```

If the timeout is exceeded, a `SemanticCache::Error` is raised. Set to `nil` to disable the timeout.
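
To degrade gracefully when an embedding call times out, you can rescue the error and fall back to an uncached call. A sketch, where `call_llm` is a hypothetical stand-in for your own API call:

```ruby
begin
  response = cache.fetch("What is Ruby?") { call_llm }
rescue SemanticCache::Error
  # Embedding timed out (or another cache-level failure): skip the cache
  # for this request and call the API directly. call_llm is hypothetical.
  response = call_llm
end
```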
## Demo
Run the built-in demo (no API key needed):
```bash
ruby examples/demo.rb --simulate
```

Or with a real API key:
```bash
OPENAI_API_KEY=sk-... ruby examples/demo.rb
```

## Development
```bash
bundle install
bundle exec rspec
```

## License
MIT License. See LICENSE.