# RubyLLM::SemanticCache
Semantic caching for RubyLLM. Cache responses based on meaning, not exact strings.
"What's the capital of France?" → Cache MISS, call LLM
"What is France's capital?" → Cache HIT (92% similar)
Embedding models cost ~1000x less than chat models, so every cache hit saves money.
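Conceptually, a lookup embeds the incoming query and compares it against the embeddings of previously cached queries; anything at or above the similarity threshold counts as a hit. Here is a rough, self-contained sketch of that decision (illustrative only, not the gem's internal code):

```ruby
# Cosine similarity between two embedding vectors
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings
cached = [0.12, 0.87, 0.48] # "What's the capital of France?"
query  = [0.10, 0.88, 0.46] # "What is France's capital?"

cosine_similarity(cached, query) >= 0.92 ? :cache_hit : :cache_miss
```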
## Installation
Add the gem to your Gemfile:

```ruby
gem 'ruby_llm-semantic_cache'
```

## Quick Start
```ruby
# Wrap any RubyLLM chat - caching is automatic
chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat(model: "gpt-5.2"))
chat.ask("What is Ruby?") # Calls API, caches response
# New conversation, same question = cache hit
chat2 = RubyLLM::SemanticCache.wrap(RubyLLM.chat(model: "gpt-5.2"))
chat2.ask("What is Ruby?") # Returns cached response instantlyOr use the fetch API for one-off queries:
response = RubyLLM::SemanticCache.fetch("What is Ruby?") do
  RubyLLM.chat.ask("What is Ruby?")
end
```

## How Caching Works
By default, only the first message of each conversation is cached. Follow-up messages go directly to the LLM because they depend on conversation context.
```ruby
chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat)
chat.ask("What is Ruby?") # Cached
chat.ask("Who created it?") # NOT cached (context-dependent)Cache keys include: model + system prompt + message. Different models or instructions = separate cache entries.
## Configuration

```ruby
RubyLLM::SemanticCache.configure do |config|
  # Storage (default: :memory; use :redis for production)
  config.vector_store = :redis
  config.cache_store = :redis
  config.redis_url = ENV["REDIS_URL"]

  # Similarity threshold: 0.92 recommended; higher = stricter matching
  config.similarity_threshold = 0.92

  # Cache expiration in seconds (default: nil = never expire)
  config.ttl = 24 * 60 * 60

  # Embedding model
  config.embedding_model = "text-embedding-3-small"
  config.embedding_dimensions = 1536
end
```

## Wrapper Options

```ruby
RubyLLM::SemanticCache.wrap(chat,
  threshold: 0.95,          # Override similarity threshold
  ttl: 3600,                # Override TTL (seconds)
  max_messages: :unlimited, # Cache all messages, not just the first (default: 1)
                            # Also accepts: false (same as :unlimited), or an Integer limit
  on_cache_hit: ->(chat, msg, resp) { log("Cache hit!") }
)
```
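The `on_cache_hit` callback is a convenient hook for metrics. A small usage sketch (the counter here is purely illustrative):

```ruby
hit_count = 0

chat = RubyLLM::SemanticCache.wrap(
  RubyLLM.chat,
  on_cache_hit: ->(cached_chat, message, response) { hit_count += 1 }
)

chat.ask("What is Ruby?") # increments hit_count only when served from the cache
```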
## Multi-Turn Caching

To cache entire conversation flows (not just first messages):

```ruby
chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat, max_messages: :unlimited)
# Conversation 1
chat.ask("What is Ruby?")
chat.ask("Who created it?")
# Conversation 2 - identical flow hits cache
chat2 = RubyLLM::SemanticCache.wrap(RubyLLM.chat, max_messages: :unlimited)
chat2.ask("What is Ruby?") # Cache HIT
chat2.ask("Who created it?") # Cache HIT (same context)Rails Integration
# config/initializers/semantic_cache.rb
RubyLLM::SemanticCache.configure do |config|
  config.vector_store = :redis
  config.cache_store = :redis
  config.redis_url = ENV["REDIS_URL"]
  config.namespace = Rails.env
end
```
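In an app you might route FAQ-style questions through the cache with `fetch`. A hypothetical controller sketch: the controller and params are assumptions, not part of the gem, and it assumes the response exposes `.content` as RubyLLM messages do:

```ruby
class AnswersController < ApplicationController
  def create
    # Serve semantically similar questions from the cache; call the LLM otherwise
    answer = RubyLLM::SemanticCache.fetch(params[:question]) do
      RubyLLM.chat.ask(params[:question])
    end

    render json: { answer: answer.content }
  end
end
```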
## Additional APIs

```ruby
# Manual store
RubyLLM::SemanticCache.store(query: "What is Ruby?", response: message)
# Search similar
RubyLLM::SemanticCache.search("Tell me about Ruby", limit: 5)
# Check/delete
RubyLLM::SemanticCache.exists?("What is Ruby?")
RubyLLM::SemanticCache.delete("What is Ruby?")
# Stats
RubyLLM::SemanticCache.stats # => { hits: 150, misses: 20, hit_rate: 0.88 }
# Scoped caches (for multi-tenant)
support = RubyLLM::SemanticCache::Scoped.new(namespace: "support")
sales   = RubyLLM::SemanticCache::Scoped.new(namespace: "sales")
```
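Assuming a `Scoped` instance exposes the same `fetch` interface as the top-level module (an assumption, not confirmed above), per-tenant usage might look like:

```ruby
support = RubyLLM::SemanticCache::Scoped.new(namespace: "support")

# Assumption: Scoped instances respond to #fetch like RubyLLM::SemanticCache
answer = support.fetch("How do I reset my password?") do
  RubyLLM.chat.ask("How do I reset my password?")
end
```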
## Requirements

- Ruby >= 2.7
- RubyLLM >= 1.0
- Redis 8+ with neighbor-redis (for production)
## Roadmap
- Basic semantic caching
- Configurable similarity threshold
- Multi-turn caching
- Redis vector store
- Advanced eviction policies
- Web dashboard for cache stats?
- Support for more vector stores?
## License
MIT