
RubyLLM::SemanticCache

Semantic caching for RubyLLM. Cache responses based on meaning, not exact strings.

"What's the capital of France?" → Cache MISS, call LLM
"What is France's capital?"     → Cache HIT (92% similar)

Embedding models cost ~1000x less than chat models, so every cache hit saves money.
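
For example, at the 88% hit rate shown in the stats example further down, only the misses reach the chat model and the embedding lookups that replace the hits cost on the order of a thousandth as much, so total spend drops to roughly 12% of the uncached cost.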

Installation

Add to your Gemfile and run bundle install:

gem 'ruby_llm-semantic_cache'

Quick Start

# Wrap any RubyLLM chat - caching is automatic
chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat(model: "gpt-5.2"))
chat.ask("What is Ruby?")  # Calls API, caches response

# New conversation, same question = cache hit
chat2 = RubyLLM::SemanticCache.wrap(RubyLLM.chat(model: "gpt-5.2"))
chat2.ask("What is Ruby?")  # Returns cached response instantly

Or use the fetch API for one-off queries:

response = RubyLLM::SemanticCache.fetch("What is Ruby?") do
  RubyLLM.chat.ask("What is Ruby?")
end

How Caching Works

By default, only the first message of each conversation is cached. Follow-up messages go directly to the LLM because they depend on conversation context.

chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat)
chat.ask("What is Ruby?")     # Cached
chat.ask("Who created it?")   # NOT cached (context-dependent)

Cache keys include: model + system prompt + message. Different models or instructions = separate cache entries.
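
As a sketch of what that means in practice (using RubyLLM's with_instructions to set a system prompt), the same question asked under different instructions is stored and looked up separately:

# Same question under different system prompts -> separate cache entries
pirate = RubyLLM::SemanticCache.wrap(
  RubyLLM.chat.with_instructions("Answer like a pirate")
)
plain = RubyLLM::SemanticCache.wrap(RubyLLM.chat)

pirate.ask("What is Ruby?")  # MISS - cached under the pirate system prompt
plain.ask("What is Ruby?")   # MISS - no system prompt, so a separate entry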

Configuration

RubyLLM::SemanticCache.configure do |config|
  # Storage (default: :memory, use :redis for production)
  config.vector_store = :redis
  config.cache_store = :redis
  config.redis_url = ENV["REDIS_URL"]

  # Similarity threshold: 0.92 = recommended, higher = stricter
  config.similarity_threshold = 0.92

  # Cache expiration in seconds (default: nil = never expire)
  config.ttl = 24 * 60 * 60  # 24 hours

  # Embedding model
  config.embedding_model = "text-embedding-3-small"
  config.embedding_dimensions = 1536
end

Wrapper Options

RubyLLM::SemanticCache.wrap(chat,
  threshold: 0.95,         # Override similarity threshold
  ttl: 3600,               # Override TTL (seconds)
  max_messages: :unlimited, # Cache all messages, not just the first (default: 1)
  # Also accepts false (same as :unlimited) or an Integer for a custom limit
  on_cache_hit: ->(chat, msg, resp) { log("Cache hit!") }
)
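
For example, a minimal sketch wiring on_cache_hit to a logger (assuming the callback arguments are the wrapped chat, the matched query, and the cached response, as in the signature above):

require "logger"
logger = Logger.new($stdout)

chat = RubyLLM::SemanticCache.wrap(
  RubyLLM.chat,
  threshold: 0.95,
  on_cache_hit: ->(_chat, msg, _resp) { logger.info("semantic cache hit: #{msg}") }
)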

Multi-Turn Caching

To cache entire conversation flows (not just first messages):

chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat, max_messages: :unlimited)

# Conversation 1
chat.ask("What is Ruby?")
chat.ask("Who created it?")

# Conversation 2 - identical flow hits cache
chat2 = RubyLLM::SemanticCache.wrap(RubyLLM.chat, max_messages: :unlimited)
chat2.ask("What is Ruby?")    # Cache HIT
chat2.ask("Who created it?")  # Cache HIT (same context)

Rails Integration

# config/initializers/semantic_cache.rb
RubyLLM::SemanticCache.configure do |config|
  config.vector_store = :redis
  config.cache_store = :redis
  config.redis_url = ENV["REDIS_URL"]
  config.namespace = Rails.env
end
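
One way to call the wrapped chat from application code, as a rough sketch (the service class and method names are illustrative, not part of the gem):

# app/services/faq_answerer.rb (illustrative)
class FaqAnswerer
  def answer(question)
    chat = RubyLLM::SemanticCache.wrap(RubyLLM.chat)
    chat.ask(question)  # semantically similar FAQs hit the cache
  end
end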

Additional APIs

# Manual store
RubyLLM::SemanticCache.store(query: "What is Ruby?", response: message)  # message: an existing RubyLLM response

# Search similar
RubyLLM::SemanticCache.search("Tell me about Ruby", limit: 5)

# Check/delete
RubyLLM::SemanticCache.exists?("What is Ruby?")
RubyLLM::SemanticCache.delete("What is Ruby?")

# Stats
RubyLLM::SemanticCache.stats  # => { hits: 150, misses: 20, hit_rate: 0.88 }

# Scoped caches (for multi-tenant)
support = RubyLLM::SemanticCache::Scoped.new(namespace: "support")
sales = RubyLLM::SemanticCache::Scoped.new(namespace: "sales")
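
For monitoring, the stats hash can be surfaced however you like; for example, a sketch of a rake task (the task itself is illustrative, not shipped with the gem):

# lib/tasks/semantic_cache.rake (illustrative)
namespace :semantic_cache do
  desc "Print semantic cache hit rate"
  task stats: :environment do
    s = RubyLLM::SemanticCache.stats
    puts "hits=#{s[:hits]} misses=#{s[:misses]} hit_rate=#{(s[:hit_rate] * 100).round(1)}%"
  end
end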

Requirements

Roadmap

  • Basic semantic caching
  • Configurable similarity threshold
  • Multi-turn caching
  • Redis vector store
  • Advanced eviction policies
  • Web dashboard for cache stats?
  • Support for more vector stores?

License

MIT