
# Rllama
Ruby bindings for llama.cpp to run open-source language models locally. Run models like GPT-OSS, Qwen 3, Gemma 3, Llama 3, and many others directly in your Ruby application code.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'rllama'
```
And then execute:
```bash
bundle install
```
Or install it yourself as:
```bash
gem install rllama
```
## CLI Chat
The `rllama` command-line utility provides an interactive chat interface for conversing with language models. After installing the gem, you can start chatting immediately:

```bash
rllama
```
When you run `rllama` without arguments, it will display:
- **Downloaded models**: Any models you've already downloaded to `~/.rllama/models/`
- **Popular models**: A curated list of popular models available for download, including:
  - Gemma 3 1B
  - Llama 3.2 3B
  - Phi-4
  - Qwen3 30B
  - GPT-OSS
Simply enter the number of the model you want to use. If you select a model that hasn't been downloaded yet, it will be automatically downloaded from Hugging Face.
You can also specify a model path or URL directly:
```bash
rllama path/to/your/model.gguf
rllama https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-Q3_K_S.gguf
```
Once the model has loaded, you can start chatting.
## Usage
### Text Generation
Generate text completions using local language models:
```ruby
require 'rllama'

# Load a model
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# Generate text
result = model.generate('What is the capital of France?')
puts result.text
# => "The capital of France is Paris."

# Access generation statistics
puts "Tokens generated: #{result.stats[:tokens_generated]}"
puts "Tokens per second: #{result.stats[:tps]}"
puts "Duration: #{result.stats[:duration]} seconds"

# Don't forget to close the model when done
model.close
```
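A loaded model holds native llama.cpp resources, so it's worth guaranteeing the `close` call even when generation raises. A minimal sketch using plain Ruby `ensure` with the same API (the prompt is illustrative):

```ruby
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

begin
  puts model.generate('Summarize the plot of Hamlet in one sentence.').text
ensure
  # Runs even if generate raises, so the native model is always freed
  model.close
end
```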
#### Generation parameters
Adjust generation behavior with sampling parameters:
```ruby
result = model.generate(
  'Write a short poem about Ruby programming',
  max_tokens: 2024,
  temperature: 0.8,
  top_k: 40,
  top_p: 0.95,
  min_p: 0.05
)
```
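Roughly: `max_tokens` caps the response length, `temperature` scales sampling randomness, and `top_k`, `top_p`, and `min_p` trim the candidate token pool before sampling, matching the standard llama.cpp sampler settings of the same names.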
#### Streaming generation
Stream generated text token-by-token:
```ruby
model.generate('Explain quantum computing') do |token|
  print token
end
```
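The block receives each token as it is generated, so you can accumulate the full response while displaying it. A minimal sketch using only the API above:

```ruby
buffer = +''

model.generate('Explain quantum computing') do |token|
  buffer << token # keep the full response
  print token     # while showing it as it streams
end

puts "\n\n#{buffer.length} characters streamed"
```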
#### System prompt
Include a system prompt to guide model behavior:
```ruby
result = model.generate(
  'What are best practices for Ruby development?',
  system: 'You are an expert Ruby developer with 10 years of experience.'
)
```
#### Messages list
Pass multiple messages with roles for more complex interactions:
```ruby
result = model.generate([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'What is its population?' }
])

puts result.text
```
### Chat
For ongoing conversations, use a context object that maintains the conversation history:
```ruby
# Initialize a chat context
context = model.init_context

# Send messages and maintain conversation history
response1 = context.message('What is the capital of France?')
puts response1.text
# => "The capital of France is Paris."

response2 = context.message('What is the population of that city?')
puts response2.text
# => "Paris has a population of approximately 2.1 million people..."

# The context remembers all previous messages in the conversation
response3 = context.message('What was my first message?')
puts response3.text
# => "Your first message was asking about the capital of France."

# Close the context when done
context.close
```
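The `init_context`/`message` API above is enough to build a tiny terminal chat loop. A minimal sketch (the prompt formatting is illustrative):

```ruby
context = model.init_context

loop do
  print 'You: '
  input = gets&.strip
  break if input.nil? || input.empty?

  puts context.message(input).text
end

context.close
```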
### Embeddings
Generate vector embeddings for text using embedding models:
```ruby
require 'rllama'

# Load an embedding model
model = Rllama.load_model('lmstudio-community/embeddinggemma-300m-qat-GGUF/embeddinggemma-300m-qat-Q4_0.gguf')

# Generate an embedding for a single text
embedding = model.embed('Hello, world!')

puts embedding.length
# => 768 (depending on your model)

# Generate embeddings for multiple sentences
embeddings = model.embed([
  'roses are red',
  'violets are blue',
  'sugar is sweet'
])

puts embeddings.length
# => 3

puts embeddings[0].length
# => 768

model.close
```
#### Vector parameters
By default, embedding vectors are normalized. You can disable normalization with `normalize: false`:
```ruby
# Generate unnormalized embeddings
embedding = model.embed('Sample text', normalize: false)
```
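Because normalized vectors have unit length, the dot product of two embeddings equals their cosine similarity. That makes a simple semantic search easy to sketch with the `embed` API above (the `dot` helper and sample texts are illustrative):

```ruby
# Score candidate sentences against a query by cosine similarity.
# With normalized embeddings (the default), a plain dot product suffices.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

candidates = ['roses are red', 'violets are blue', 'sugar is sweet']

query_vec = model.embed('red flowers')
candidate_vecs = model.embed(candidates)

candidates.zip(candidate_vecs)
          .map { |text, vec| [dot(query_vec, vec), text] }
          .sort_by { |score, _| -score }
          .each { |score, text| puts format('%.3f  %s', score, text) }
```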
## Finding Models
You can download GGUF format models from various sources:
- [Hugging Face](https://huggingface.co) - search for models in GGUF format
## License
MIT
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/docusealco/rllama.