Rllama

Ruby bindings for llama.cpp to run open-source language models locally. Run models like GPT-OSS, Qwen 3, Gemma 3, Llama 3, and many others directly in your Ruby application code.

Installation

Add this line to your application's Gemfile:

gem 'rllama'

And then execute:

bundle install

Or install it yourself as:

gem install rllama

CLI Chat

The rllama command-line utility provides an interactive chat interface for conversing with language models. After installing the gem, you can start chatting immediately:

rllama

When you run rllama without arguments, it will display:

  • Downloaded models: Any models you've already downloaded to ~/.rllama/models/
  • Popular models: A curated list of popular models available for download, including:
    • Gemma 3 1B
    • Llama 3.2 3B
    • Phi-4
    • Qwen3 30B
    • GPT-OSS

Simply enter the number of the model you want to use. If you select a model that hasn't been downloaded yet, it will be automatically downloaded from Hugging Face.

You can also specify a model path or URL directly:

rllama path/to/your/model.gguf
rllama https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-Q3_K_S.gguf

Once the model has loaded, you can start chatting.

Usage

Text Generation

Generate text completions using local language models:

require 'rllama'

# Load a model
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')

# Generate text
result = model.generate('What is the capital of France?')
puts result.text
# => "The capital of France is Paris."

# Access generation statistics
puts "Tokens generated: #{result.stats[:tokens_generated]}"
puts "Tokens per second: #{result.stats[:tps]}"
puts "Duration: #{result.stats[:duration]} seconds"

# Don't forget to close the model when done
model.close
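
Since a model holds native llama.cpp resources, you may want to guarantee cleanup even when generation raises. A minimal sketch using only the calls shown above:

model = Rllama.load_model('path/to/your/model.gguf')

begin
  result = model.generate('Hello!')
  puts result.text
ensure
  # Release the native model even if generation raised an error.
  model.close
end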

Generation parameters

Tune the output with sampling parameters:

result = model.generate(
  'Write a short poem about Ruby programming',
  max_tokens: 2024,  # maximum number of tokens to generate
  temperature: 0.8,  # higher values produce more varied output
  top_k: 40,         # sample only from the 40 most likely tokens
  top_p: 0.95,       # nucleus sampling: keep tokens within this probability mass
  min_p: 0.05        # drop tokens below this probability relative to the best token
)

Streaming generation

Stream generated text token-by-token:

model.generate('Explain quantum computing') do |token|
  print token
end
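
Streaming can be combined with the sampling parameters above. A short sketch, assuming generate accepts keyword options alongside a block:

model.generate('Explain quantum computing', max_tokens: 256, temperature: 0.7) do |token|
  print token
end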

System prompt

Include a system prompt to guide the model's behavior:

result = model.generate(
  'What are best practices for Ruby development?',
  system: 'You are an expert Ruby developer with 10 years of experience.'
)

Messages list

Pass multiple messages with roles for more complex interactions:

result = model.generate([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'What is its population?' }
])
puts result.text

Chat

For ongoing conversations, use a context object that maintains the conversation history:

# Initialize a chat context
context = model.init_context

# Send messages and maintain conversation history
response1 = context.message('What is the capital of France?')
puts response1.text
# => "The capital of France is Paris."

response2 = context.message('What is the population of that city?')
puts response2.text
# => "Paris has a population of approximately 2.1 million people..."

response3 = context.message('What was my first message?')
puts response3.text
# => "Your first message was asking about the capital of France."

# The context remembers all previous messages in the conversation

# Close context when done
context.close
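
This makes a terminal chat loop straightforward. A minimal sketch built on the context API above; the input handling is illustrative:

context = model.init_context

loop do
  print '> '
  input = gets&.strip
  break if input.nil? || input.empty?

  # Each reply is generated with the full conversation history in context.
  puts context.message(input).text
end

context.close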

Embeddings

Generate vector embeddings for text using embedding models:

require 'rllama'

# Load an embedding model
model = Rllama.load_model('lmstudio-community/embeddinggemma-300m-qat-GGUF/embeddinggemma-300m-qat-Q4_0.gguf')

# Generate embedding for a single text
embedding = model.embed('Hello, world!')
puts embedding.length
# => 768 (dimensionality depends on the model)

# Generate embeddings for multiple sentences
embeddings = model.embed([
  'roses are red',
  'violets are blue',
  'sugar is sweet'
])

puts embeddings.length
# => 3
puts embeddings[0].length
# => 768

model.close

Normalization

By default, embedding vectors are normalized to unit length. You can disable this by passing normalize: false:

# Generate unnormalized embeddings
embedding = model.embed('Sample text', normalize: false)
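
Because normalized vectors have unit length, the dot product of two embeddings equals their cosine similarity. A pure-Ruby sketch using the embed call shown above:

a, b = model.embed(['roses are red', 'violets are blue'])

# For unit-length vectors, the dot product equals cosine similarity
# (1.0 means identical direction, 0.0 means orthogonal).
similarity = a.zip(b).sum { |x, y| x * y }
puts similarity.round(3)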

Finding Models

You can download models in GGUF format from various sources. Hugging Face hosts the models used in the examples above, such as the lmstudio-community repositories.

License

MIT

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/docusealco/rllama.