mini_embed
A minimal, dependency‑free C extension for Ruby that loads GGUF embedding models and computes text embeddings locally.
⚠️ Important: This gem is intended for small projects, prototypes, and hobbyist use. It allows you to experiment with embeddings without relying on external APIs or cloud costs. Do not use MiniEmbed in production – it lacks the performance, scalability, and tokenization robustness of dedicated solutions. For real applications, use a proper inference server like llama.cpp with its HTTP API, or managed services such as OpenAI, Cohere, or Hugging Face.
Why MiniEmbed?
- Zero external dependencies – no TensorFlow, PyTorch, or ONNX runtime.
- Single‑file C extension – fast loading and local sentence embeddings.
- Runs BERT/GTE embedding models end to end – token embeddings, transformer layers, mean pooling, and L2 normalization.
- Supports the GGML tensor types declared by this extension – F32, F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, and K-quants through Q8_K.
- Works entirely offline – your data never leaves your machine.
- Perfect for weekend projects, proof‑of‑concepts, or learning about embeddings.
Installation
Add this line to your application's Gemfile:
```ruby
gem 'mini_embed'
```

Then execute:
```
bundle install
```

Or install it globally:
```
gem install mini_embed
```

Requirements:
- A POSIX system (Linux, macOS, BSD) – Windows via WSL2 works.
- A C compiler and make (for compiling the native extension).
- A GGUF embedding model file (see Where to get models).
Usage
```ruby
require 'mini_embed'

# Load a GGUF embedding model
model = MiniEmbed.new(model: '/path/to/gte-small.Q4_0.gguf')

# Get embedding as an array of floats (default)
embedding = model.embeddings(text: 'hello world')
puts embedding.size  # e.g. 384
puts embedding[0..4] # e.g. [0.0123, -0.0456, ...]

# Or get the raw binary string (little-endian 32-bit floats)
binary = model.embeddings(text: 'hello world', type: :binary)
embedding_from_binary = binary.unpack('e*')
```

Note: The type parameter is optional – it defaults to :vector, which returns a Ruby Array<Float>. Use type: :binary to get the raw binary string (compatible with the original C extension).
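Once you have embedding vectors, you can compare them directly. Below is a minimal sketch of cosine similarity between two sentences; the helper method and example sentences are ours, not part of the gem's API:

```ruby
require 'mini_embed'

model = MiniEmbed.new(model: '/path/to/gte-small.Q4_0.gguf')

# Cosine similarity between two embedding vectors (Array<Float>).
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

v1 = model.embeddings(text: 'How do I reset my password?')
v2 = model.embeddings(text: 'Steps to change a forgotten password')

puts cosine_similarity(v1, v2) # closer to 1.0 means more similar
```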
You can also request L2 normalization for the fallback token-averaging path:
```ruby
model = MiniEmbed.new(model: '/path/to/model.gguf', normalize: :l2)
```

For supported BERT/GTE GGUF models, MiniEmbed already returns L2-normalized sentence embeddings to match llama.cpp embedding output.
Tokenization And Model Support
For BERT/GTE-style GGUF embedding models, MiniEmbed uses the model's WordPiece vocabulary, adds CLS/SEP tokens, runs the transformer stack, mean-pools the sequence output, and L2-normalizes the result.
For non-BERT GGUF models, MiniEmbed falls back to pre-tokenization plus vocabulary/BPE lookup and averages token embedding rows. That fallback is useful for simple experiments, but it is not equivalent to running a full transformer model.
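To make the last two steps concrete, here is an illustrative Ruby sketch of mean pooling followed by L2 normalization over a list of token vectors. It mirrors what the C extension does internally but is not its actual code:

```ruby
# Mean-pool a sequence of token vectors (one Array<Float> per token),
# then L2-normalize the pooled vector – the pooling/normalization steps
# described above, sketched in Ruby for illustration only.
def mean_pool_and_normalize(token_vectors)
  dim    = token_vectors.first.size
  pooled = Array.new(dim, 0.0)

  token_vectors.each do |vec|
    dim.times { |i| pooled[i] += vec[i] }
  end
  pooled.map! { |x| x / token_vectors.size }

  norm = Math.sqrt(pooled.sum { |x| x * x })
  norm.zero? ? pooled : pooled.map { |x| x / norm }
end
```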
If you need a model/tokenizer family that is not covered by the current C path, you can:
- Pre‑tokenize in Ruby using the tokenizers gem and pass token IDs (not yet exposed in the C API, but easy to add).
- Run llama.cpp as a server and call its embeddings endpoint.
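For the llama.cpp server route, a rough Ruby sketch using Net::HTTP is below. It assumes the OpenAI-compatible /v1/embeddings endpoint on localhost:8080; older llama.cpp builds expose /embedding with a {"content": ...} payload instead, so check your server version:

```ruby
require 'net/http'
require 'json'

# Call a locally running llama.cpp server started with --embeddings.
# Endpoint path and response shape may differ between llama.cpp versions.
uri = URI('http://localhost:8080/v1/embeddings')

response = Net::HTTP.post(
  uri,
  { input: 'hello world' }.to_json,
  'Content-Type' => 'application/json'
)

data = JSON.parse(response.body)
embedding = data.dig('data', 0, 'embedding')
puts embedding&.size
```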
Supported Quantization Types
| GGML type ID | Format |
|---|---|
| 0 | F32 (float32) |
| 1 | F16 (float16) |
| 2 | Q4_0 |
| 3 | Q4_1 |
| 6 | Q5_0 |
| 7 | Q5_1 |
| 8 | Q8_0 |
| 9 | Q8_1 |
| 10 | Q2_K |
| 11 | Q3_K |
| 12 | Q4_K |
| 13 | Q5_K |
| 14 | Q6_K |
| 15 | Q8_K |
The extension validates tensor row alignment while loading the GGUF file and dequantizes rows as they are used. Q4_0 linear layers use a ggml-style optimized dot-product path; other quantized linear layers fall back to dequantized float rows.
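For reference, the standard ggml Q4_0 layout packs 32 weights into an 18-byte block: a float16 scale followed by 16 bytes of packed 4-bit quants. Below is a rough Ruby sketch of decoding one such block; it is illustrative only, and the extension's internal C code may differ:

```ruby
# Convert a raw IEEE 754 half-precision bit pattern to a Ruby Float.
def f16_to_f32(bits)
  sign = (bits >> 15) & 0x1
  exp  = (bits >> 10) & 0x1F
  frac = bits & 0x3FF
  value =
    if exp.zero? then frac * 2.0**-24 # subnormal
    elsif exp == 0x1F then frac.zero? ? Float::INFINITY : Float::NAN
    else (1.0 + frac / 1024.0) * 2.0**(exp - 15)
    end
  sign == 1 ? -value : value
end

# Decode one Q4_0 block (18 bytes): float16 scale d, then 16 bytes holding
# 32 signed 4-bit quants. Each value is (q - 8) * d.
def dequantize_q4_0_block(block)
  d  = f16_to_f32(block.unpack1('v'))        # little-endian uint16 scale
  qs = block.byteslice(2, 16).bytes
  low  = qs.map { |b| ((b & 0x0F) - 8) * d } # first 16 values
  high = qs.map { |b| ((b >> 4)  - 8) * d }  # last 16 values
  low + high
end
```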
MiniEmbed supports the tensor types listed above. It does not currently implement newer llama.cpp formats that are not declared in this extension, such as IQ, MXFP, or NVFP variants.
Where to get models
Hugging Face offers many GGUF models, e.g.:
- gte-small
- all‑MiniLM‑L6‑v2
You can convert any safetensors or PyTorch model using the convert‑hf‑to‑gguf.py script from llama.cpp.
For testing, we recommend the gte-small model (384 dimensions, ~30k vocabulary).
Limitations (Why this is not production‑ready)
- Single‑threaded, blocking C code – embedding computation runs on the Ruby thread, freezing the interpreter.
- No batching – only one text at a time.
- BERT/GTE support is intentionally narrow and only covers the tensor/tokenizer shapes implemented in the C extension.
- Model files are memory-mapped and tensor rows are dequantized on demand, but large GGUF files still consume address space and memory bandwidth.
- No GPU support – CPU only.
- Error handling is minimal – invalid models may crash the Ruby process.
If you need a robust, scalable solution, consider:
- Running llama.cpp as a server (./server -m model.gguf --embeddings) and calling its HTTP endpoint.
- Using a cloud embeddings API (OpenAI, Cohere, VoyageAI, etc.).
- Deploying a dedicated inference service with BentoML or Ray Serve.
Development & Contributing
Bug reports and pull requests are welcome on GitHub. To run the tests:
```
bundle exec rspec
```

The gem uses rake-compiler to build the extension. After making changes to the C source, run:
```
bundle exec rake compile
```

License
MIT License. See LICENSE.