tokenizer-ruby

Ruby bindings for HuggingFace Tokenizers. Fast, Rust-powered tokenization for any HuggingFace model — GPT-2, BERT, LLaMA, Claude, and more.

Installation

gem install tokenizer-ruby

Or add to your Gemfile:

gem "tokenizer-ruby"

Note: Requires Rust toolchain for compilation. Install via rustup.

Usage

Load a tokenizer

require "tokenizer_ruby"

# From HuggingFace Hub
tokenizer = TokenizerRuby::Tokenizer.from_pretrained("gpt2")
tokenizer = TokenizerRuby::Tokenizer.from_pretrained("bert-base-uncased")

# From a local file
tokenizer = TokenizerRuby::Tokenizer.from_file("/path/to/tokenizer.json")

Encode and decode

encoding = tokenizer.encode("Hello, world!")
encoding.ids            # => [15496, 11, 995, 0]
encoding.tokens         # => ["Hello", ",", " world", "!"]
encoding.offsets        # => [[0, 5], [5, 6], [6, 12], [12, 13]]
encoding.attention_mask # => [1, 1, 1, 1]
encoding.length         # => 4

tokenizer.decode([15496, 11, 995, 0])  # => "Hello, world!"
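The offsets reported by encode map each token back to a character span in the original string, so you can recover the exact source substrings. A minimal sketch in plain Ruby, using the values from the example above:

```ruby
# Recover each token's source substring from its (start, end) offsets.
# Values are taken from the encode example above.
text    = "Hello, world!"
offsets = [[0, 5], [5, 6], [6, 12], [12, 13]]

substrings = offsets.map { |start, finish| text[start...finish] }
# => ["Hello", ",", " world", "!"]
```

Note that offsets are half-open ranges (the end index is exclusive), which is why the exclusive-range slice `start...finish` is used.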

Batch processing

encodings = tokenizer.encode_batch(["Hello", "World"])
decoded = tokenizer.decode_batch(encodings.map(&:ids))
# => ["Hello", "World"]

Token counting

tokenizer.count("Hello, world!")  # => 4
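Token counting is handy for budgeting a model's context window. As an illustrative sketch (the `chunk_by_tokens` helper is not part of the gem, and the word-count lambda below merely stands in for `tokenizer.method(:count)`), you might split text so each chunk stays within a token limit:

```ruby
# Split text into chunks that each fit within a token budget.
# `counter` is any callable returning a token count; with this gem you
# would pass tokenizer.method(:count) instead of the stand-in below.
def chunk_by_tokens(text, max_tokens:, counter:)
  chunks  = []
  current = []
  # Split after each whitespace run so the pieces rejoin losslessly.
  text.split(/(?<=\s)/).each do |piece|
    if counter.call((current + [piece]).join) > max_tokens && !current.empty?
      chunks << current.join
      current = [piece]
    else
      current << piece
    end
  end
  chunks << current.join unless current.empty?
  chunks
end

word_counter = ->(s) { s.split.size } # stand-in for tokenizer.count
chunks = chunk_by_tokens("one two three four five",
                         max_tokens: 2, counter: word_counter)
# => ["one two ", "three four ", "five"]
```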

Truncation

# Truncate text to a token limit
tokenizer.truncate("This is a long sentence...", max_tokens: 5)

# Enable automatic truncation on all encodes
tokenizer.enable_truncation(max_length: 512)

Padding

tokenizer.enable_padding(length: 128, pad_token: "[PAD]")
encoding = tokenizer.encode("Hello")
encoding.ids.length     # => 128
encoding.attention_mask # => [1, 0, 0, 0, ...]
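The attention mask marks which positions hold real tokens (1) and which are padding (0). A small sketch with illustrative values (not actual gem output) showing how to filter the padding back out:

```ruby
# Illustrative ids and mask for a padded encoding.
ids  = [15496, 995, 0, 0]
mask = [1, 1, 0, 0]

# Keep only the ids whose mask bit is 1, i.e. the real tokens.
real_ids = ids.zip(mask).select { |_id, m| m == 1 }.map(&:first)
# => [15496, 995]
```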

Vocabulary

tokenizer.vocab_size           # => 50257
tokenizer.token_to_id("hello") # => 31373
tokenizer.id_to_token(31373)   # => "hello"

Requirements

  • Ruby >= 3.1
  • Rust toolchain (for building from source)

Development

bundle install
bundle exec rake compile
bundle exec rake test

License

MIT

Author

Johannes Dwi Cahyo — @johannesdwicahyo