Tokenizers Ruby

🙂 Fast state-of-the-art tokenizers for Ruby

Installation

Add this line to your application’s Gemfile:

gem "tokenizers"

Note: Rust and pkg-config are currently required for installation, and it can take 5-10 minutes to compile the extension.
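Since the build needs Rust and pkg-config, it can help to confirm both are on the `PATH` before starting a multi-minute compile. The following is a minimal sketch, not part of the gem: `missing_build_tools` is a hypothetical helper, and it assumes a Unix-like system where the Rust toolchain is exposed through the `cargo` executable.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Returns the names of the given tools that cannot be found on PATH.
# Assumption: the Rust toolchain is detected via its `cargo` executable.
def missing_build_tools(tools = %w[cargo pkg-config], path: ENV.fetch("PATH", ""))
  dirs = path.split(File::PATH_SEPARATOR)
  tools.reject do |tool|
    dirs.any? do |dir|
      candidate = File.join(dir, tool)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

# Warn early instead of failing partway through the extension build.
missing = missing_build_tools
warn "Missing build tools: #{missing.join(', ')}" unless missing.empty?
```

If the warning appears, install the missing tools before running `bundle install`.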

Getting Started

Load a pretrained tokenizer

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode

encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.ids
encoded.tokens
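
`ids` and `tokens` are parallel arrays, so they can be paired up to see which id each token received. A small sketch, assuming only that the encoding object responds to `tokens` and `ids` as above; `token_id_pairs` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Pairs each token with its id; assumes `encoded` responds to
# `tokens` and `ids`, as the encoding in this README does.
def token_id_pairs(encoded)
  encoded.tokens.zip(encoded.ids)
end

# Usage with the gem (not run here, because loading a pretrained
# tokenizer downloads files):
#   encoded = tokenizer.encode("I can feel the magic, can you?")
#   token_id_pairs(encoded).each { |token, id| puts "#{id}\t#{token}" }
```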

Decode

tokenizer.decode(encoded.ids)
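
The ids passed to `decode` are typically the ones returned by `encode`. The sketch below chains the two calls into a round trip; `round_trip` is a hypothetical helper, not part of the gem, and it relies only on the `encode`, `ids`, and `decode` calls shown in this README. Depending on the tokenizer, the decoded text may not match the input character for character.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Encodes the text, then decodes the resulting ids back into a string.
# Assumes `tokenizer` responds to `encode` and `decode` as shown above.
def round_trip(tokenizer, text)
  encoded = tokenizer.encode(text)
  tokenizer.decode(encoded.ids)
end

# Usage with the gem (not run here, because loading a pretrained
# tokenizer downloads files):
#   tokenizer = Tokenizers.from_pretrained("bert-base-cased")
#   puts round_trip(tokenizer, "I can feel the magic, can you?")
```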

Load a tokenizer from files

tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")
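
When loading from files, a wrong path is an easy mistake to make, so it can be worth checking the files first. This is a minimal sketch: `ensure_files_exist!` is a hypothetical helper, not part of the gem, and how the gem itself reports a missing file is not covered here.

```ruby
# Hypothetical helper (not part of the tokenizers gem).
# Raises a descriptive error if any given path is not a regular file;
# otherwise returns the paths unchanged.
def ensure_files_exist!(*paths)
  missing = paths.reject { |path| File.file?(path) }
  unless missing.empty?
    raise ArgumentError, "Missing tokenizer file(s): #{missing.join(', ')}"
  end
  paths
end

# Usage with the gem (not run here, because it needs real vocabulary files):
#   vocab, merges = ensure_files_exist!("vocab.json", "merges.txt")
#   tokenizer = Tokenizers::CharBPETokenizer.new(vocab, merges)
```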

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec ruby ext/tokenizers/extconf.rb && make && make install
bundle exec rake download:files
bundle exec rake test