Tokenizers Ruby

🙂 Fast state-of-the-art tokenizers for Ruby

Installation

Add this line to your application’s Gemfile:

gem "tokenizers"

Getting Started

Load a pretrained tokenizer

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode

encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids

Decode

tokenizer.decode(encoded.ids)
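
To make the relationship between tokens, ids, and decoding concrete, here is a minimal pure-Ruby sketch that does not use the gem. The vocabulary, the ids, and the `toy_encode` / `toy_decode` helpers are all hypothetical and do not match what `bert-base-cased` produces; a real tokenizer also splits text into subwords and adds special tokens.

```ruby
# Toy vocabulary: tokens and ids are invented for illustration only
# and do not correspond to any pretrained model.
VOCAB = {
  "[UNK]" => 0, "I" => 1, "can" => 2, "feel" => 3, "the" => 4,
  "magic" => 5, "," => 6, "you" => 7, "?" => 8
}.freeze
ID_TO_TOKEN = VOCAB.invert.freeze

# Map each token to its id, falling back to the unknown-token id.
def toy_encode(tokens)
  tokens.map { |token| VOCAB.fetch(token, VOCAB["[UNK]"]) }
end

# Map ids back to tokens and join them with spaces.
def toy_decode(ids)
  ids.map { |id| ID_TO_TOKEN.fetch(id) }.join(" ")
end

tokens = %w[I can feel the magic , can you ?]
ids = toy_encode(tokens)
text = toy_decode(ids)

p ids
puts text
```

The point of the sketch is only that `ids` is an array of integers indexed into a vocabulary, and that decoding is the reverse lookup.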

Training

Create a tokenizer

tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))

Set the pre-tokenizer

tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new
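
For intuition about what this step does, the upstream Tokenizers documentation describes the Whitespace pre-tokenizer as splitting on the pattern `\w+|[^\w\s]+`, so that runs of word characters and runs of punctuation become separate pieces. The sketch below approximates that in plain Ruby without the gem; the `whitespace_pre_tokenize` helper is hypothetical, and because Ruby's `\w` and `\s` are ASCII-only by default, it may differ from the real pre-tokenizer on non-ASCII letters.

```ruby
# Hypothetical helper, not part of the tokenizers gem.
# Matches either a run of word characters or a run of characters
# that are neither word characters nor whitespace.
PRE_TOKEN_PATTERN = /\w+|[^\w\s]+/

def whitespace_pre_tokenize(text)
  text.scan(PRE_TOKEN_PATTERN)
end

pieces = whitespace_pre_tokenize("Hello, y'all! How are you 😁 ?")
p pieces
```

Pre-tokenization only proposes word boundaries; the model (BPE here) then decides how each piece is broken into subword tokens.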

Train the tokenizer (example data)

trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)
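
As background on what a BPE trainer computes, the core idea is to repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. The following is a toy pure-Ruby sketch of that loop under simplifying assumptions (a tiny in-memory word-frequency table, no special tokens, no vocabulary size limit). It is not the gem's implementation, and `pair_counts`, `merge_pair`, and `learn_merges` are hypothetical names.

```ruby
# Toy BPE sketch; none of these names come from the tokenizers gem.

# Count adjacent symbol pairs, weighted by how often each word occurs.
def pair_counts(words)
  counts = Hash.new(0)
  words.each do |symbols, freq|
    symbols.each_cons(2) { |pair| counts[pair] += freq }
  end
  counts
end

# Replace every occurrence of `pair` in each word with the merged symbol.
def merge_pair(words, pair)
  merged = pair.join
  words.to_h do |symbols, freq|
    out = []
    i = 0
    while i < symbols.length
      if symbols[i, 2] == pair
        out << merged
        i += 2
      else
        out << symbols[i]
        i += 1
      end
    end
    [out, freq]
  end
end

# Learn up to `num_merges` merge rules from a word => frequency table.
# Returns the ordered merge rules and the final segmentation of each word.
def learn_merges(word_freqs, num_merges)
  words = word_freqs.to_h { |word, freq| [word.chars, freq] }
  merges = []
  num_merges.times do
    counts = pair_counts(words)
    break if counts.empty?
    best = counts.max_by { |_pair, count| count }.first
    merges << best
    words = merge_pair(words, best)
  end
  [merges, words.keys]
end

word_freqs = { "hug" => 10, "pug" => 5, "pun" => 12, "bun" => 4, "hugs" => 5 }
merges, segmented = learn_merges(word_freqs, 3)

p merges
p segmented
```

With this input, the first merge is `u` + `g` because that pair has the highest weighted count (20). The real trainer works from the training files and also handles special tokens such as `[UNK]`.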

Encode

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
output.tokens

Save the tokenizer to a file

tokenizer.save("tokenizer.json")

Load a tokenizer from a file

tokenizer = Tokenizers.from_file("tokenizer.json")

Check out the Quicktour and the equivalent Ruby code for more info.

API

This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test