Riktoken

A pure Ruby partial implementation of OpenAI's tiktoken library for BPE (Byte Pair Encoding) tokenization. Riktoken enables you to encode and decode text using the same tokenizers as OpenAI's models like GPT-4, GPT-3.5, and others.

Most of the code is ported from openai/tiktoken.

Features

Pure Ruby implementation (no native dependencies) <= this is one of the main motivations for this library
No any dependencies
Compatible with OpenAI's tiktoken encodings (partial)
Supports all major OpenAI model encodings (cl100k_base, o200k_base, p50k_base, etc.)
Special token handling
Model-to-encoding mapping

Limitations

This library only supports UTF-8 strings. This means the encoder accepts UTF-8 text, and the decoder converts the given token array into a UTF-8 string.

Installation

Add this line to your application's Gemfile:

gem 'riktoken'

And then execute:

bundle install

Or install it yourself as:

gem install riktoken

Quick Start

Setting Up .tiktoken Files

You have to download the official .tiktoken files from OpenAI and locate them to arbitrary directory in advance:

# Create base directory as you like (`~/.riktoken` is the default location)
mkdir -p ~/.riktoken

# Download encoding files
curl -o ~/.riktoken/cl100k_base.tiktoken \
  https://raw.githubusercontent.com/openai/tiktoken/main/tiktoken/assets/cl100k_base.tiktoken

curl -o ~/.riktoken/o200k_base.tiktoken \
  https://raw.githubusercontent.com/openai/tiktoken/main/tiktoken/assets/o200k_base.tiktoken

# Add other encodings as needed...

The library will search for .tiktoken files in the given directory as a parameter tiktoken_base_dir (default is ENV[TIKTOKEN_BASE_DIR] || #{ENV['HOME']}/.riktoken/).

NOTE: If no .tiktoken file is found, the library will raise an error on loading; it does not fall back to built-in encodings and/or downloads the file automatically to avoid potential performance degration. i.e. the user must guarantee that the .tiktoken files are available in the specified directory.

Synopsis

require 'riktoken'

# Get encoding by name
# You have to prepare `.tiktoken` files in the specified directory in advance.
encoding = Riktoken.get_encoding("cl100k_base", tiktoken_base_dir: "#{ENV['HOME']}/.riktoken")

# Or get encoding for a specific model
# Once `tiktoken_base_dir` is omitted, it will use the directory `ENV[TIKTOKEN_BASE_DIR] || #{ENV['HOME']}/.riktoken/` as default.
encoding = Riktoken.encoding_for_model("gpt-4")

# Encode text to tokens
tokens = encoding.encode("Hello, world!")
# => [9906, 11, 1917, 0]

# Decode tokens back to text
text = encoding.decode(tokens)
# => "Hello, world!"

# Count tokens
token_count = encoding.encode("Your text here").length
# => 3

Supported Encodings

Encoding	Models	tiktoken file name
`cl100k_base`	GPT-4, GPT-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large	`cl100k_base.tiktoken`
`o200k_base`	GPT-4o, GPT-4o-mini	`o200k_base.tiktoken`
`p50k_base`	text-davinci-003, text-davinci-002, code-davinci-002	`p50k_base.tiktoken`
`p50k_edit`	text-davinci-edit-001, code-davinci-edit-001	`p50k_base.tiktoken`
`r50k_base`	text-davinci-001, text-curie-001, text-babbage-001, text-ada-001	`r50k_base.tiktoken`

Usage Examples

Token Counting for API Cost Estimation

encoding = Riktoken.encoding_for_model("gpt-4")
text = "Your prompt here..."
token_count = encoding.encode(text).length

# Estimate API cost (example rates)
input_cost_per_1k = 0.03  # $0.03 per 1K tokens
estimated_cost = (token_count / 1000.0) * input_cost_per_1k
puts "Token count: #{token_count}"
puts "Estimated cost: $#{'%.4f' % estimated_cost}"

Handling Special Tokens

encoding = Riktoken.get_encoding("cl100k_base")

# By default, special tokens raise an error
begin
  tokens = encoding.encode("Hello <|endoftext|> world")
rescue Riktoken::Encoding::DisallowedSpecialTokenError
  puts "Special tokens not allowed!"
end

# Allow specific special tokens
tokens = encoding.encode("Hello <|endoftext|> world", allowed_special: ["<|endoftext|>"])

# Allow all special tokens
tokens = encoding.encode("Hello <|endoftext|> world", allowed_special: "all")

Splitting Text by Token Limit

def split_by_tokens(text, max_tokens, encoding)
  tokens = encoding.encode(text)
  chunks = []

  tokens.each_slice(max_tokens) do |chunk|
    chunks << encoding.decode(chunk)
  end

  chunks
end

# Example: Split text into 100-token chunks
encoding = Riktoken.get_encoding("cl100k_base")
chunks = split_by_tokens("Your long text here...", 100, encoding)

List Available Encodings and Models

# List all available encodings
puts Riktoken.list_encoding_names
# => ["cl100k_base", "o200k_base", "p50k_base", "p50k_edit", "r50k_base"]

# List all supported models
puts Riktoken.list_model_names
# => ["gpt-4", "gpt-3.5-turbo", "text-davinci-003", ...]

Advanced Usage

Custom Encodings

# Make a custom encoding
encoding = Riktoken.make_encoding(
  name: "my_custom_encoding",
  ranks: {"hello" => 0, "world" => 1},
  special_tokens: {"<|custom|>" => 100},
  pattern: /\w+/
)

tokens = encoding.encode('hello, world')

Loading from Custom .tiktoken File

encoding = Riktoken.encoding_from_file(
  path: "path/to/custom.tiktoken",
  name: "custom_encoding",
  special_tokens: {"<|special|>" => 50000},
  pattern: /'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s/
)

Precedents

IAPark/tiktoken_ruby is a Ruby port of OpenAI's tiktoken library uses native extensions. This would be a good choice if you need a faster implementation with native performance.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/riktoken. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.

License

The gem is available as open source under the terms of the MIT License.

riktoken

Runtime