Kabosu

Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.

Usage

require "kabosu"

# Explicit dictionary + tokenizer lifecycle
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tokenizer = dict.create(mode: :c)

# Tokenize Japanese text
morphemes = Kabosu.tokenize("東京都に住んでいる", tokenizer: tokenizer)

# Bulk accessors for quick extraction
morphemes.surfaces          # => ["東京都", "に", "住ん", "で", "いる"]
morphemes.readings          # => ["トウキョウト", "ニ", "スン", "デ", "イル"]
morphemes.dictionary_forms  # => ["東京都", "に", "住む", "で", "居る"]

# Each morpheme exposes rich linguistic detail
morpheme = morphemes.first
morpheme.surface             # => "東京都"          - surface form (as it appears in text)
morpheme.part_of_speech      # => ["名詞", "固有名詞", "地名", "一般"] - part-of-speech tags
morpheme.part_of_speech_id   # => 5                - numeric POS id
morpheme.dictionary_form     # => "東京都"          - base/dictionary form
morpheme.normalized_form     # => "東京都"          - normalized form
morpheme.reading_form        # => "トウキョウト"     - phonetic reading
morpheme.oov?                # => false            - out-of-vocabulary?
morpheme.dictionary_id       # => 0                - source dictionary id
morpheme.word_id             # => 544373           - internal word id
morpheme.synonym_group_ids   # => []               - synonym group ids
morpheme.dictionary_form_word_id # => -1           - dictionary-form word id
morpheme.head_word_length    # => 3                - head word length in codepoints
morpheme.a_unit_split        # => [123, 456]       - split-A word ids
morpheme.b_unit_split        # => []               - split-B word ids
morpheme.word_structure      # => [123, 456]       - word-structure ids
morpheme.total_cost          # => 5765             - morphological analysis cost
morpheme.begin               # => 0                - start byte offset
morpheme.end                 # => 9                - end byte offset
morpheme.begin_c             # => 0                - start character offset
morpheme.end_c               # => 3                - end character offset
morpheme.system?             # => true             - from system dictionary?
morpheme.user?               # => false            - from user dictionary?

# Split text into natural Japanese sentence boundaries
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。")
# => ["東京都に住んでいる。", "大阪も好きだ。"]
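The byte offsets (begin, end) and character offsets (begin_c, end_c) above differ because Japanese text is multibyte in UTF-8. The following plain-Ruby sketch does not call Kabosu at all; it only reuses the offset values shown for the first morpheme to illustrate which slicing method matches which pair. The helper names slice_by_bytes and slice_by_chars are made up for this illustration.

```ruby
TEXT = "東京都に住んでいる"

# Slice using byte offsets, as reported by morpheme.begin / morpheme.end.
def slice_by_bytes(text, first, last)
  text.byteslice(first, last - first)
end

# Slice using character offsets, as reported by morpheme.begin_c / morpheme.end_c.
def slice_by_chars(text, first, last)
  text[first...last]
end

puts slice_by_bytes(TEXT, 0, 9)  # 東京都
puts slice_by_chars(TEXT, 0, 3)  # 東京都

# Mixing the two up silently returns the wrong span:
# byte offsets used as character indices cover the whole 9-character sentence.
puts TEXT[0...9]                 # 東京都に住んでいる
```

Use the character offsets for String#[] and the byte offsets for String#byteslice or when working with raw buffers.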

Installation

Requirements:

Ruby >= 3.1
Rust toolchain (for compiling the native extension)

Add to your Gemfile:

gem "kabosu"

Then install and download a Sudachi dictionary:

bundle install
bundle exec rake kabosu:install[small]  # or core, full

Dictionary editions (from smallest to largest): small, core, full. See the SudachiDict documentation for details on the differences between editions.

Dictionary management

Rake tasks for managing Sudachi dictionaries:

rake kabosu:install[small]      # Install a dictionary (VERSION=YYYYMMDD for a specific version)
rake kabosu:install_if_missing  # Same, but a no-op when a dictionary is already installed
rake kabosu:list                # List installed dictionaries
rake kabosu:versions            # Show available versions from GitHub
rake kabosu:path                # Show path to best available dictionary
rake kabosu:remove[small]       # Remove a dictionary (VERSION=YYYYMMDD for a specific version)

Dictionaries are stored in ~/.kabosu/dict/ by default. Set KABOSU_DICT_DIR to use a different location, for example a Docker volume, so the dictionary persists across deployments.
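The lookup order described here (environment variable first, home directory default second) can be sketched in a few lines. This is an illustration only, not Kabosu's actual implementation: the dict_dir helper is hypothetical, and treating an empty KABOSU_DICT_DIR as unset is an assumption.

```ruby
# Hypothetical helper, not part of Kabosu's API.
# Returns the directory a dictionary would be looked up in.
def dict_dir(env = ENV, home: Dir.home)
  custom = env["KABOSU_DICT_DIR"]
  # Assumption: an empty value is treated the same as an unset one.
  return custom unless custom.nil? || custom.empty?

  File.join(home, ".kabosu", "dict")
end

puts dict_dir({ "KABOSU_DICT_DIR" => "/data/kabosu" }, home: "/home/app")
# /data/kabosu
puts dict_dir({}, home: "/home/app")
# /home/app/.kabosu/dict
```

In a container setup, /data/kabosu would be the mount point of the persistent volume.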

In a Rails app, the rake tasks are loaded automatically via a railtie, so no manual load is needed. For container entrypoints, rake kabosu:install_if_missing converges on the desired state without hitting the network on subsequent runs.

Tokenization modes

Sudachi provides three split modes:

Mode	Description
`A`	Short units (most granular)
`B`	Middle units
`C`	Named entity units (default)

dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tok_a = dict.create(mode: :a)
tok_c = dict.create(mode: :c)
tok_a.tokenize("東京都").surfaces  # => ["東京", "都"]
tok_c.tokenize("東京都").surfaces  # => ["東京都"]

Modes must be symbols: :a, :b, or :c, or the equivalent constants Kabosu::MODE_A, Kabosu::MODE_B, and Kabosu::MODE_C.
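Because modes are symbols, a value read from configuration or an environment variable (typically a string such as "C") has to be converted before it is passed to dict.create. A minimal sketch of such a conversion follows; normalize_mode and VALID_MODES are hypothetical names for this example and are not provided by Kabosu.

```ruby
# Hypothetical helper, not part of Kabosu's API.
VALID_MODES = %i[a b c].freeze

# Converts "C", "c", " b " or :a into one of :a, :b, :c.
# Raises ArgumentError for anything else.
def normalize_mode(value)
  mode = value.to_s.strip.downcase.to_sym
  unless VALID_MODES.include?(mode)
    raise ArgumentError, "unknown split mode: #{value.inspect}"
  end

  mode
end

p normalize_mode("C")   # :c
p normalize_mode(:a)    # :a
```

The result can then be used as dict.create(mode: normalize_mode(ENV.fetch("SPLIT_MODE", "c"))), where SPLIT_MODE is likewise an assumed variable name.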

Advanced Use Cases

# Custom system dictionary + optional user dictionaries
dict = Kabosu::Dictionary.new(
  system_dict: "/path/to/custom/system.dic",
  user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
)

# Create tokenizer with explicit mode/fields
tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

# Tokenize (returns MorphemeList; lazily hydrates morphemes)
list = tokenizer.tokenize("国会議事堂前駅")
list.surfaces
list.first.part_of_speech

# Dictionary prefix lookup
dict.lookup("東京都").surfaces

# Morpheme split
m = tokenizer.tokenize("東京都").first
m.split(mode: :a).surfaces

# Sentence splitting
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)

Benchmarks

Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw sudachi.rs.

The benchmark input is Wagahai wa Neko de Aru (I Am a Cat) by Natsume Soseki, a public-domain text sourced from Aozora Bunko: about 958 KB of Japanese prose in 2,256 lines.

Results

Measured on an AMD Ryzen 7 5800X, full dictionary edition, Ruby 3.4, Rust 1.84:

Single-thread (10 iterations):

Scenario	Rust	Ruby	Ratio
split_sentences	1.550s	1.615s	1.0x
tokenize (mode C)	3.148s	3.395s	1.1x
tokenize (mode A)	3.227s	3.525s	1.1x
tokenize (mode B)	3.226s	3.582s	1.1x
Throughput	2.94 MB/s	2.69 MB/s	1.1x

Multithread (8 threads x 20,000 requests):

Scenario	Rust	Ruby	Ratio
rails-style shared tokenizer	1.475s	2.101s	1.4x
tokenizer per thread	1.381s	2.154s	1.6x
Throughput ST	20.44 MB/s	14.35 MB/s	1.4x
Throughput PT	21.84 MB/s	14.00 MB/s	1.6x
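The Ratio column for the timing rows can be recomputed from the two tables as Ruby time divided by Rust time, rounded to one decimal place. The sketch below only reuses the figures reported above; the TIMINGS constant and ratio helper are names chosen for this example.

```ruby
# [rust_seconds, ruby_seconds] copied from the tables above.
TIMINGS = {
  "split_sentences"              => [1.550, 1.615],
  "tokenize (mode C)"            => [3.148, 3.395],
  "tokenize (mode A)"            => [3.227, 3.525],
  "tokenize (mode B)"            => [3.226, 3.582],
  "rails-style shared tokenizer" => [1.475, 2.101],
  "tokenizer per thread"         => [1.381, 2.154]
}.freeze

# Overhead of the Ruby bindings relative to raw Rust.
def ratio(rust, ruby)
  (ruby / rust).round(1)
end

TIMINGS.each do |name, (rust, ruby)|
  puts format("%-30s %.1fx", name, ratio(rust, ruby))
end
```

Running it reproduces the 1.0x, 1.1x, 1.4x, and 1.6x values in the tables.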

Notes:

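shared tokenizer (ST) mirrors Rails-style access, where all request threads call a single tokenizer instance.
per thread (PT) creates one tokenizer per worker thread.
Ratios are Ruby / Rust for timings and Rust / Ruby for throughput, so a higher ratio always means more overhead in the Ruby bindings. Values vary by CPU, Ruby version, and dictionary edition.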
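The two multithreaded layouts can be sketched as follows. To keep the example runnable without a dictionary, it uses a stand-in StubTokenizer class that merely splits text into characters; in a real application the stub would be replaced by a tokenizer from dict.create(mode: :c). The method names run_shared and run_per_thread are invented for this illustration and are not taken from the benchmark suite.

```ruby
# Stand-in for a Kabosu tokenizer (assumption: same #tokenize entry point).
class StubTokenizer
  def tokenize(text)
    text.chars
  end
end

# Split the work into at most `threads` slices of roughly equal size.
def slices(texts, threads)
  size = [(texts.size.to_f / threads).ceil, 1].max
  texts.each_slice(size).to_a
end

# Layout 1: every thread calls the same tokenizer instance.
def run_shared(texts, threads: 4)
  tokenizer = StubTokenizer.new
  workers = slices(texts, threads).map do |slice|
    Thread.new { slice.map { |t| tokenizer.tokenize(t) } }
  end
  workers.flat_map(&:value)
end

# Layout 2: every thread builds its own tokenizer instance.
def run_per_thread(texts, threads: 4)
  workers = slices(texts, threads).map do |slice|
    Thread.new do
      tokenizer = StubTokenizer.new
      slice.map { |t| tokenizer.tokenize(t) }
    end
  end
  workers.flat_map(&:value)
end

texts = %w[東京都 大阪]
p run_shared(texts, threads: 2)
p run_per_thread(texts, threads: 2)
```

Both layouts return the same results in input order; they differ only in how many tokenizer instances exist, which is the variable the multithread benchmark compares.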

To reproduce these results, run:

bundle exec ruby bench/start

To generate flamegraph SVGs alongside the benchmark:

bundle exec ruby bench/start --profile

This records both the Rust and Ruby runs with perf and produces interactive SVGs (bench/flamegraph-rust.svg, bench/flamegraph-ruby.svg). Open them in a browser to explore.

Contributing

bundle install

bundle exec rake kabosu:install # Install Sudachi dictionary

bundle exec rake compile        # Build the native extension  
bundle exec rake test           # Run tests

bench/start                     # Run benchmarks
