Project

kabosu

0.0
A long-lived project that still receives updates
Kabosu provides Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.
 Dependencies

Development

~> 5.0
~> 13.0

Runtime

~> 0.9
 Project Readme

Kabosu

Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.

Usage

require "kabosu"

# Explicit dictionary + tokenizer lifecycle
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tokenizer = dict.create(mode: :c)

# Tokenize Japanese text
morphemes = Kabosu.tokenize("東京都に住んでいる", tokenizer: tokenizer)

# Bulk accessors for quick extraction
morphemes.surfaces          # => ["東京都", "に", "住ん", "で", "いる"]
morphemes.readings          # => ["トウキョウト", "ニ", "スン", "デ", "イル"]
morphemes.dictionary_forms  # => ["東京都", "に", "住む", "で", "居る"]

# Each morpheme exposes rich linguistic detail
morpheme = morphemes.first
morpheme.surface             # => "東京都"          - surface form (as it appears in text)
morpheme.part_of_speech      # => ["名詞", "固有名詞", "地名", "一般"] - part-of-speech tags
morpheme.part_of_speech_id   # => 5                - numeric POS id
morpheme.dictionary_form     # => "東京都"          - base/dictionary form
morpheme.normalized_form     # => "東京都"          - normalized form
morpheme.reading_form        # => "トウキョウト"     - phonetic reading
morpheme.oov?                # => false            - out-of-vocabulary?
morpheme.dictionary_id       # => 0                - source dictionary id
morpheme.word_id             # => 544373           - internal word id
morpheme.synonym_group_ids   # => []               - synonym group ids
morpheme.dictionary_form_word_id # => -1           - dictionary-form word id
morpheme.head_word_length    # => 3                - head word length in codepoints
morpheme.a_unit_split        # => [123, 456]       - split-A word ids
morpheme.b_unit_split        # => []               - split-B word ids
morpheme.word_structure      # => [123, 456]       - word-structure ids
morpheme.total_cost          # => 5765             - morphological analysis cost
morpheme.begin               # => 0                - start byte offset
morpheme.end                 # => 9                - end byte offset
morpheme.begin_c             # => 0                - start character offset
morpheme.end_c               # => 3                - end character offset
morpheme.system?             # => true             - from system dictionary?
morpheme.user?               # => false            - from user dictionary?

# Split text into natural Japanese sentence boundaries
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。")
# => ["東京都に住んでいる。", "大阪も好きだ。"]
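The byte offsets (begin, end) and character offsets (begin_c, end_c) shown above differ because each kanji or kana in the example occupies three bytes in UTF-8. The following is a minimal plain-Ruby sketch of that relationship; it does not call Kabosu, and offsets_for is a hypothetical helper introduced only for illustration.

```ruby
# Hypothetical helper (not part of Kabosu): compute byte and character
# offsets of the first occurrence of `surface` inside `text`.
def offsets_for(text, surface)
  begin_c = text.index(surface)            # character (codepoint) offset
  return nil if begin_c.nil?

  end_c   = begin_c + surface.length       # exclusive character offset
  begin_b = text[0, begin_c].bytesize      # UTF-8 bytes before the match
  end_b   = begin_b + surface.bytesize     # exclusive byte offset

  { begin: begin_b, end: end_b, begin_c: begin_c, end_c: end_c }
end

text = "東京都に住んでいる"
first = offsets_for(text, "東京都")
# first has begin 0, end 9, begin_c 0, end_c 3

# Slicing by bytes or by characters recovers the same surface string.
by_bytes = text.byteslice(first[:begin], first[:end] - first[:begin])
by_chars = text[first[:begin_c]...first[:end_c]]
```

Use the byte offsets when indexing into a raw byte buffer and the character offsets when slicing a Ruby String with [] or a range.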

Installation

Requirements:

  • Ruby >= 3.1
  • Rust toolchain (for compiling the native extension)

Add to your Gemfile:

gem "kabosu"

Then install the gem and download a Sudachi dictionary:

bundle install
bundle exec rake kabosu:install[small]  # or core, full

Dictionary editions (from smallest to largest): small, core, full. See the SudachiDict documentation for details on the differences between editions.

Dictionary management

Rake tasks for managing Sudachi dictionaries:

rake kabosu:install[small]     # Install a dictionary (VERSION=YYYYMMDD for a specific version)
rake kabosu:list               # List installed dictionaries
rake kabosu:versions           # Show available versions from GitHub
rake kabosu:path               # Show path to best available dictionary
rake kabosu:remove[small]      # Remove a dictionary (VERSION=YYYYMMDD for a specific version)

Dictionaries are stored in ~/.kabosu/dict/ by default. Set KABOSU_DICT_DIR to store them in a different directory.
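An application that needs to know where dictionaries live (for example, to check disk usage) could mirror the rule described here. This is only a sketch: dictionary_dir is a hypothetical helper, not Kabosu's API, and it assumes that KABOSU_DICT_DIR is read as an environment variable and that an empty value counts as unset. For the path Kabosu actually resolves, the README's own Kabosu::Dictionary.path and rake kabosu:path are the authoritative sources.

```ruby
# Hypothetical helper (not part of Kabosu): resolve the dictionary directory
# from an environment-like hash, falling back to the documented default.
def dictionary_dir(env = ENV)
  custom = env["KABOSU_DICT_DIR"]
  # Assumption: an empty value is treated the same as an unset one.
  return custom unless custom.nil? || custom.empty?

  File.join(Dir.home, ".kabosu", "dict")
end

puts dictionary_dir                                        # default or override from ENV
puts dictionary_dir("KABOSU_DICT_DIR" => "/srv/sudachi")   # explicit override
```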

Tokenization modes

Sudachi provides three split modes:

Mode  Description
A     Short units (most granular)
B     Middle units
C     Named entity units (default)

dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tok_a = dict.create(mode: :a)
tok_c = dict.create(mode: :c)
tok_a.tokenize("東京都").surfaces  # => ["東京", "都"]
tok_c.tokenize("東京都").surfaces  # => ["東京都"]

Pass modes as symbols (:a, :b, :c) or as the constants Kabosu::MODE_A, Kabosu::MODE_B, and Kabosu::MODE_C.

Advanced Use Cases

# Custom system dictionary + optional user dictionaries
dict = Kabosu::Dictionary.new(
  system_dict: "/path/to/custom/system.dic",
  user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
)

# Create tokenizer with explicit mode/fields
tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

# Tokenize (returns MorphemeList; lazily hydrates morphemes)
list = tokenizer.tokenize("国会議事堂前駅")
list.surfaces
list.first.part_of_speech

# Dictionary prefix lookup
dict.lookup("東京都").surfaces

# Morpheme split
m = tokenizer.tokenize("東京都").first
m.split(mode: :a).surfaces

# Sentence splitting
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)

Benchmarks

Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw sudachi.rs.

The benchmark input is Wagahai wa Neko de Aru (I Am a Cat) by Natsume Soseki, a public-domain text sourced from Aozora Bunko: about 958 KB of Japanese prose in 2,256 lines.

Results

Measured on an AMD Ryzen 7 5800X, full dictionary edition, Ruby 3.4, Rust 1.84:

Single-thread (10 iterations):

Scenario            Rust       Ruby       Ratio
split_sentences     1.550s     1.615s     1.0x
tokenize (mode C)   3.148s     3.395s     1.1x
tokenize (mode A)   3.227s     3.525s     1.1x
tokenize (mode B)   3.226s     3.582s     1.1x
Throughput          2.94 MB/s  2.69 MB/s  1.1x

Multithread (8 threads x 20,000 requests):

Scenario                           Rust        Ruby        Ratio
rails-style shared tokenizer       1.475s      2.101s      1.4x
tokenizer per thread               1.381s      2.154s      1.6x
Throughput (shared tokenizer)      20.44 MB/s  14.35 MB/s  1.4x
Throughput (tokenizer per thread)  21.84 MB/s  14.00 MB/s  1.6x

Notes:

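  • shared tokenizer models Rails-style access, where all request threads call a single tokenizer instance.
  • per thread creates one tokenizer for each worker thread.
  • Ratios for timing rows are Ruby / Rust; for throughput rows they are Rust / Ruby. In both cases a higher ratio means more overhead in the Ruby bindings. Values vary by CPU, Ruby version, and dictionary edition.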
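As a worked check of the Ratio column, the sketch below recomputes it from the published figures. The method names are hypothetical and exist only for this illustration; timing rows divide Ruby seconds by Rust seconds, while throughput rows divide Rust MB/s by Ruby MB/s so that a larger number always means more overhead on the Ruby side.

```ruby
# Hypothetical helpers for illustration; they are not part of the benchmark suite.

# Timing rows: how many times longer the Ruby run took than the Rust run.
def time_ratio(rust_seconds, ruby_seconds)
  (ruby_seconds / rust_seconds).round(1)
end

# Throughput rows: how many times more data per second Rust processed.
def throughput_ratio(rust_mbps, ruby_mbps)
  (rust_mbps / ruby_mbps).round(1)
end

# Figures copied from the results tables: [Rust, Ruby].
timings = {
  "split_sentences"              => [1.550, 1.615],
  "tokenize (mode C)"            => [3.148, 3.395],
  "rails-style shared tokenizer" => [1.475, 2.101],
  "tokenizer per thread"         => [1.381, 2.154]
}

timings.each do |scenario, (rust, ruby)|
  puts format("%-30s %.1fx", scenario, time_ratio(rust, ruby))
end
puts format("%-30s %.1fx", "single-thread throughput", throughput_ratio(2.94, 2.69))
```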

To reproduce these results, run:

bundle exec ruby bench/start

To generate flamegraph SVGs alongside the benchmark:

bundle exec ruby bench/start --profile

This records both the Rust and Ruby runs with perf and produces interactive SVGs (bench/flamegraph-rust.svg, bench/flamegraph-ruby.svg). Open them in a browser to explore.

Contributing

bundle install

bundle exec rake kabosu:install # Install Sudachi dictionary

bundle exec rake compile        # Build the native extension  
bundle exec rake test           # Run tests

bench/start                     # Run benchmarks