0.02
No commit activity in last 3 years
No release in over 3 years
Ruby port of TinySegmenter.js for tokenizing Japanese text. Uses a Naive Bayes model that has been trained using the RWCP corpus and optimized using L1-norm regularization. The resultant model is quite compact, yet has a 95% accuracy rate.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 10.4
~> 3.3
 Project Readme

Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.

Build Status

Install

gem install tiny_segmenter or add tiny_segmenter to your Gemfile

Usage

ts = TinySegmenter.new
ts.segment("今晩は!良い天気ですね。")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね", "。"]
ts.segment("今晩は!良い天気ですね。", ignore_punctuation: true)
# => ["今晩", "は", "良い", "天気", "です", "ね"]

Input text should be UTF-8 encoded.

How it works

The Naive Bayes model was trained using the RWCP corpus and optimized using L1-norm regularization (e.g. this). The resultant model is quite compact, yet (according to the author) has about a 95% accuracy rate.

License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt