Project

mormor

0.0
No release in over a year
Morfologik dictionaries client in pure Ruby: POS tagging & spellcheck
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

Runtime

>= 3.15.0
 Project Readme

MorMor

Gem Version

MorMor is pure Ruby morfologik dictionary client that could be used for POS (part of speech) tagging and simplistic spellchecking. Morfologik format's distinguishing feature is it is primary dictionary format for LanguageTool, therefore a lot of ready high-quality dictionaries exist.

Features/Problems

  • No dependencies¹, pure Ruby
  • Fast: I don't have any detailed numbers, but naive test on my laptop shows 3 mln lookups/second on a very large dictionary (Polish, several million word forms).
  • Relatively memory-efficient: Typical dictionary file size is 1-3 Mb, mormor just loads it into memory as bytes (e.g. each byte => Ruby Integer) and that's all memory it needs.
  • Dictionaries for a lot of languages already exist: unlike your typical POS tagger, usage instructions does not start with "First, take your corpora and train the tagger as you please" (see "Dictionaries" section).
  • To the moment, it is just a naive port of original Morfologik Java code, but it works with all the dictionaries I could find:
    • Of possible dictionary formats, only FSA5 and CFSA2 are implemented (not CFSA);
    • Of possible dictionary "encoders", only "SUFFIX" and "PREFIX" are implemented;
  • No tests/specs, but it works (and checked thoroughly with existing dictionaries); TBH, original Morfologik doesn't have much, either;
  • Morfologik's spellchecker suggestions/candidates are not ported, so mormor can be used only for "sanity" spellchecking ("this word is/is not in the dictionary")

¹The only runtime dependency is backports and that's only because I am too fond of modern Ruby features to sacrifice them to "no-dependencies" god.

Usage

  1. Install mormor gem (via bundler or just [sudo] gem install mormor)
  2. Take a dictionary for your language (see "Dictionaries" section below)
  3. Now...
require 'mormor'

dictionary = MorMor::Dictionary.new('path/to/english')
dictionary.lookup('meowing')
# => [#<struct MorMor::Dictionary::Word stem="meow", tags="VBG">]
dictionary.lookup('barks')
# => [#<struct MorMor::Dictionary::Word stem="bark", tags="NNS">,
#     #<struct MorMor::Dictionary::Word stem="bark", tags="VBZ">]
dictionary.lookup('borogoves')
# = nil

dictionary = MorMor::Dictionary.new('path/to/ukrainian')
dictionary.lookup("солов'їна")
# => [#<struct MorMor::Dictionary::Word stem="солов'їний", tags="adj:f:v_kly">,
#     #<struct MorMor::Dictionary::Word stem="солов'їний", tags="adj:f:v_naz">]

Dictionary#lookup returns an array of structs which describe all possible base forms + part of speech /word form tags. (For example, "barks" could be a third person form of the verb "to bark", or plural form of noun "bark".)

Tags are dependent on the particular dictionary used and typically documented in a free form alongside the dictionaries.

Dictionaries

A lot of dictionaries in Morfologik format could be found at LanguageTool's repo. For example, for Polish language, dictionary is at languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/.

What you need there, are:

  • polish.dict is a dictionary (binary finite-state-automata) itself
  • polish.info is dictionary metadata

In order to use Polish dictionary with mormor, you need to place both files at the same folder, and then

pl = MorMor::Dictionary.new('path/to/that/folder/polish') # without extension
pl.lookup('świetnie')

You may also be interested in tagset.txt file of the same folder, which has an explanation for all POS/forms tags in natural language (Polish language, for that case).

Sometimes (for example, in case of German and Ukrainian), LanguageTool repo contains not the dictionary itself, but a link to other repo/site where it can be downloaded.

Please carefully consider dictionary licenses when using them!

Note: mormor repo contains copies of dictionary files from LanguageTool and referred projects, but they are not a part of the gem distribution and only used for testing the parser/lookup correctness, and demonstration purposes.

License and credits

Most of the credit for algorithms and original code belong to original Morfologik's authors, and author of paper's they based their work on.

Ruby version is done by Victor Shepelev.

The license is BSD, the same as the original Morfologik.