Myasorubka

Myasorubka is a morphological data processor that supports AOT and MULTEXT-East notations.

MULTEXT-East morphosyntactic descriptors

Myasorubka can both parse and generate MULTEXT-East morphosyntactic descriptors (MSDs).

Myasorubka provides predefined morphosyntactic specifications for English and Russian, based on Version 4 of the MULTEXT-East resources.

Use the Myasorubka::MSD class to parse an MSD.

>> require 'myasorubka/msd/russian'
=> true
>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian, 'Ncnpdy')
=> #<Myasorubka::MSD::Russian msd="Ncnpdy">
>> msd.pos
=> :noun
>> msd.grammemes
=> {:type=>:common, :gender=>:neuter, :number=>:plural, :case=>:dative, :animate=>:yes}
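An MSD is positional: the first character names the part of speech and each following character encodes one attribute, with `-` marking an attribute that does not apply. The sketch below illustrates that idea on its own, without the library. The `NOUN_ATTRIBUTES` table and the `decode_noun_msd` helper are hypothetical names, and the attribute order and letter codes are assumptions chosen to agree with the transcript above, not Myasorubka's actual specification tables.

```ruby
# Illustrative subset of a noun specification. The attribute order and the
# letter codes are assumptions that agree with the 'Ncnpdy' example above;
# they are not copied from the library.
NOUN_ATTRIBUTES = [
  [:type,    { 'c' => :common, 'p' => :proper }],
  [:gender,  { 'm' => :masculine, 'f' => :feminine, 'n' => :neuter }],
  [:number,  { 's' => :singular, 'p' => :plural }],
  [:case,    { 'n' => :nominative, 'g' => :genitive, 'd' => :dative,
               'a' => :accusative, 'l' => :locative, 'i' => :instrumental }],
  [:animate, { 'n' => :no, 'y' => :yes }]
].freeze

# Decode a noun MSD such as 'Ncnpdy' into a hash of grammemes.
# A '-' marks an attribute that is not applicable and is skipped.
def decode_noun_msd(msd)
  raise ArgumentError, "not a noun descriptor: #{msd}" unless msd.start_with?('N')

  codes = msd[1..-1].chars
  raise ArgumentError, "descriptor too long: #{msd}" if codes.size > NOUN_ATTRIBUTES.size

  codes.each_with_index.with_object({}) do |(code, index), grammemes|
    next if code == '-'

    name, values = NOUN_ATTRIBUTES[index]
    grammemes[name] = values.fetch(code) { raise ArgumentError, "unknown #{name} code #{code.inspect} in #{msd}" }
  end
end

p decode_noun_msd('Ncnpdy')
```

With these assumed tables, `decode_noun_msd('Ncnpdy')` yields the same grammemes as the transcript above. The sketch signals bad input with `ArgumentError`, whereas the library raises its own exception class.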

If the given MSD is invalid, Myasorubka::MSD::InvalidDescriptor is raised.

>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian, 'Sasai')
Myasorubka::MSD::InvalidDescriptor: Sasai

The Myasorubka::MSD class can also build MSDs.

>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian)
=> #<Myasorubka::MSD::Russian msd="">
>> msd.pos = :verb
=> :verb
>> msd[:type] = :main
=> :main
>> msd[:definiteness] = :full_art
=> :full_art
>> msd
=> #<Myasorubka::MSD::Russian msd="Vm------f">
>> msd.to_s
=> "Vm------f"
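The output `Vm------f` follows from the positional layout: unset attributes between the first and the last assigned one are written as `-`, and trailing unset positions are omitted. Here is a minimal standalone sketch of that serialization. `VERB_ATTRIBUTES`, `VERB_CODES` and `encode_verb_msd` are hypothetical names, and the attribute order is an assumption that merely agrees with the transcript above (definiteness in the eighth position); the library may implement this differently.

```ruby
# Assumed verb attribute order, chosen to agree with the transcript above.
VERB_ATTRIBUTES = %i[type vform tense person number gender voice definiteness aspect case].freeze

# Assumed letter codes, limited to the two values used in the example.
VERB_CODES = {
  type:         { main: 'm' },
  definiteness: { full_art: 'f' }
}.freeze

# Serialize a hash of verb grammemes into a positional MSD string.
def encode_verb_msd(grammemes)
  unknown = grammemes.keys - VERB_ATTRIBUTES
  raise ArgumentError, "unknown attributes: #{unknown.inspect}" unless unknown.empty?

  codes = VERB_ATTRIBUTES.map do |name|
    value = grammemes[name]
    next '-' if value.nil? # unset attribute

    VERB_CODES.fetch(name, {}).fetch(value) { raise ArgumentError, "no code for #{name}=#{value.inspect}" }
  end

  # Trailing dashes carry no information, so they are dropped.
  ('V' + codes.join).sub(/-+\z/, '')
end

puts encode_verb_msd(type: :main, definiteness: :full_art)
```

Under these assumptions the call prints the same string as `msd.to_s` in the transcript.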

AOT dictionaries

Myasorubka provides simple parsers for lexicons in the AOT format, covering both gramtab and dictionary files.

>> require 'myasorubka/aot'
=> true
>> mrd = Myasorubka::AOT::Dictionary.new('morphs.mrd', :russian, 'CP1251')
=> #<Myasorubka::AOT::Dictionary filename="morphs.mrd" language=:russian>
>> tab = Myasorubka::AOT::Gramtab.new('rgramtab.tab', 'CP1251')
=> #<Myasorubka::AOT::Gramtab filename="rgramtab.tab" language=nil>

Now it is easy to extract surnames with their word forms from the parsed lexicon.

>> ancodes = tab.ancodes.
?>   map { |k, h| [k, h[:grammemes].split(',').compact] }.
?>   select { |_, g| g.include? 'фам' }.map(&:first)
=> ["Уы"]
>> lemmas = mrd.lemmas.
?>   select { |_, _, _, _, ancode, _| ancodes.include? ancode }
=> [["ААРОН", 28, 22, 5, "Уы", nil], ["АБАЗЕВ", 33, 27, 1, "Уы", nil], ...]
>> lemmas.each do |stem, rule_id, *_|
?>   mrd.rules[rule_id].each do |suffix, ancode, prefix|
?>     puts [prefix, stem, suffix].join
?>   end
?> end

ААРОН
ААРОНА
ААРОНУ
ААРОНА
...
ЯЩУКОВ
ЯЩУКАМ
ЯЩУКАМИ
ЯЩУКАХ
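The same pipeline can be wrapped in two small helpers. The sketch below runs on toy in-memory data rather than real AOT files: `GRAMTAB`, `RULES`, `LEMMAS`, `ancodes_with` and `word_forms` are hypothetical names, and every value is made up to mirror the shapes seen in the transcript (a gramtab entry with a `:grammemes` string, rules as lists of suffix, ancode and prefix, lemmas as six-element arrays).

```ruby
# Toy stand-ins for the parsed structures. The shapes follow the transcript
# above; the contents are invented for illustration only.
GRAMTAB = {
  'Уы' => { grammemes: 'фам,мр' },
  'Аа' => { grammemes: 'мр,ед,им' }
}.freeze

# Each rule is a list of [suffix, ancode, prefix] triples.
RULES = [
  [['', 'аа', nil], ['А', 'аб', nil], ['У', 'ав', nil]]
].freeze

# Each lemma starts with its stem and rule id and carries its ancode fifth.
LEMMAS = [
  ['ААРОН', 0, 0, 0, 'Уы', nil],
  ['СТОЛ',  0, 0, 0, 'Аа', nil]
].freeze

# Return the ancodes whose grammeme list contains the given grammeme.
def ancodes_with(gramtab, grammeme)
  gramtab.select { |_, entry| entry[:grammemes].split(',').include?(grammeme) }.keys
end

# Expand every lemma tagged with one of the ancodes into its word forms.
def word_forms(lemmas, rules, ancodes)
  lemmas.select { |_, _, _, _, ancode, _| ancodes.include?(ancode) }
        .flat_map do |stem, rule_id, *_|
          rules.fetch(rule_id).map { |suffix, _, prefix| [prefix, stem, suffix].join }
        end
end

surname_ancodes = ancodes_with(GRAMTAB, 'фам')
puts word_forms(LEMMAS, RULES, surname_ancodes)
```

Returning an array instead of printing inside the loop keeps the forms available for further filtering or deduplication.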

You can learn more about the AOT lexicon format from the corresponding whitepaper.

Contributing

Fork it;
Create your feature branch (git checkout -b my-new-feature);
Commit your changes (git commit -am 'Added some feature');
Push to the branch (git push origin my-new-feature);
Create a new Pull Request.
