Repository is archived
No commit activity in last 3 years
No release in over 3 years
Myasorubka is a morphological data processor.

Myasorubka

Myasorubka is a morphological data processor that supports AOT and MULTEXT-East notations.

MULTEXT-East morphosyntactic descriptors

Myasorubka can process MULTEXT-East morphosyntactic descriptors (MSDs): it parses an MSD string into a part of speech and a set of grammemes, and it builds an MSD string from them.

Myasorubka provides predefined morphosyntactic specifications for English and Russian, based on the MULTEXT-East Version 4 resources.

To parse an MSD, pass a language specification and the descriptor string to the Myasorubka::MSD class.

>> require 'myasorubka/msd/russian'
=> true
>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian, 'Ncnpdy')
=> #<Myasorubka::MSD::Russian msd="Ncnpdy">
>> msd.pos
=> :noun
>> msd.grammemes
=> {:type=>:common, :gender=>:neuter, :number=>:plural, :case=>:dative, :animate=>:yes}
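
An MSD is positional: the first character names the part of speech, and each following character encodes one attribute. The sketch below shows the idea without using the gem. The `NOUN_ATTRIBUTES` table and the `decode_noun_msd` helper are hypothetical names introduced for illustration, and the table is a hand-written subset of the Russian noun specification as understood here, not the gem's internal data.

```ruby
# Illustrative subset of the Russian noun table; NOT the gem's internal data.
# Each entry owns one position after the leading category letter 'N'.
NOUN_ATTRIBUTES = [
  [:type,    { 'c' => :common, 'p' => :proper }],
  [:gender,  { 'm' => :masculine, 'f' => :feminine, 'n' => :neuter, 'c' => :common }],
  [:number,  { 's' => :singular, 'p' => :plural }],
  [:case,    { 'n' => :nominative, 'g' => :genitive, 'd' => :dative,
               'a' => :accusative, 'v' => :vocative, 'l' => :locative,
               'i' => :instrumental }],
  [:animate, { 'n' => :no, 'y' => :yes }]
].freeze

# Hypothetical helper: decodes a noun MSD into [pos, grammemes].
def decode_noun_msd(msd)
  raise ArgumentError, "not a noun MSD: #{msd}" unless msd.start_with?('N')

  grammemes = {}
  msd[1..-1].each_char.with_index do |code, i|
    next if code == '-' # '-' marks an unspecified attribute

    name, values = NOUN_ATTRIBUTES.fetch(i) do
      raise ArgumentError, "too many positions: #{msd}"
    end
    grammemes[name] = values.fetch(code) do
      raise ArgumentError, "unknown #{name} code #{code.inspect} in #{msd}"
    end
  end
  [:noun, grammemes]
end

pos, grammemes = decode_noun_msd('Ncnpdy')
```

For 'Ncnpdy' this sketch yields the same part of speech and grammemes as the session above. An unknown code raises ArgumentError here; this is a stand-in, not the gem's own error class.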

An invalid MSD raises Myasorubka::MSD::InvalidDescriptor.

>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian, 'Sasai')
Myasorubka::MSD::InvalidDescriptor: Sasai

The Myasorubka::MSD class can also build MSDs.

>> msd = Myasorubka::MSD.new(Myasorubka::MSD::Russian)
=> #<Myasorubka::MSD::Russian msd="">
>> msd.pos = :verb
=> :verb
>> msd[:type] = :main
=> :main
>> msd[:definiteness] = :full_art
=> :full_art
>> msd
=> #<Myasorubka::MSD::Russian msd="Vm------f">
>> msd.to_s
=> "Vm------f"
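
Building works in the opposite direction: every attribute is written at its own position, and attributes that are not set become '-'. The sketch below illustrates this without using the gem. `VERB_POSITIONS`, `VERB_CODES` and `encode_verb_msd` are hypothetical names; the position order is an assumption that matches the "Vm------f" result above, and only the two values used in the session are filled in.

```ruby
# Assumed order of the first eight Russian verb positions; illustration only.
VERB_POSITIONS = %i[type vform tense person number gender voice definiteness].freeze

# Only the values that appear in the session above; NOT a complete table.
VERB_CODES = {
  type:         { main: 'm' },
  definiteness: { full_art: 'f' }
}.freeze

# Hypothetical helper: encodes a grammeme hash as a verb MSD string.
def encode_verb_msd(grammemes)
  codes = VERB_POSITIONS.map do |name|
    value = grammemes[name]
    next '-' if value.nil? # unset attributes become placeholders

    VERB_CODES.fetch(name, {}).fetch(value) do
      raise ArgumentError, "unknown #{name} value: #{value.inspect}"
    end
  end
  # Trailing placeholders carry no information, so drop them.
  ('V' + codes.join).sub(/-+\z/, '')
end

msd = encode_verb_msd(type: :main, definiteness: :full_art)
```

Here `msd` equals "Vm------f", the same string the session produces. Dropping trailing placeholders is a design choice of this sketch that happens to agree with that output.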

AOT dictionaries

Myasorubka provides simple parsers for lexicons in the AOT format, covering both gramtab and dictionary files.

>> require 'myasorubka/aot'
=> true
>> mrd = Myasorubka::AOT::Dictionary.new('morphs.mrd', :russian, 'CP1251')
=> #<Myasorubka::AOT::Dictionary filename="morphs.mrd" language=:russian>
>> tab = Myasorubka::AOT::Gramtab.new('rgramtab.tab', 'CP1251')
=> #<Myasorubka::AOT::Gramtab filename="rgramtab.tab" language=nil>

Now it's pretty easy to extract surnames with their word forms from the parsed lexicon.

>> ancodes = tab.ancodes.
?>   map { |k, h| [k, h[:grammemes].split(',').compact] }.
?>   select { |_, g| g.include? 'фам' }.map(&:first)
=> ["Уы"]
>> lemmas = mrd.lemmas.
?>   select { |_, _, _, _, ancode, _| ancodes.include? ancode }
=> [["ААРОН", 28, 22, 5, "Уы", nil], ["АБАЗЕВ", 33, 27, 1, "Уы", nil], ...]
>> lemmas.each do |stem, rule_id, *_|
?>   mrd.rules[rule_id].each do |suffix, ancode, prefix|
?>     puts [prefix, stem, suffix].join
?>   end
?> end
ААРОН
ААРОНА
ААРОНУ
ААРОНА
...
ЯЩУКОВ
ЯЩУКАМ
ЯЩУКАМИ
ЯЩУКАХ
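
The last step of the session joins a prefix, a stem and a suffix for every entry of the lemma's rule. The sketch below reproduces that step on in-memory data so it can be tried without the AOT files. The `rules` hash, the rule id 0, the ancodes 'aa', 'ab', 'ac' and the `word_forms` helper are made up for illustration; only the stem and the resulting forms come from the output above.

```ruby
# Hypothetical stand-in for mrd.rules: rule id => [suffix, ancode, prefix] triples.
# The ancodes here are placeholders, not real AOT codes.
rules = {
  0 => [['', 'aa', nil], ['А', 'ab', nil], ['У', 'ac', nil], ['А', 'ab', nil]]
}

# Hypothetical stand-in for the selected lemmas: stem and rule id only.
lemmas = [['ААРОН', 0]]

# Builds every word form of a stem from its paradigm.
# Array#join renders a nil prefix as an empty string.
def word_forms(stem, paradigm)
  paradigm.map { |suffix, _ancode, prefix| [prefix, stem, suffix].join }
end

forms = lemmas.flat_map { |stem, rule_id| word_forms(stem, rules.fetch(rule_id)) }
unique_forms = forms.uniq # the same surface form may appear under several ancodes
```

Keeping `forms` and `unique_forms` separate makes it explicit that a paradigm may list the same surface form more than once, as ААРОНА does in the output above.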

You can learn more about the AOT lexicon from the corresponding whitepaper.

Contributing

  1. Fork it;
  2. Create your feature branch (git checkout -b my-new-feature);
  3. Commit your changes (git commit -am 'Added some feature');
  4. Push to the branch (git push origin my-new-feature);
  5. Create new Pull Request.

Copyright

Copyright (c) 2011–2019 Dmitry Ustalov. See LICENSE for details.