GRC - Ancient Greek Methods for Ruby
Several problems can come up when using unicode greek characters. This gem solves some of them.
Installation
Install the gem and add to the application's Gemfile by executing:
$ bundle add grc
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install grc
Usage
require 'grc'
General methods
grc?
(str → bool)
String contains greek letters? This method will check and return true
or false
.
irb(main):001:0> 'Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος'.grc?
=> true
irb(main):002:0> 'Greekless sentence'.grc?
=> false
tokenize
(str → array)
This method will tokenize a string; i.e., return an array of objects such as words and punctuation marks.
irb(main):003:0> 'Πάντες ἄνθρωποι τοῦ εἰδέναι ὀρέγονται φύσει. σημεῖον δ᾽ ἡ τῶν αἰσθήσεων ἀγάπησις· καὶ γὰρ χωρὶς τῆς χρείας ἀγαπῶνται δι᾽ αὑτάς, καὶ μάλιστα τῶν ἄλλων ἡ διὰ τῶν ὀμμάτων.'.tokenize
=> ["Πάντες", "ἄνθρωποι", "τοῦ", "εἰδέναι", "ὀρέγονται", "φύσει", ".", "σημεῖον", "δ᾽", "ἡ",
"τῶν", "αἰσθήσεων", "ἀγάπησις", "·", "καὶ", "γὰρ", "χωρὶς", "τῆς", "χρείας", "ἀγαπῶνται",
"δι᾽", "αὑτάς", ",", "καὶ", "μάλιστα", "τῶν", "ἄλλων", "ἡ", "διὰ", "τῶν", "ὀμμάτων", "."]
irb(main):004:0> 'Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος'.tokenize
=> ["Μῆνιν", "ἄειδε", "θεὰ", "Πηληϊάδεω", "Ἀχιλῆος"]
transliterate
(str → str)
This is highly experimental method to transliterate greek into latin letters. Users are likely to encounter bugs and edge-cases. Please, report them.
irb(main):005:0> 'Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος'.transliterate
=> "mēnin aeide thea pēlēiadeō achilēos"
irb(main):006:0> 'Πάντες ἄνθρωποι τοῦ εἰδέναι ὀρέγονται φύσει'.transliterate
=> "pantes anthrōpoi tou eidenai oregontai physei"
Unicode Inspection Methods
unicode_points
(str → array)
This method will return an array with unicode points (the Unicode mapping) of every character in the string.
irb(main):008:0> 'θεὰ'.unicode_points
=> ["\\u03B8", "\\u03B5", "\\u1F70"]
hash_dump
: (str → hash)
Same as unicode_points
, but returns a hash. Still experimental.
irb(main):009:0> str.hash_dump
=> {"ἄ"=>"\"\\u1F04\"", "ε"=>"\"\\u03B5\"", "ι"=>"\"\\u03B9\"", "δ"=>"\"\\u03B4\""}
unicode_name
(str → array)
This method will return an array with the unicode name of each character in the string.
irb(main):010:0> 'θεὰ'.unicode_name
=> ["GREEK SMALL LETTER THETA", "GREEK SMALL LETTER EPSILON", "GREEK SMALL LETTER ALPHA WITH VARIA"]
Unicode Normalization
Unicode Normalization is exceptionally important for Greek texts. It is used to normalize the text to a standard form, which is used by the computer to compare texts and for performing searches in a database.
nfd
: Canonical Decomposition (str → str)
This methods will decompose a string into its parts using the canonical decomposition method. This is useful for preparing a string to be used in searches. It will never damage the text by performing irreparable changes: a string can be recomposed using the canonical composition at any time.
This is our test string. Pay attention to the first character.
irb(main):011:0> str = 'ἄνθρωπος'
=> "ἄνθρωπος"
irb(main):012:0> str.unicode_name
=>
["GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA",
"GREEK SMALL LETTER NU",
"GREEK SMALL LETTER THETA",
"GREEK SMALL LETTER RHO",
"GREEK SMALL LETTER OMEGA",
"GREEK SMALL LETTER PI",
"GREEK SMALL LETTER OMICRON",
"GREEK SMALL LETTER FINAL SIGMA"]
Now, we decomposed the precomposed unicode characters.
irb(main):013:0> str = str.nfd
=> "ἄνθρωπος"
irb(main):014:0> str.unicode_name
=>
["GREEK SMALL LETTER ALPHA",
"COMBINING COMMA ABOVE",
"COMBINING ACUTE ACCENT",
"GREEK SMALL LETTER NU",
"GREEK SMALL LETTER THETA",
"GREEK SMALL LETTER RHO",
"GREEK SMALL LETTER OMEGA",
"GREEK SMALL LETTER PI",
"GREEK SMALL LETTER OMICRON",
"GREEK SMALL LETTER FINAL SIGMA"]
Notice how ἄ
(GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
) becomes α
(GREEK SMALL LETTER ALPHA
), ̓
(COMBINING COMMA ABOVE
), ́
(COMBINING ACUTE ACCENT
). If we decompose a string and then try to match a query against it, there will be no need to get the diacritics right and we'll only need the base-character.
nfc
(str → str)
Using the result string from the last example, we can compose the characters back into its precomposed form. α
(alpha), ̓
(smooth breathing), ́
(acute accent) will be composed back into a single character, that is, ἄ
(alpha with breathing and acute accent).
irb(main):015:0> str = str.nfc
=> "ἄνθρωπος"
irb(main):016:0> str.unicode_name
=>
["GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA",
"GREEK SMALL LETTER NU",
"GREEK SMALL LETTER THETA",
"GREEK SMALL LETTER RHO",
"GREEK SMALL LETTER OMEGA",
"GREEK SMALL LETTER PI",
"GREEK SMALL LETTER OMICRON",
"GREEK SMALL LETTER FINAL SIGMA"]
Diacritical marks
This is our example string for the next 3 methods.
irb(main):017:0> str = 'Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος'
=> "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"
no_downcase_diacritics
(str → str)
Remove from lowercase characters.
irb(main):018:0> str.no_downcase_diacritics
=> "Μηνιν αειδε θεα Πηληιαδεω Ἀχιληος"
no_upcase_diacritics
(str → str)
Remove from uppercase characters.
irb(main):019:0> str.no_upcase_diacritics
=> "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Αχιλῆος"
no_diacritics
(str → str)
Remove from all characters.
irb(main):020:0> str.no_diacritics
=> "Μηνιν αειδε θεα Πηληιαδεω Αχιληος"
Accents
to_grave
(str → str)
Change the acute for a grave accent. Alternative name: tonos_to_grave
irb(main):021:0> str = str.to_grave
=> "θεὰ"
to_acute
(str → str)
Change the grave for an acute accent. Alternative: grave_to_acute
irb(main):022:0> str = str.to_acute
=> "θεά"
to_oxia
(str → str)
Tonos → Oxia. This should also be self-explanatory, but only if one is aware of the existence of two different types of acute accent for Greek letters in the Unicode system. If you didn't know, now you do.
The tonos
was created when Greece adopted the monotonic system. It was considered a new kind of diacritical mark. Later, this changed and everyone agreed that it is, in fact, no different from the acute accent of polytonic greek.
When the Greek Extended Character Set was created specifically for polytonic Greek, however, another character was introduced to represent the acute accent. This character is called oxia
.
The end result? Both characters are visually impossible to distinguish. The tonos
is now the same as the oxia
and the standard way to represent the acute accent when it is the only diacritical mark of the base character. Whenever you are typing, if you include other diacritics, it will automatically turn into an oxia
. But, keep in mind that they have different code points, so one won't match against the other.
irb(main):023:0> str = str.to_oxia
=> "θεά"
irb(main):024:0> str.unicode_name
=> ["GREEK SMALL LETTER THETA", "GREEK SMALL LETTER EPSILON", "GREEK SMALL LETTER ALPHA WITH OXIA"]
to_tonos
(str → str)
Oxia → Tonos. If in doubt about whether to use oxia
or tonos
, the correct answer is tonos
. So, use this methods to convert an oxia
to a tonos
, in all cases where it should be used.
irb(main):025:0> str = str.to_tonos
=> "θεά"
irb(main):026:0> str.unicode_name
=> ["GREEK SMALL LETTER THETA", "GREEK SMALL LETTER EPSILON", "GREEK SMALL LETTER ALPHA WITH TONOS"]
See also
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake test
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and the created tag, and push the .gem
file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/bcdav/grc. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
License
The gem is available as open source under the terms of the MIT License.
Code of Conduct
Everyone interacting in the Grc project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.