Project

encode_m

0.0
No release in over 3 years
EncodeM v3.0 brings complete M language (MUMPS) subscript encoding to Ruby, supporting numbers, strings, and composite keys with perfect sort order. Build hierarchical database keys like M("users", 42, "email") that sort correctly as raw bytes. This 40-year production-tested algorithm from YottaDB/GT.M powers Epic (70% of US hospitals) and VistA. Perfect for B-tree indexes, key-value stores, and any system requiring sortable hierarchical keys. All types maintain correct ordering when compared as byte strings - no decoding needed.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies

Development

 Project Readme

EncodeM

Gem Version MIT License

πŸŽ‰ Version 3.0: Complete M language subscript encoding - numbers, strings, and composite keys!

Bringing the power of M language (MUMPS) subscript encoding to Ruby. Build hierarchical database keys like M("users", 42, "email") with perfect sort order. Based on YottaDB/GT.M's 40-year production-tested algorithm.

Why You Should Use EncodeM

Version 3.0 brings complete M language subscript support! Not just numbers anymore - now you can encode strings and build powerful composite keys for hierarchical data structures.

If you're building anything that stores data in a database or key-value store, EncodeM is a game-changer. The magic is simple but powerful: when you encode values with EncodeM, the resulting byte strings maintain perfect sort order. This means your database can compare and sort without ever decoding - just pure byte comparison like strcmp().

What's New in v3.0:

  • String encoding: Strings sort correctly after all numbers
  • Composite keys: Build hierarchical keys like M("users", 42, "profile", "email")
  • Full M compatibility: Generate YottaDB/GT.M compatible subscripts
  • Mixed types: Combine numbers, strings, and more in a single key

Imagine building a user database where M("users", userId, "posts", postId) creates perfectly sortable hierarchical keys. Or time-series data with M(2025, 1, 15, sensorId, "temperature"). The encoding ensures all components sort correctly - numbers before strings, maintaining hierarchical order.

This is production-tested technology - literally the same encoding that's been processing medical records and financial transactions since the 1980s in YottaDB/GT.M systems. Epic (70% of US hospitals) and VistA use this exact algorithm for their global arrays. Drop it in, encode your data, and watch your database operations get faster.

About the M Language Heritage

The M language (formerly MUMPS - Massachusetts General Hospital Utility Multi-Programming System) has been powering critical healthcare and financial systems since 1966. Epic (70% of US hospitals), the VA's VistA, and numerous banking systems run on M. This gem extracts one of M's most clever innovations: a numeric encoding that maintains sort order in byte form.

Key Features

  • Complete M Language Support: Numbers, strings, and composite keys
  • Sortable Byte Encoding: All types encode to bytes that sort correctly without decoding
  • Hierarchical Keys: Build multi-component database keys with perfect sort order
  • Production-Tested: Algorithm proven in healthcare and finance for 40 years
  • YottaDB Compatible: Generate valid YottaDB/GT.M subscripts
  • Memory Efficient: Compact representation for all data types
  • Database-Friendly: Perfect for B-tree indexes and key-value stores

Installation

Add to your Gemfile:

gem 'encode_m'

Or install directly:

$ gem install encode_m

Usage

Numbers (Classic M encoding)

require 'encode_m'

# Create numbers using the M() convenience method
a = M(42)
b = M(3.14)      # Floats are truncated to integers
c = M(-100)

# Arithmetic works naturally
sum = a + b        # => M(45)
product = a * M(2) # => M(84)

# The magic: encoded bytes sort correctly!
numbers = [M(5), M(-10), M(0), M(100), M(-5)]
sorted = numbers.sort  # Correctly sorted: -10, -5, 0, 5, 100

# Perfect for databases - compare without decoding
encoded_a = a.to_encoded  # => "\xBD\x2B"
encoded_b = b.to_encoded  # => "\xBC\x04"
encoded_a < encoded_b      # => false (42 > 3)

Strings (New in v3.0!)

# Encode strings - they sort after all numbers
name = M("Alice")
empty = M("")        # Empty string

# M language ordering: all numbers < all strings
M(999999) < M("0")   # => true

# String comparison maintains byte order
M("apple") < M("banana")  # => true

Composite Keys (New in v3.0!)

# Build hierarchical database keys
user_email = M("users", 42, "email")
user_name = M("users", 42, "name")
user_post = M("users", 42, "posts", 1)

# Perfect for time-series data
event = M(2025, 1, 15, 14, 30, "sensor_123", "temperature")

# Keys sort hierarchically
keys = [
  M("users", 2, "email"),
  M("users", 1, "name"),
  M("users", 1, "email"),
  M("users", 2, "name")
].sort
# Result order:
# ["users", 1, "email"]
# ["users", 1, "name"]
# ["users", 2, "email"]
# ["users", 2, "name"]

# Access components
user_email[0].value  # => "users"
user_email[1].value  # => 42
user_email.to_a      # => ["users", 42, "email"]

# Decode composite keys
encoded = user_email.to_encoded
decoded = EncodeM.decode_composite(encoded)  # => ["users", 42, "email"]

Format Specification

EncodeM uses the complete M language subscript encoding that guarantees lexicographic byte ordering matches logical ordering for all data types.

Encoding Structure

0x00      KEY_DELIMITER (separates components in composite keys)
0x01      STR_SUB_ESCAPE (escape byte for strings)
------- NEGATIVE NUMBERS (decreasing magnitude) -------
0x3B      -999,999,999 to -100,000,000 (9 digits)
0x3C      -99,999,999 to -10,000,000 (8 digits)
0x3D      -9,999,999 to -1,000,000 (7 digits)
0x3E      -999,999 to -100,000 (6 digits)
0x3F      -99,999 to -10,000 (5 digits)
0x40      -9,999 to -1,000 (4 digits)
0x41      -999 to -100 (3 digits)
0x42      -99 to -10 (2 digits)
0x43      -9 to -1 (1 digit)
------- ZERO -------
0x80      ZERO
------- POSITIVE NUMBERS (increasing magnitude) -------
0xBC      1 to 9 (1 digit)
0xBD      10 to 99 (2 digits)
0xBE      100 to 999 (3 digits)
0xBF      1,000 to 9,999 (4 digits)
0xC0      10,000 to 99,999 (5 digits)
0xC1      100,000 to 999,999 (6 digits)
0xC2      1,000,000 to 9,999,999 (7 digits)
0xC3      10,000,000 to 99,999,999 (8 digits)
0xC4      100,000,000 to 999,999,999 (9 digits)
------- STRINGS -------
0xFF      STR_SUB_PREFIX (all strings start with this)

Numeric Encoding

  • First byte: Determines sign and magnitude range
  • Following bytes: Encode digit pairs (00-99) using lookup tables
  • Terminator: Negative numbers end with 0xFF to maintain sort order

String Encoding

  • Prefix: All strings start with 0xFF
  • Content: UTF-8 bytes of the string
  • Escaping: Special bytes are escaped:
    • 0x00 β†’ 0x01 0xFF
    • 0x01 β†’ 0x01 0xFE

Composite Key Encoding

  • Structure: Components separated by 0x00 (KEY_DELIMITER)
  • Ordering: Maintains hierarchical sort order
  • Example: M("users", 42) β†’ [0xFF "users" 0x00 0xBD 0x2B]

Encoding Examples

Value Hex Bytes Description
-1000 3F FD EF FF 4-digit negative
-1 43 FB FF 1-digit negative
0 80 Zero (single byte)
1 BC 02 1-digit positive
42 BD 2B 2-digit positive
1000 BF 0B 01 4-digit positive
"Hello" FF 48 65 6C 6C 6F String with 0xFF prefix
"" FF Empty string
["users", 42] FF 75 73 65 72 73 00 BD 2B Composite key
[2025, 1, 15] BF 14 19 00 BC 02 00 BD 10 Date as composite

The encoding ensures:

  • bytewise_compare(encode(x), encode(y)) == logical_compare(x, y)
  • All numbers sort before all strings
  • Composite keys maintain hierarchical order

Ordering Guarantees

EncodeM provides strict total ordering across all encodable values:

  • Mathematical guarantee: For any numbers x and y: x < y ⟺ encode(x) < encode(y) (bytewise)
  • Sign ordering: All negatives < zero < all positives
  • Magnitude ordering: Within each sign, magnitude determines order
  • Deterministic: Same input always produces same output
  • Stable: No special cases or exceptions

This enables direct byte comparison in databases without decoding.

API Reference

Core Methods

Method Description Example
M(value) Create encoded value M(42), M("hello")
M(*values) Create composite key M("users", 42, "email")
EncodeM.new(value) Create encoded value EncodeM.new(42)
EncodeM.new(*values) Create composite key EncodeM.new("users", 42)
EncodeM.decode(bytes) Decode bytes to value EncodeM.decode("\xBD\x2B") β†’ 42
EncodeM.decode_composite(bytes) Decode composite key Returns array of components
#to_encoded Get encoded byte string M(42).to_encoded β†’ "\xBD\x2B"
#value Get original value M(42).value β†’ 42
#to_a Get composite components M("a", 1).to_a β†’ ["a", 1]

Arithmetic Operations

Operation Description Example
+ Addition M(10) + M(5) β†’ M(15)
- Subtraction M(10) - M(3) β†’ M(7)
* Multiplication M(4) * M(3) β†’ M(12)
/ Division M(10) / M(2) β†’ M(5)
** Exponentiation M(2) ** M(3) β†’ M(8)

Comparison Operations

Operation Description Example
< Less than M(5) < M(10) β†’ true
> Greater than M(10) > M(5) β†’ true
== Equality M(42) == M(42) β†’ true
<= Less or equal M(5) <= M(5) β†’ true
>= Greater or equal M(10) >= M(5) β†’ true
<=> Spaceship operator M(5) <=> M(10) β†’ -1

Numeric Methods

Method Description Example
#to_i Convert to Integer M(3.14).to_i β†’ 3
#to_f Convert to Float M(42).to_f β†’ 42.0
#to_s Convert to String M(42).to_s β†’ "42"
#zero? Check if zero M(0).zero? β†’ true
#positive? Check if positive M(42).positive? β†’ true
#negative? Check if negative M(-5).negative? β†’ true

String Methods

Method Description Example
#to_s Get string value M("hello").to_s β†’ "hello"
#length String length M("hello").length β†’ 5
#empty? Check if empty M("").empty? β†’ true

Composite Methods

Method Description Example
#[] Access component M("a", 1)[0] β†’ M("a")
#length Number of components M("a", 1, "b").length β†’ 3
#to_a Get all components M("a", 1).to_a β†’ ["a", 1]

Edge Cases & Limits

Supported Values

  • Integers: Full range up to 18 digits
  • Floats: Truncated to integers (M language design)
  • Strings: Any UTF-8 string, with automatic escaping
  • Composite Keys: Unlimited components of mixed types
  • Zero: Handled as special case (single byte: 0x80)
  • Negative numbers: Full support with proper ordering
  • Nil: Converted to empty string ""

Not Supported

  • NaN: Raises ArgumentError
  • Infinity: Raises ArgumentError
  • Non-numeric strings: Raises ArgumentError unless parseable
  • nil: Raises ArgumentError
  • Numbers > 18 digits: Precision loss may occur

Behavior Notes

  • Mixed arithmetic with Ruby numbers works via coercion
  • Immutable objects (create new instances, don't modify)
  • Thread-safe (no shared mutable state)
  • No locale dependencies (pure byte operations)

Why EncodeM?

Traditional numeric types force compromises:

Type Speed Precision Memory Sortable as Bytes
Integer ⚑️ Fast βœ… Exact βœ… Efficient ❌ No
Float ⚑️ Fast ❌ Limited βœ… Efficient ❌ No
BigDecimal ❌ Slow βœ… Unlimited ❌ Heavy ❌ No
EncodeM βœ… Good βœ… 18 digits βœ… Efficient βœ… Yes!

EncodeM's unique advantage: encoded bytes maintain sort order, enabling:

  • Direct byte comparison in databases
  • Efficient indexing without decoding
  • Fast range queries on encoded data

Performance Characteristics

Storage Efficiency

  • Small integers (1-99): 2 bytes (vs 8 for Float)
  • Common range (-999 to 999): 2-3 bytes
  • Typical numbers (-10^9 to 10^9): 4-6 bytes
  • Maximum 18 digits: Variable length encoding

Benchmark Results

Database sorting benchmark (1000 numbers):

  • EncodeM (direct byte sort): 8,459 ops/sec
  • Float (decodeβ†’sortβ†’encode): 3,003 ops/sec (2.8x slower)
  • BigDecimal (parseβ†’sortβ†’string): 939 ops/sec (9x slower)

Range query benchmark (find values between -100 and 100):

  • EncodeM (byte comparison): 10,355 ops/sec
  • Float (decode & filter): 5,526 ops/sec (1.9x slower)

Run benchmarks yourself: ruby -I lib test/benchmark_database.rb

Database & KV Store Usage

Direct Byte Comparison for Range Queries

# Store encoded numbers as keys in LMDB/RocksDB
db[M(100).to_encoded] = "user:100"
db[M(200).to_encoded] = "user:200"
db[M(300).to_encoded] = "user:300"

# Range query without decoding - pure byte comparison!
lower = M(150).to_encoded
upper = M(250).to_encoded
db.range(lower, upper)  # Returns user:200

Composite Keys with Sort Order Preserved

# Timestamp + ID composite key
def make_key(timestamp, id)
  M(timestamp).to_encoded + M(id).to_encoded
end

# These sort correctly by timestamp, then by ID
key1 = make_key(1699564800, 42)   # Nov 9, 2023 + ID 42
key2 = make_key(1699564800, 100)  # Nov 9, 2023 + ID 100
key3 = make_key(1699651200, 1)    # Nov 10, 2023 + ID 1

# Byte comparison gives correct chronological order
[key3, key1, key2].sort == [key1, key2, key3]  # => true

Production Notes

Thread Safety

  • Immutable objects: All EncodeM instances are immutable
  • No shared state: Safe for concurrent use across threads
  • Pure functions: Encoding/decoding have no side effects

Determinism & Portability

  • Deterministic encoding: Same input β†’ same bytes, always
  • Architecture independent: No endianness issues
  • No locale dependencies: Pure byte operations
  • Ruby version stable: Tested on Ruby 2.5+ through 3.4

Quality Assurance

  • Test coverage: Comprehensive test suite with edge cases
  • Monotonicity verified: Ordering guaranteed by property tests
  • Round-trip validation: All values encode/decode perfectly
  • 40-year production history: Algorithm battle-tested in healthcare

Performance Considerations

  • Zero allocations for comparison operations
  • Lazy decoding: Compare/sort without materializing numbers
  • Cache-friendly: Sequential byte comparison is CPU-optimal
  • GC-friendly: Small objects, minimal memory pressure

Use Cases

  • Financial Systems: More precision than Float, faster than BigDecimal
  • Database Indexing: Sort encoded bytes directly
  • Time-Series Data: Efficient storage with natural ordering
  • Healthcare Systems: Proven in Epic, VistA, and other M-based systems
  • High-Volume Processing: Efficient encoding for billions of records
  • Cross-System Integration: Compatible with M language databases

References & Attribution

Algorithm Heritage

This gem implements the numeric encoding algorithm from YottaDB and GT.M, which has been proven in production systems for nearly 40 years.

Algorithm Credit:

  • Original design: Greystone Technology Corporation (1980s)
  • Current implementations: YottaDB (AGPLv3) and GT.M
  • Production proven in Epic, VistA, and Profile banking systems

Ruby Implementation:

  • Author: Steve Shreeve (steve.shreeve@gmail.com)
  • Implementation assistance: Claude Opus 4.1 (Anthropic)
  • Clean-room reimplementation: This is an independent implementation of the algorithm concept, not a code translation

Technical References

Development

After checking out the repo, run:

bundle install
rake test

Running Benchmarks

The gem includes two benchmark scripts in the test/ directory:

# Performance benchmark - arithmetic and sorting operations
ruby -I lib test/benchmark.rb

# Database use case benchmark - demonstrates key benefits
ruby -I lib test/benchmark_database.rb

Note: You may need to install bigdecimal gem for Ruby 3.4+:

gem install bigdecimal

Building and Installing

To install this gem locally:

bundle exec rake install

To release a new version:

bundle exec rake release

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/shreeve/encode_m.

License

The gem is available as open source under the terms of the MIT License.

Acknowledgments

Special thanks to:

  • The YottaDB team for maintaining and open-sourcing this technology
  • The M language community for decades of innovation in database technology
  • Anthropic's Claude Opus 4.1 for assistance with the Ruby implementation