EncodeM
π Version 3.0: Complete M language subscript encoding - numbers, strings, and composite keys!
Bringing the power of M language (MUMPS) subscript encoding to Ruby. Build hierarchical database keys like M("users", 42, "email")
with perfect sort order. Based on YottaDB/GT.M's 40-year production-tested algorithm.
Why You Should Use EncodeM
Version 3.0 brings complete M language subscript support! Not just numbers anymore - now you can encode strings and build powerful composite keys for hierarchical data structures.
If you're building anything that stores data in a database or key-value store, EncodeM is a game-changer. The magic is simple but powerful: when you encode values with EncodeM, the resulting byte strings maintain perfect sort order. This means your database can compare and sort without ever decoding - just pure byte comparison like strcmp().
What's New in v3.0:
- String encoding: Strings sort correctly after all numbers
-
Composite keys: Build hierarchical keys like
M("users", 42, "profile", "email")
- Full M compatibility: Generate YottaDB/GT.M compatible subscripts
- Mixed types: Combine numbers, strings, and more in a single key
Imagine building a user database where M("users", userId, "posts", postId)
creates perfectly sortable hierarchical keys. Or time-series data with M(2025, 1, 15, sensorId, "temperature")
. The encoding ensures all components sort correctly - numbers before strings, maintaining hierarchical order.
This is production-tested technology - literally the same encoding that's been processing medical records and financial transactions since the 1980s in YottaDB/GT.M systems. Epic (70% of US hospitals) and VistA use this exact algorithm for their global arrays. Drop it in, encode your data, and watch your database operations get faster.
About the M Language Heritage
The M language (formerly MUMPS - Massachusetts General Hospital Utility Multi-Programming System) has been powering critical healthcare and financial systems since 1966. Epic (70% of US hospitals), the VA's VistA, and numerous banking systems run on M. This gem extracts one of M's most clever innovations: a numeric encoding that maintains sort order in byte form.
Key Features
- Complete M Language Support: Numbers, strings, and composite keys
- Sortable Byte Encoding: All types encode to bytes that sort correctly without decoding
- Hierarchical Keys: Build multi-component database keys with perfect sort order
- Production-Tested: Algorithm proven in healthcare and finance for 40 years
- YottaDB Compatible: Generate valid YottaDB/GT.M subscripts
- Memory Efficient: Compact representation for all data types
- Database-Friendly: Perfect for B-tree indexes and key-value stores
Installation
Add to your Gemfile:
gem 'encode_m'
Or install directly:
$ gem install encode_m
Usage
Numbers (Classic M encoding)
require 'encode_m'
# Create numbers using the M() convenience method
a = M(42)
b = M(3.14) # Floats are truncated to integers
c = M(-100)
# Arithmetic works naturally
sum = a + b # => M(45)
product = a * M(2) # => M(84)
# The magic: encoded bytes sort correctly!
numbers = [M(5), M(-10), M(0), M(100), M(-5)]
sorted = numbers.sort # Correctly sorted: -10, -5, 0, 5, 100
# Perfect for databases - compare without decoding
encoded_a = a.to_encoded # => "\xBD\x2B"
encoded_b = b.to_encoded # => "\xBC\x04"
encoded_a < encoded_b # => false (42 > 3)
Strings (New in v3.0!)
# Encode strings - they sort after all numbers
name = M("Alice")
empty = M("") # Empty string
# M language ordering: all numbers < all strings
M(999999) < M("0") # => true
# String comparison maintains byte order
M("apple") < M("banana") # => true
Composite Keys (New in v3.0!)
# Build hierarchical database keys
user_email = M("users", 42, "email")
user_name = M("users", 42, "name")
user_post = M("users", 42, "posts", 1)
# Perfect for time-series data
event = M(2025, 1, 15, 14, 30, "sensor_123", "temperature")
# Keys sort hierarchically
keys = [
M("users", 2, "email"),
M("users", 1, "name"),
M("users", 1, "email"),
M("users", 2, "name")
].sort
# Result order:
# ["users", 1, "email"]
# ["users", 1, "name"]
# ["users", 2, "email"]
# ["users", 2, "name"]
# Access components
user_email[0].value # => "users"
user_email[1].value # => 42
user_email.to_a # => ["users", 42, "email"]
# Decode composite keys
encoded = user_email.to_encoded
decoded = EncodeM.decode_composite(encoded) # => ["users", 42, "email"]
Format Specification
EncodeM uses the complete M language subscript encoding that guarantees lexicographic byte ordering matches logical ordering for all data types.
Encoding Structure
0x00 KEY_DELIMITER (separates components in composite keys)
0x01 STR_SUB_ESCAPE (escape byte for strings)
------- NEGATIVE NUMBERS (decreasing magnitude) -------
0x3B -999,999,999 to -100,000,000 (9 digits)
0x3C -99,999,999 to -10,000,000 (8 digits)
0x3D -9,999,999 to -1,000,000 (7 digits)
0x3E -999,999 to -100,000 (6 digits)
0x3F -99,999 to -10,000 (5 digits)
0x40 -9,999 to -1,000 (4 digits)
0x41 -999 to -100 (3 digits)
0x42 -99 to -10 (2 digits)
0x43 -9 to -1 (1 digit)
------- ZERO -------
0x80 ZERO
------- POSITIVE NUMBERS (increasing magnitude) -------
0xBC 1 to 9 (1 digit)
0xBD 10 to 99 (2 digits)
0xBE 100 to 999 (3 digits)
0xBF 1,000 to 9,999 (4 digits)
0xC0 10,000 to 99,999 (5 digits)
0xC1 100,000 to 999,999 (6 digits)
0xC2 1,000,000 to 9,999,999 (7 digits)
0xC3 10,000,000 to 99,999,999 (8 digits)
0xC4 100,000,000 to 999,999,999 (9 digits)
------- STRINGS -------
0xFF STR_SUB_PREFIX (all strings start with this)
Numeric Encoding
- First byte: Determines sign and magnitude range
- Following bytes: Encode digit pairs (00-99) using lookup tables
-
Terminator: Negative numbers end with
0xFF
to maintain sort order
String Encoding
-
Prefix: All strings start with
0xFF
- Content: UTF-8 bytes of the string
-
Escaping: Special bytes are escaped:
-
0x00
β0x01 0xFF
-
0x01
β0x01 0xFE
-
Composite Key Encoding
-
Structure: Components separated by
0x00
(KEY_DELIMITER) - Ordering: Maintains hierarchical sort order
-
Example:
M("users", 42)
β[0xFF "users" 0x00 0xBD 0x2B]
Encoding Examples
Value | Hex Bytes | Description |
---|---|---|
-1000 | 3F FD EF FF |
4-digit negative |
-1 | 43 FB FF |
1-digit negative |
0 | 80 |
Zero (single byte) |
1 | BC 02 |
1-digit positive |
42 | BD 2B |
2-digit positive |
1000 | BF 0B 01 |
4-digit positive |
"Hello" | FF 48 65 6C 6C 6F |
String with 0xFF prefix |
"" | FF |
Empty string |
["users", 42] | FF 75 73 65 72 73 00 BD 2B |
Composite key |
[2025, 1, 15] | BF 14 19 00 BC 02 00 BD 10 |
Date as composite |
The encoding ensures:
bytewise_compare(encode(x), encode(y)) == logical_compare(x, y)
- All numbers sort before all strings
- Composite keys maintain hierarchical order
Ordering Guarantees
EncodeM provides strict total ordering across all encodable values:
-
Mathematical guarantee: For any numbers x and y:
x < y βΊ encode(x) < encode(y)
(bytewise) - Sign ordering: All negatives < zero < all positives
- Magnitude ordering: Within each sign, magnitude determines order
- Deterministic: Same input always produces same output
- Stable: No special cases or exceptions
This enables direct byte comparison in databases without decoding.
API Reference
Core Methods
Method | Description | Example |
---|---|---|
M(value) |
Create encoded value |
M(42) , M("hello")
|
M(*values) |
Create composite key | M("users", 42, "email") |
EncodeM.new(value) |
Create encoded value | EncodeM.new(42) |
EncodeM.new(*values) |
Create composite key | EncodeM.new("users", 42) |
EncodeM.decode(bytes) |
Decode bytes to value |
EncodeM.decode("\xBD\x2B") β 42
|
EncodeM.decode_composite(bytes) |
Decode composite key | Returns array of components |
#to_encoded |
Get encoded byte string |
M(42).to_encoded β "\xBD\x2B"
|
#value |
Get original value |
M(42).value β 42
|
#to_a |
Get composite components |
M("a", 1).to_a β ["a", 1]
|
Arithmetic Operations
Operation | Description | Example |
---|---|---|
+ |
Addition |
M(10) + M(5) β M(15)
|
- |
Subtraction |
M(10) - M(3) β M(7)
|
* |
Multiplication |
M(4) * M(3) β M(12)
|
/ |
Division |
M(10) / M(2) β M(5)
|
** |
Exponentiation |
M(2) ** M(3) β M(8)
|
Comparison Operations
Operation | Description | Example |
---|---|---|
< |
Less than |
M(5) < M(10) β true
|
> |
Greater than |
M(10) > M(5) β true
|
== |
Equality |
M(42) == M(42) β true
|
<= |
Less or equal |
M(5) <= M(5) β true
|
>= |
Greater or equal |
M(10) >= M(5) β true
|
<=> |
Spaceship operator |
M(5) <=> M(10) β -1
|
Numeric Methods
Method | Description | Example |
---|---|---|
#to_i |
Convert to Integer |
M(3.14).to_i β 3
|
#to_f |
Convert to Float |
M(42).to_f β 42.0
|
#to_s |
Convert to String |
M(42).to_s β "42"
|
#zero? |
Check if zero |
M(0).zero? β true
|
#positive? |
Check if positive |
M(42).positive? β true
|
#negative? |
Check if negative |
M(-5).negative? β true
|
String Methods
Method | Description | Example |
---|---|---|
#to_s |
Get string value |
M("hello").to_s β "hello"
|
#length |
String length |
M("hello").length β 5
|
#empty? |
Check if empty |
M("").empty? β true
|
Composite Methods
Method | Description | Example |
---|---|---|
#[] |
Access component |
M("a", 1)[0] β M("a")
|
#length |
Number of components |
M("a", 1, "b").length β 3
|
#to_a |
Get all components |
M("a", 1).to_a β ["a", 1]
|
Edge Cases & Limits
Supported Values
- Integers: Full range up to 18 digits
- Floats: Truncated to integers (M language design)
- Strings: Any UTF-8 string, with automatic escaping
- Composite Keys: Unlimited components of mixed types
-
Zero: Handled as special case (single byte:
0x80
) - Negative numbers: Full support with proper ordering
-
Nil: Converted to empty string
""
Not Supported
-
NaN: Raises
ArgumentError
-
Infinity: Raises
ArgumentError
-
Non-numeric strings: Raises
ArgumentError
unless parseable -
nil: Raises
ArgumentError
- Numbers > 18 digits: Precision loss may occur
Behavior Notes
- Mixed arithmetic with Ruby numbers works via coercion
- Immutable objects (create new instances, don't modify)
- Thread-safe (no shared mutable state)
- No locale dependencies (pure byte operations)
Why EncodeM?
Traditional numeric types force compromises:
Type | Speed | Precision | Memory | Sortable as Bytes |
---|---|---|---|---|
Integer | β‘οΈ Fast | β Exact | β Efficient | β No |
Float | β‘οΈ Fast | β Limited | β Efficient | β No |
BigDecimal | β Slow | β Unlimited | β Heavy | β No |
EncodeM | β Good | β 18 digits | β Efficient | β Yes! |
EncodeM's unique advantage: encoded bytes maintain sort order, enabling:
- Direct byte comparison in databases
- Efficient indexing without decoding
- Fast range queries on encoded data
Performance Characteristics
Storage Efficiency
- Small integers (1-99): 2 bytes (vs 8 for Float)
- Common range (-999 to 999): 2-3 bytes
- Typical numbers (-10^9 to 10^9): 4-6 bytes
- Maximum 18 digits: Variable length encoding
Benchmark Results
Database sorting benchmark (1000 numbers):
- EncodeM (direct byte sort): 8,459 ops/sec
- Float (decodeβsortβencode): 3,003 ops/sec (2.8x slower)
- BigDecimal (parseβsortβstring): 939 ops/sec (9x slower)
Range query benchmark (find values between -100 and 100):
- EncodeM (byte comparison): 10,355 ops/sec
- Float (decode & filter): 5,526 ops/sec (1.9x slower)
Run benchmarks yourself: ruby -I lib test/benchmark_database.rb
Database & KV Store Usage
Direct Byte Comparison for Range Queries
# Store encoded numbers as keys in LMDB/RocksDB
db[M(100).to_encoded] = "user:100"
db[M(200).to_encoded] = "user:200"
db[M(300).to_encoded] = "user:300"
# Range query without decoding - pure byte comparison!
lower = M(150).to_encoded
upper = M(250).to_encoded
db.range(lower, upper) # Returns user:200
Composite Keys with Sort Order Preserved
# Timestamp + ID composite key
def make_key(timestamp, id)
M(timestamp).to_encoded + M(id).to_encoded
end
# These sort correctly by timestamp, then by ID
key1 = make_key(1699564800, 42) # Nov 9, 2023 + ID 42
key2 = make_key(1699564800, 100) # Nov 9, 2023 + ID 100
key3 = make_key(1699651200, 1) # Nov 10, 2023 + ID 1
# Byte comparison gives correct chronological order
[key3, key1, key2].sort == [key1, key2, key3] # => true
Production Notes
Thread Safety
- Immutable objects: All EncodeM instances are immutable
- No shared state: Safe for concurrent use across threads
- Pure functions: Encoding/decoding have no side effects
Determinism & Portability
- Deterministic encoding: Same input β same bytes, always
- Architecture independent: No endianness issues
- No locale dependencies: Pure byte operations
- Ruby version stable: Tested on Ruby 2.5+ through 3.4
Quality Assurance
- Test coverage: Comprehensive test suite with edge cases
- Monotonicity verified: Ordering guaranteed by property tests
- Round-trip validation: All values encode/decode perfectly
- 40-year production history: Algorithm battle-tested in healthcare
Performance Considerations
- Zero allocations for comparison operations
- Lazy decoding: Compare/sort without materializing numbers
- Cache-friendly: Sequential byte comparison is CPU-optimal
- GC-friendly: Small objects, minimal memory pressure
Use Cases
- Financial Systems: More precision than Float, faster than BigDecimal
- Database Indexing: Sort encoded bytes directly
- Time-Series Data: Efficient storage with natural ordering
- Healthcare Systems: Proven in Epic, VistA, and other M-based systems
- High-Volume Processing: Efficient encoding for billions of records
- Cross-System Integration: Compatible with M language databases
References & Attribution
Algorithm Heritage
This gem implements the numeric encoding algorithm from YottaDB and GT.M, which has been proven in production systems for nearly 40 years.
Algorithm Credit:
- Original design: Greystone Technology Corporation (1980s)
- Current implementations: YottaDB (AGPLv3) and GT.M
- Production proven in Epic, VistA, and Profile banking systems
Ruby Implementation:
- Author: Steve Shreeve (steve.shreeve@gmail.com)
- Implementation assistance: Claude Opus 4.1 (Anthropic)
- Clean-room reimplementation: This is an independent implementation of the algorithm concept, not a code translation
Technical References
- YottaDB Collation Documentation - M language collation sequences
- YottaDB Programmer's Guide - General M language reference
- MUMPS Wikipedia - Overview of M language history
Development
After checking out the repo, run:
bundle install
rake test
Running Benchmarks
The gem includes two benchmark scripts in the test/
directory:
# Performance benchmark - arithmetic and sorting operations
ruby -I lib test/benchmark.rb
# Database use case benchmark - demonstrates key benefits
ruby -I lib test/benchmark_database.rb
Note: You may need to install bigdecimal
gem for Ruby 3.4+:
gem install bigdecimal
Building and Installing
To install this gem locally:
bundle exec rake install
To release a new version:
bundle exec rake release
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/shreeve/encode_m.
License
The gem is available as open source under the terms of the MIT License.
Acknowledgments
Special thanks to:
- The YottaDB team for maintaining and open-sourcing this technology
- The M language community for decades of innovation in database technology
- Anthropic's Claude Opus 4.1 for assistance with the Ruby implementation