Floating-Point Formats

Introduction

Float-Formats is a Ruby package with methods to handle diverse floating-point formats. These are some of the things that can be done with it:

  • Encoding and decoding numerical values in specific floating point representations.
  • Conversion of floating-point data between different formats.
  • Obtaining properties of floating-point formats (ranges, precision, etc.)
  • Exploring and learning about floating point representations.
  • Definition and testing of new floating-point formats.

Installation

To install the gem manually:

gem install float-formats

The source code is hosted on GitHub.

Predefined formats

A number of common formats are defined as constants in the Flt module:

IEEE 754-2008

Binary floating point representations in little endian order:

  • IEEE_binary16 (half precision),
  • IEEE_binary32 (single precision),
  • IEEE_binary64 (double precision),
  • IEEE_binary80 (extended),
  • IEEE_binary128 (quadruple precision),

and in big endian order: IEEE_binary16_BE, etc.
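
The difference between the two byte orders can be seen without the gem. The following is a minimal sketch that uses only Ruby's core `Array#pack` (not the float-formats API); `hex_bytes` is a hypothetical helper name, and the sketch assumes that the `_BE` formats differ from the default ones only in the order of the stored bytes.

```ruby
# Encode 0.1 as IEEE binary32 in both byte orders using core Ruby only.
# Pack directive 'e' is single precision, little endian;
# pack directive 'g' is single precision, big endian.
def hex_bytes(packed)
  packed.bytes.map { |b| format('%02X', b) }.join(' ')
end

little = hex_bytes([0.1].pack('e'))
big    = hex_bytes([0.1].pack('g'))

puts little   # -> CD CC CC 3D
puts big      # -> 3D CC CC CD
```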

Decimal formats (using DPD):

  • IEEE_decimal32, IEEE_decimal64 and IEEE_decimal128.

Interchange binary & decimal formats:

  • IEEE_binary256, IEEE_binary512, IEEE_binary1024, IEEE_decimal192, IEEE_decimal256.

Others can be defined with IEEE.interchange_binary and IEEE.interchange_decimal (see the IEEE module).

Legacy

Formats of historical interest, some of which are found in file formats still in use.

Mainframe/supercomputer formats: Univac 1100 (UNIVAC_SINGLE, UNIVAC_DOUBLE), IBM 360 etc. (IBM32, IBM64 and IBM128), CDC 6600/7600: (CDC_SINGLE, CDC_DOUBLE), Cray-1: (CRAY).

Minis: PDP11 and Vaxes: (PDP11_F, PDP11_D, VAX_F, VAX_D, VAX_G and VAX_H), HP3000: (XS256, XS256_DOUBLE), Wang 2200: (WANG2200).

Microcomputers (software implementations): Apple II: (APPLE), Microsoft Basic, Spectrum, etc.: (XS128), Microsoft Quickbasic: (MBF_SINGLE, MBF_DOUBLE), Borland Pascal: (BORLAND48).

Embedded systems: Formats used in the Intel 8051 by the C51 compiler: (C51_BCD_FLOAT, C51_BCD_DOUBLE and C51_BCD_LONG_DOUBLE).

Minifloats: AI formats: (BFLOAT16, MSFP8, MSFP9, MSFP10, MSFP11), Khronos Vulkan unsigned formats: (KHRONOS_VULKAN_UNSIGNED10, KHRONOS_VULKAN_UNSIGNED11).

Calculators

Formats used in HP RPL calculators: (RPL, RPL_X), HP-71B formats (HP71B, HP71B_X) and classic HP 10 digit calculators: (HP_CLASSIC).

Using the pre-defined formats

require 'rubygems'
require 'float-formats'
include Flt

The properties of the floating point formats can be queried (which can be used for tables or reports comparing different formats):

Size in bits of the representations:

puts IEEE_binary32.total_bits                        # -> 32

Numeric radix:

puts IEEE_binary32.radix                             # -> 2

Digits of precision (radix-based):

puts IEEE_binary32.significand_digits                # -> 24

Minimum and maximum values of the radix-based exponent:

puts IEEE_binary32.radix_min_exp                     # -> -126
puts IEEE_binary32.radix_max_exp                     # -> 127

Decimal precision:

puts IEEE_binary32.decimal_digits_stored             # -> 6
puts IEEE_binary32.decimal_digits_necessary          # -> 9

Minimum and maximum decimal exponents:

puts IEEE_binary32.decimal_min_exp                   # -> -37
puts IEEE_binary32.decimal_max_exp                   # -> 38
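
The decimal figures above follow from the binary parameters of the format. The sketch below reproduces them with plain Ruby arithmetic; it does not call the gem, and `decimal_properties` and its keyword names are hypothetical. It assumes a binary format with the given precision (in bits, hidden bit included) and normalized exponent range.

```ruby
# Derive decimal properties of a binary format from its precision and the
# minimum/maximum exponents of normalized values (value = 1.fff * 2**exp).
def decimal_properties(precision:, min_exp:, max_exp:)
  log2 = Math.log10(2)
  max_value = (2 - 2.0**(1 - precision)) * 2.0**max_exp
  {
    digits_stored:    ((precision - 1) * log2).floor, # decimal digits that always survive a round trip
    digits_necessary: (1 + precision * log2).ceil,    # decimal digits needed to identify any value
    decimal_min_exp:  (min_exp * log2).ceil,          # smallest power of 10 that is normalized
    decimal_max_exp:  Math.log10(max_value).floor     # largest representable power of 10
  }
end

props = decimal_properties(precision: 24, min_exp: -126, max_exp: 127)
puts props[:digits_stored]      # -> 6
puts props[:digits_necessary]   # -> 9
puts props[:decimal_min_exp]    # -> -37
puts props[:decimal_max_exp]    # -> 38
```

With `precision: 53, min_exp: -1022, max_exp: 1023` the same function yields the values Ruby exposes as `Float::DIG`, `Float::MIN_10_EXP` and `Float::MAX_10_EXP`.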

Encode and decode numbers

For each floating-point format class there is a constructor method with the same name which can build a floating-point value from a variety of parameters:

  • Using three integers: the sign (+1 for +, -1 for -), the significand (coefficient or mantissa) and the exponent.
  • From a text numeral (with an optional Numerals format specifier)
  • From a number: converts a numerical value to a floating point representation.

Examples:

File.open('binary_file.dat','wb'){|f| f.write IEEE_binary80('0.1').to_bytes}
puts IEEE_binary80('0.1').to_hex(true)           # -> CD CC CC CC CC CC CC CC FB 3F
puts IEEE_binary80(0.1).to_hex(true)             # -> CD CC CC CC CC CC CC CC FB 3F
puts IEEE_binary80(+1,123,-2).to_hex(true)       # -> 00 00 00 00 00 00 00 F6 03 40
puts IEEE_decimal32('1.234').to_hex(true)        # -> 22 20 05 34
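
To see where these bytes come from, the sketch below builds the 80-bit extended encoding by hand from a (sign, significand, exponent) triple, where the value is sign × significand × 2^exponent. It uses only core Ruby, not the gem; `binary80_hex` is a hypothetical helper, and it assumes a nonzero significand of at most 64 bits and a result in the normalized range (no zeros, denormals, infinities or NaNs).

```ruby
# Encode sign * significand * 2**exponent as little-endian binary80 bytes.
# Layout (least to most significant): 64-bit significand with an explicit
# integer bit, 15-bit exponent with bias 16383, 1 sign bit.
def binary80_hex(sign, significand, exponent)
  shift = 64 - significand.bit_length  # normalize: make bit 63 the leading 1
  significand <<= shift
  exponent -= shift
  biased = exponent + 63 + 16383       # +63 moves the radix point after the top bit
  bits = ((sign < 0 ? 1 : 0) << 79) | (biased << 64) | significand
  Array.new(10) { |i| format('%02X', (bits >> (8 * i)) & 0xFF) }.join(' ')
end

puts binary80_hex(+1, 123, -2)                    # -> 00 00 00 00 00 00 00 F6 03 40
puts binary80_hex(+1, 14757395258967641293, -67)  # -> CD CC CC CC CC CC CC CC FB 3F
```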

A floating-point encoded value can be converted to useful formats with the to_ and similar methods:

  • split (split as integral sign, significand, exponent)
  • to_text
  • to(num_class)

Examples:

v = IEEE_binary80.from_bytes(File.read('binary_file.dat'))
puts v.to(Rational)                              # -> 1/10
puts v.split.inspect                             # -> [1, 14757395258967641293, -67]
puts v.to_text                                   # -> 0.1
puts v.to(Float)                                 # -> 0.1
puts v.to_hex                                    # -> CDCCCCCCCCCCCCCCFB3F
puts v.to_bits                                   # -> 00111111111110111100110011001100110011001100110011001100110011001100110011001101
puts v.to_bits_text(16)                          # -> 3ffbcccccccccccccccd
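
The result of split can be checked by decoding the bytes manually. This is a minimal sketch in core Ruby (it does not use the gem); `split_binary80` is a hypothetical helper that handles only normalized values and takes the little-endian hexadecimal dump shown earlier.

```ruby
# Decode little-endian binary80 bytes into [sign, significand, exponent],
# with the significand taken as a 64-bit integer.
def split_binary80(hex)
  bytes = [hex.delete(' ')].pack('H*').bytes            # byte values, least significant first
  bits  = bytes.reverse.inject(0) { |acc, b| (acc << 8) | b }
  significand = bits & ((1 << 64) - 1)                  # low 64 bits
  biased      = (bits >> 64) & 0x7FFF                   # next 15 bits
  sign        = (bits >> 79).zero? ? 1 : -1             # top bit
  [sign, significand, biased - 16383 - 63]              # remove bias, treat significand as integer
end

p split_binary80('CD CC CC CC CC CC CC CC FB 3F')  # -> [1, 14757395258967641293, -67]
```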

Special values:

Let's show the decimal expression of some interesting values using 3 significant digits:

fmt = Numerals::Format[mode: :general, rounding: [precision: 3]]
puts IEEE_SINGLE.min_value.to_text(fmt)             # -> 1e-45
puts IEEE_SINGLE.min_normalized_value.to_text(fmt)  # -> 1.18e-38
puts IEEE_SINGLE.max_value.to_text(fmt)             # -> 3.40e38
puts IEEE_SINGLE.epsilon.to_text(fmt)               # -> 1.19e-7
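
These special values correspond to specific bit patterns of the single precision format. As a cross-check that does not depend on the gem, the sketch below reinterprets those patterns with core `pack`/`unpack1`; `binary32_from_bits` is a hypothetical helper, and the formatting uses `Kernel#format` rather than Numerals, so the exponent is printed C-style.

```ruby
# Reinterpret a 32-bit pattern as an IEEE binary32 value (returned as a Ruby Float).
def binary32_from_bits(bits)
  [bits].pack('L<').unpack1('e')
end

min_value      = binary32_from_bits(0x00000001)        # smallest denormal, 2**-149
min_normalized = binary32_from_bits(0x00800000)        # smallest normalized value, 2**-126
max_value      = binary32_from_bits(0x7F7FFFFF)        # largest finite value
epsilon        = binary32_from_bits(0x3F800001) - 1.0  # gap between 1 and the next value

puts format('%.2e', min_normalized)  # -> 1.18e-38
puts format('%.2e', max_value)       # -> 3.40e+38
puts format('%.2e', epsilon)         # -> 1.19e-07
```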

Convert between formats

v = IEEE_EXTENDED.from_text('1.1')
v = v.convert_to(IEEE_SINGLE)
v = v.convert_to(IEEE_DEC64)

Tools for the native floating point format

This is an optional module to perform conversions and manipulate the native Float format.

require 'float-formats/native'
include Flt

puts float_shortest_dec(0.1)                     # -> 0.1
puts float_significant_dec(0.1)                  # -> 0.10000000000000001
puts float_dec(0.1)                              # -> 0.1000000000000000055511151231257827021181583404541015625
puts float_bin(0.1)                              # -> 1.100110011001100110011001100110011001100110011001101E-4
puts hex_from_float(0.1)                         # -> 0x1999999999999ap-56

puts float_significant_dec(Float::MIN_D)           # -> 5E-324
puts float_significant_dec(Float::MAX_D)           # -> 2.2250738585072009E-308
puts float_significant_dec(Float::MIN_N)           # -> 2.2250738585072014E-308

Together with flt/sugar (from Flt), it can be used to explore or work with Floats:

require 'flt/sugar'

puts 1.0.next_plus-1 == Float::EPSILON              # -> true
puts float_shortest_dec(1.0.next_plus)              # -> 1.0000000000000002
puts float_dec(1.0.next_minus)                      # -> 0.99999999999999988897769753748434595763683319091796875
puts float_dec(1.0.next_plus)                       # -> 1.0000000000000002220446049250313080847263336181640625
puts float_bin(1.0.next_plus)                       # -> 1.0000000000000000000000000000000000000000000000000001E0
puts float_bin(1.0.next_minus)                      # -> 1.1111111111111111111111111111111111111111111111111111E-1

puts float_significant_dec(Float::MIN_D.next_plus)  # -> 1.0E-323
puts float_significant_dec(Float::MAX_D.next_minus) # -> 2.2250738585072004E-308
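
The exact decimal expansions shown above can be reproduced with core Ruby, because every finite Float is a rational number whose denominator is a power of two. The sketch below is not the gem's implementation; `exact_decimal` is a hypothetical helper, and it uses the core `Float#next_float` in place of `next_plus` from flt/sugar.

```ruby
# Exact decimal expansion of a finite Float, computed with integer arithmetic.
# x == n / 2**k, so x == (n * 5**k) / 10**k: the digits of n * 5**k with the
# decimal point placed k positions from the right.
def exact_decimal(x)
  r = x.to_r                        # exact conversion, no rounding
  k = r.denominator.bit_length - 1  # denominator is 2**k
  return r.numerator.to_s if k.zero?

  digits = (r.numerator.abs * 5**k).to_s.rjust(k + 1, '0')
  sign = r.negative? ? '-' : ''
  "#{sign}#{digits[0...-k]}.#{digits[-k..-1]}"
end

puts exact_decimal(0.1)             # -> 0.1000000000000000055511151231257827021181583404541015625
puts exact_decimal(1.0.next_float)  # -> 1.0000000000000002220446049250313080847263336181640625
```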

Defining new formats

New formats are defined using one of the classes defined in float-formats/classes.rb and passing the necessary parameters in a hash to the constructor.

For example, here we define a 32-bit binary floating point format with 22 bits for the significand, 9 for the exponent and 1 for the sign (these fields are allocated from least to most significant bits). We'll use excess notation with bias 127 for the exponent, interpreting the significand bits as a fractional number with the radix point after the first bit, which will be hidden:

Flt.define(
  :MY_FP, BinaryFormat,
  fields: [:significand,22,:exponent,9,:sign,1],
  bias: 127, bias_mode: :scientific_significand,
  hidden_bit: true
)

Now we can encode values in this format, decode values, convert to other formats, query its range, etc.:

Numerals::Format[mode: :general, rounding: [precision: 3]]
puts MY_FP('0.1').to_bits_text(16)               # -> 1EE66666
puts MY_FP.max_value.to_text                     # -> 7.8804e115
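
To make the field layout concrete, the following sketch encodes a value into the same 32-bit layout by hand, using core Ruby only. It is an illustration of the layout described above, not the gem's encoder; `encode_my_fp` is a hypothetical helper that assumes a nonzero value in the normalized range and ignores the case where rounding carries into the exponent.

```ruby
# Hand-encode a Float in the layout: 22-bit significand (hidden leading 1),
# 9-bit exponent with bias 127, 1 sign bit, from least to most significant.
def encode_my_fp(x)
  frac, exp = Math.frexp(x.abs)                 # x.abs == frac * 2**exp, 0.5 <= frac < 1
  significand = (frac * 2 * 2**22).round - 2**22 # scale to 1.fff form, drop the hidden bit
  exponent = exp - 1 + 127                      # exponent of the 1.fff form, plus bias
  sign = x < 0 ? 1 : 0
  (sign << 31) | (exponent << 22) | significand
end

puts format('%08X', encode_my_fp(0.1))   # -> 1EE66666

# Largest value: all exponent and significand bits set.
puts (2 - 2.0**-22) * 2.0**(511 - 127)
```

The last line evaluates to approximately 7.88e115, in agreement with the maximum value reported for the format.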

You can look at float-formats/formats.rb to see how the built-in formats are defined.

License

This code is free to use under the terms of the MIT license.

References

Floating Point Representations. C.B. Silio. Description of formats used in UNIVAC 1100, CDC 6600/7600, PDP-11, IEEE754, IBM360/370

Floating-Point Formats. John Savard. Description of formats used in VAX and PDP-11

IEEE754 binary formats

IEEE-754 References. Christopher Vickery.

What Every Computer Scientist Should Know About Floating-Point Arithmetic. David Goldberg.

DPD/IEEE754r decimal formats

Decimal Arithmetic Encoding. Strawman 4d. Mike Cowlishaw.

A Summary of Densely Packed Decimal encoding. Mike Cowlishaw.

Packed Decimal Encoding IEEE-754-r. J.H.M. Bonten.

DRAFT Standard for Floating-Point Arithmetic P754. IEEE.

HP 10 digits calculators

HP CPU and Programming. David G. Hicks. Description of calculator CPUs from the Museum of HP Calculators.

HP 35 ROM step by step. Jacques Laporte. Description of HP35 registers.

Scientific Pocket Calculator Extends Range of Built-In Functions. Eric A. Evett, Paul J. McClellan, Joseph P. Tanzini. Hewlett Packard Journal 1983-05 pgs 27-28. Describes format used in HP-15C.

HP 12 digits calculators

Software Internal Design Specification Volume I For the HP-71. Hewlett Packard. Available from http://www.hpmuseum.org/cd/cddesc.htm

RPL PROGRAMMING GUIDE Excerpted from RPL: A Mathematical Control Language. by W. C. Wickes. Available at http://www.hpcalc.org/details.php?id=1743

HP-3000

A Pocket Calculator for Computer Science Professionals. Eric A. Evett. Hewlett Packard Journal 1983-05 pg 37. Describes format used in HP-3000

IBM

IBM Floating Point Architecture. Wikipedia.

The IBM eServer z990 floating-point unit. G. Gerwig, H. Wetter, E. M. Schwarz, J. Haess, C. A. Krygowski, B. M. Fleischer and M. Kroener.

MBF

Microsoft Knowledge Base Article 35826

Microsoft MBF2IEEE library

Borland

An Overview of Floating Point Numbers. Borland Developer Support Staff

Pascal Floating-Point Page. J R Stockton.

8-bit micros

This is the MS Basic format (BASIC09 for TRS-80 Color Computer, Dragon), also used in the Sinclair Spectrum.

Numbers are followed by information not in listings. Sinclair User, October 1983. http://www.sincuser.f9.co.uk/019/helplne.htm

Sinclair ZX Spectrum / Basic Programming. Steven Vickers. Chapter 24. http://www.worldofspectrum.org/ZXBasicManual/zxmanchap24.html

Apple II

Floating Point Routines for the 6502. Roy Rankin and Steve Wozniak. Dr. Dobb's Journal, August 1976, pages 17-19.

C51

Advanced Development System. Franklin Software, Inc.

CDC6600

CONTROL DATA 6400/6500/6600 COMPUTER SYSTEMS Reference Manual. Manuals available at http://bitsavers.org/

Cray

CRAY-1 COMPUTER SYSTEM Hardware Reference Manual. See pg 3-20 from 2240004 or pg 4-30 from HR-0808 or pg 4-21 from HP-0032. Manuals available at http://bitsavers.org/

Wang 2200

Internal Floating Point Representation