DHeap
A fast dary heap priority queue implementation for ruby, implemented as a C extension.
With a regular queue, you expect "FIFO" behavior: first in, first out. With a stack you expect "LIFO": last in first out. With a priority queue, you push elements along with a score and the lowest scored element is the first to be popped. Priority queues are often used in algorithms for e.g. scheduling of timers or bandwidth management, Huffman coding, and various graph search algorithms such as Dijkstra's algorithm, A* search, or Prim's algorithm.
The dary heap data structure is a generalization of the binary heap, in
which the nodes have d children instead of 2. This allows for "decrease
priority" operations to be performed more quickly with the tradeoff of slower
delete minimum. Additionally, dary heaps can have better memory cache
behavior than binary heaps, allowing them to run more quickly in practice
despite slower worstcase time complexity. In the worst case, a dary heap
requires only O(log n / log d)
to push, with the tradeoff that pop is O(d log n / log d)
.
Although you should probably just use the default d value of 4
(see the
analysis below), it's always advisable to benchmark your specific usecase.
Usage
The basic API is:

heap << object
adds a value as its own score. 
heap.push(score, value)
adds a value with an extrinsic score. 
heap.pop
removes and returns the value with the minimum score. 
heap.pop_lte(score)
pops if the minimum score is<=
the provided score. 
heap.peek
to view the minimum value without popping it.
The score must be Integer
or Float
or convertable to a Float
via
Float(score)
(i.e. it should implement #to_f
). Constraining scores to
numeric values gives a 40+% speedup under some benchmarks!
n.b. Integer
scores must have an absolute value that fits into unsigned long long
. This is architecture dependant but on an IA64 system this is 64
bits, which gives a range of 18,446,744,073,709,551,615 to
+18446744073709551615.
Comparing arbitary objects via a <=> b
was the original design and may
be added back in a future version, if (and only if) it can be done without
impacting the speed of numeric comparisons.
require "d_heap"
Task = Struct.new(:id) # for demonstration
heap = DHeap.new # defaults to a 4heap
# storing [score, value] tuples
heap.push Time.now + 5*60, Task.new(1)
heap.push Time.now + 30, Task.new(2)
heap.push Time.now + 60, Task.new(3)
heap.push Time.now + 5, Task.new(4)
# peeking and popping (using last to get the task and ignore the time)
heap.pop # => Task[4]
heap.pop # => Task[2]
heap.peek # => Task[3], but don't pop it from the heap
heap.pop # => Task[3]
heap.pop # => Task[1]
heap.empty? # => true
heap.pop # => nil
If your values behave as their own score, by being convertible via
Float(value)
, then you can use #<<
for implicit scoring. The score should
not change for as long as the value remains in the heap, since it will not be
reevaluated after being pushed.
heap.clear
# The score can be derived from the value by using to_f.
# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
class Event
include Comparable
attr_reader :time, :payload
alias_method :to_time, :time
def initialize(time, payload)
@time = time.to_time
@payload = payload
freeze
end
def to_f
time.to_f
end
def <=>(other)
to_f <=> other.to_f
end
end
heap << comparable_max # sorts last, using <=>
heap << comparable_min # sorts first, using <=>
heap << comparable_mid # sorts in the middle, using <=>
heap.pop # => comparable_min
heap.pop # => comparable_mid
heap.pop # => comparable_max
heap.empty? # => true
heap.pop # => nil
You can also pass a value into #pop(max)
which will only pop if the minimum
score is less than or equal to max
.
Read the API documentation for more detailed documentation and examples.
Installation
Add this line to your application's Gemfile:
gem 'd_heap'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install d_heap
Motivation
One naive approach to a priority queue is to maintain an array in sorted order.
This can be very simply implemented using Array#bseach_index
+ Array#insert
.
This can be very fast—Array#pop
is O(1)
—but the worstcase for insert is
O(n)
because it may need to memcpy
a significant portion of the array.
The standard way to implement a priority queue is with a binary heap. Although
this increases the time for pop
, it converts the amortized time per push + pop
from O(n)
to O(d log n / log d)
.
However, I was surprised to find that—at least for some benchmarks—my pure ruby
heap implementation was much slower than inserting into and popping from a fully
sorted array. The reason for this surprising result: Although it is O(n)
,
memcpy
has a very small constant factor, and calling <=>
from ruby code
has relatively much larger constant factors. If your queue contains only a
few thousand items, the overhead of those extra calls to <=>
is far more
than occasionally calling memcpy
. In the worst case, a dheap will require
d + 1
times more comparisons for each push + pop than a bsearch
+ insert
sorted array.
Moving the siftup and siftdown code into C helps some. But much more helpful
is optimizing the comparison of numeric scores, so a <=> b
never needs to be
called. I'm hopeful that MJIT will eventually obsolete this Cextension. JRuby
or TruffleRuby may already run the pure ruby version at high speed. This can be
hotspot code, and the basic ruby implementation should perform well if not for
the high overhead of <=>
.
Analysis
Time complexity
There are two fundamental heap operations: siftup (used by push) and siftdown (used by pop).
 Both sift operations can perform as many as
log n / log d
swaps, as the element may sift from the bottom of the tree to the top, or vice versa.  Siftup performs a single comparison per swap:
O(1)
. So pushing a new element isO(log n / log d)
.  Swap down performs as many as d comparions per swap:
O(d)
. So popping the min element isO(d log n / log d)
.
Assuming every inserted element is eventually deleted from the root, d=4 requires the fewest comparisons for combined insert and delete:
 (1 + 2) lg 2 = 4.328085
 (1 + 3) lg 3 = 3.640957
 (1 + 4) lg 4 = 3.606738
 (1 + 5) lg 5 = 3.728010
 (1 + 6) lg 6 = 3.906774
 etc...
Leaf nodes require no comparisons to shift down, and higher values for d have higher percentage of leaf nodes:
 d=2 has ~50% leaves,
 d=3 has ~67% leaves,
 d=4 has ~75% leaves,
 and so on...
See https://en.wikipedia.org/wiki/Dary_heap#Analysis for deeper analysis.
Space complexity
Space usage is linear, regardless of d. However higher d values may provide better cache locality. Because the heap is a complete binary tree, the elements can be stored in an array, without the need for tree or list pointers.
Ruby can compare Numeric values much faster than other ruby objects, even if
those objects simply delegate comparison to internal Numeric values. And it is
often useful to use external scores for otherwise uncomparable values. So
DHeap
uses twice as many entries (one for score and one for value)
as an array which only stores values.
Benchmarks
See bin/benchmarks
and docs/benchmarks.txt
, as well as bin/profile
and
docs/profile.txt
for more details or updated results. These benchmarks were
measured with v0.4.0 and ruby 2.7.2 without MJIT enabled.
These benchmarks use very simple implementations for a pureruby heap and an
array that is kept sorted using Array#bsearch_index
and Array#insert
. For
comparison, an alternate implementation Array#min
and Array#delete_at
is
also shown.
Three different scenarios are measured:
 push N values but never pop (clearing between each set of pushes).
 push N values and then pop N values. Although this could be used for heap sort, we're unlikely to choose heap sort over Ruby's quick sort implementation. I'm using this scenario to represent the amortized cost of creating a heap and (eventually) draining it.
 For a heap of size N, repeatedly push and pop while keeping a stable size. This is a very simple approximation for how most scheduler/timer heaps would be used. Usually when a timer fires it will be quickly replaced by a new timer, and the overall count of timers will remain roughly stable.
In these benchmarks, DHeap
runs faster than all other implementations for
every scenario and every value of N, although the difference is much more
noticable at higher values of N. The pure ruby heap implementation is
competitive for push
alone at every value of N, but is significantly slower
than bsearch + insert for push + pop until N is very large (somewhere between
10k and 100k)!
For very small N values the benchmark implementations, DHeap
runs faster than
the other implementations for each scenario, although the difference is still
relatively small. The pure ruby binary heap is 2x or more slower than bsearch +
insert for common common push/pop scenario.
== push N (N=5) ==========================================================
push N (c_dheap): 1701338.1 i/s
push N (rb_heap): 971614.1 i/s  1.75x slower
push N (bsearch): 946363.7 i/s  1.80x slower
== push N then pop N (N=5) ===============================================
push N + pop N (c_dheap): 1087944.8 i/s
push N + pop N (findmin): 841708.1 i/s  1.29x slower
push N + pop N (bsearch): 773252.7 i/s  1.41x slower
push N + pop N (rb_heap): 471852.9 i/s  2.31x slower
== Push/pop with prefilled queue of size=N (N=5) ========================
push + pop (c_dheap): 5525418.8 i/s
push + pop (findmin): 5003904.8 i/s  1.10x slower
push + pop (bsearch): 4320581.8 i/s  1.28x slower
push + pop (rb_heap): 2207042.0 i/s  2.50x slower
By N=21, DHeap
has pulled significantly ahead of bsearch + insert for all
scenarios, but the pure ruby heap is still slower than every other
implementation—even resorting the array after every #push
—in any scenario that
uses #pop
.
== push N (N=21) =========================================================
push N (c_dheap): 408307.0 i/s
push N (rb_heap): 212275.2 i/s  1.92x slower
push N (bsearch): 169583.2 i/s  2.41x slower
== push N then pop N (N=21) ==============================================
push N + pop N (c_dheap): 199435.5 i/s
push N + pop N (findmin): 162024.5 i/s  1.23x slower
push N + pop N (bsearch): 146284.3 i/s  1.36x slower
push N + pop N (rb_heap): 72289.0 i/s  2.76x slower
== Push/pop with prefilled queue of size=N (N=21) =======================
push + pop (c_dheap): 4836860.0 i/s
push + pop (findmin): 4467453.9 i/s  1.08x slower
push + pop (bsearch): 3345458.4 i/s  1.45x slower
push + pop (rb_heap): 1560476.3 i/s  3.10x slower
At higher values of N, DHeap
's logarithmic growth leads to little slowdown
of DHeap#push
, while insert's linear growth causes it to run slower and
slower. But because #pop
is O(1) for a sorted array and O(d log n / log d)
for a dheap, scenarios involving #pop
remain relatively close even as N
increases to 5k:
== Push/pop with prefilled queue of size=N (N=5461) ==============
push + pop (c_dheap): 2718225.1 i/s
push + pop (bsearch): 1793546.4 i/s  1.52x slower
push + pop (rb_heap): 707139.9 i/s  3.84x slower
push + pop (findmin): 122316.0 i/s  22.22x slower
Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap for the repeated push/pop scenario, all the way up to N as high as 87k:
== push N (N=87381) ======================================================
push N (c_dheap): 92.8 i/s
push N (rb_heap): 43.5 i/s  2.13x slower
push N (bsearch): 2.9 i/s  31.70x slower
== push N then pop N (N=87381) ===========================================
push N + pop N (c_dheap): 22.6 i/s
push N + pop N (rb_heap): 5.5 i/s  4.08x slower
push N + pop N (bsearch): 2.9 i/s  7.90x slower
== Push/pop with prefilled queue of size=N (N=87381) ====================
push + pop (c_dheap): 1815277.3 i/s
push + pop (bsearch): 762343.2 i/s  2.38x slower
push + pop (rb_heap): 535913.6 i/s  3.39x slower
push + pop (findmin): 2262.8 i/s  802.24x slower
Profiling
n.b. Array#fetch
is reading the input data, external to heap operations.
These benchmarks use integers for all scores, which enables significantly faster
comparisons. If a <=> b
were used instead, then the difference between push
and pop would be much larger. And ruby's Tracepoint
impacts these different
implementations differently. So we can't use these profiler results for
comparisons between implementations. A sampling profiler would be needed for
more accurate relative measurements.
It's informative to look at the rubyprof
results for a simple binary search +
insert implementation, repeatedly pushing and popping to a large heap. In
particular, even with 1000 members, the linear Array#insert
is still faster
than the logarithmic Array#bsearch_index
. At this scale, ruby comparisons are
still (relatively) slow and memcpy
is (relatively) quite fast!
%self total self wait child calls name location
34.79 2.222 2.222 0.000 0.000 1000000 Array#insert
32.59 2.081 2.081 0.000 0.000 1000000 Array#bsearch_index
12.84 6.386 0.820 0.000 5.566 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
10.38 4.966 0.663 0.000 4.303 1000000 DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
5.38 0.468 0.343 0.000 0.125 1000000 DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
2.06 0.132 0.132 0.000 0.000 1000000 Array#fetch
1.95 0.125 0.125 0.000 0.000 1000000 Array#pop
Contrast this with a simplistic pureruby implementation of a binary heap:
%self total self wait child calls name location
48.52 8.487 8.118 0.000 0.369 1000000 DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
42.94 7.310 7.184 0.000 0.126 1000000 DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
4.80 16.732 0.803 0.000 15.929 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
You can see that it spends almost more time in pop than it does in push. That is expected behavior for a heap: although both are O(log n), pop is significantly more complex, and has d comparisons per layer.
And DHeap
shows a similar comparison between push and pop, although it spends
half of its time in the benchmark code (which is written in ruby):
%self total self wait child calls name location
43.09 1.685 0.726 0.000 0.959 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
26.05 0.439 0.439 0.000 0.000 1000000 DHeap#<<
23.57 0.397 0.397 0.000 0.000 1000000 DHeap#pop
7.29 0.123 0.123 0.000 0.000 1000000 Array#fetch
Timers
Additionally, when used to sort timers, we can reasonably assume that:
 New timers usually sort after most existing timers.
 Most timers will be canceled before executing.
 Canceled timers usually sort after most existing timers.
So, if we are able to delete an item without searching for it, by keeping a map of positions within the heap, most timers can be inserted and deleted in O(1) time. Canceling a nonleaf timer can be further optimized by marking it as canceled without immediately removing it from the heap. If the timer is rescheduled before we garbage collect, adjusting its position will usually be faster than a delete and reinsert.
Alternative data structures
As always, you should run benchmarks with your expected scenarios to determine which is right.
Depending on what you're doing, maintaining a sorted Array
using
#bsearch_index
and #insert
might be just fine! As discussed above, although
it is O(n)
for insertions, memcpy
is so fast on modern hardware that this
may not matter. Also, if you can arrange for insertions to occur near the end
of the array, that could significantly reduce the memcpy
overhead even more.
More complex heap varients, e.g. Fibonacci heap, can allow heaps to be merged as well as lower amortized time.
If it is important to be able to quickly enumerate the set or find the ranking of values in it, then you may want to use a selfbalancing binary search tree (e.g. a redblack tree) or a skiplist.
Hashed and Heirarchical Timing Wheels (or some variant in that
family of data structures) can be constructed to have effectively O(1)
running
time in most cases. Although the implementation for that data structure is more
complex than a heap, it may be necessary for enormous values of N.
TODOs...
TODO: Also included is will include DHeap::Set
, which augments the
basic heap with an internal Hash
, which maps a set of values to scores.
loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
and adjustments, to achieve faster average runtime for add and cancel
operations.
TODO: Also included is will include DHeap::Lazy
, which contains some
features that are loosely inspired by go's timers. e.g: It lazily sifts its
heap after deletion and adjustments, to achieve faster average runtime for add
and cancel operations.
Additionally, I was inspired by reading go's "timer.go" implementation to experiment with a 4ary heap instead of the traditional binary heap. In the case of timers, new timers are usually scheduled to run after most of the existing timers. And timers are usually canceled before they have a chance to run. While a binary heap holds 50% of its elements in its last layer, 75% of a 4ary heap will have no children. That diminishes the extra comparison overhead during siftdown.
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run
rake spec
to run the tests. You can also run bin/console
for an interactive
prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To
release a new version, update the version number in version.rb
, and then run
bundle exec rake release
, which will create a git tag for the version, push
git commits and tags, and push the .gem
file to
rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/nevans/d_heap. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
License
The gem is available as open source under the terms of the MIT License.
Code of Conduct
Everyone interacting in the DHeap project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.