The project is in a healthy, maintained state
Memory-efficient scalable Bloom Filter that grows dynamically. No upfront capacity needed. 20-50x less memory than Set, perfect for Rails apps.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
 Dependencies

Development

~> 2.0
~> 5.0
~> 13.0
 Project Readme

FastBloomFilter v2 πŸš€

Gem Version

A scalable Bloom Filter implementation in C for Ruby. Grows automatically without requiring upfront capacity! Perfect for Rails applications that need memory-efficient set membership testing with unknown dataset sizes.

What's New in v2? πŸŽ‰

  • πŸ”„ Scalable Architecture: No need to guess capacity upfront - the filter grows automatically
  • πŸ“Š Multi-Layer System: Adds new layers dynamically as data grows
  • 🎯 Smart Growth: Growth factor adapts (2x β†’ 1.75x β†’ 1.5x β†’ 1.25x)
  • πŸ’‘ Simpler API: Just specify error rate, not capacity
  • πŸ“ˆ Better for Unknown Sizes: Perfect when you don't know how much data you'll have

Based on "Scalable Bloom Filters" (Almeida et al., 2007)

Features

  • πŸš€ Fast: C implementation with MurmurHash3
  • πŸ’Ύ Memory Efficient: 20-50x less memory than Ruby Set
  • πŸ”„ Auto-Scaling: Grows dynamically as you add elements
  • 🎯 Configurable: Adjustable false positive rate per layer
  • πŸ“Š Statistics: Detailed per-layer performance monitoring
  • βœ… Well-Tested: Comprehensive test suite

Installation

Add this line to your application's Gemfile:

gem 'fast_bloom_filter'

And then execute:

bundle install

Or install it yourself as:

gem install fast_bloom_filter

Usage

Basic Operations (v2 API)

require 'fast_bloom_filter'

# Create a scalable filter - NO CAPACITY NEEDED!
# Just specify your desired error rate
bloom = FastBloomFilter::Filter.new(error_rate: 0.01)

# Or with an initial capacity hint (optional)
bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)

# Add items - filter grows automatically
bloom.add("user@example.com")
bloom << "another@example.com"  # alias for add

# Check membership
bloom.include?("user@example.com")  # => true
bloom.include?("notfound@test.com") # => false (probably)

# Add thousands or millions - it scales!
100_000.times { |i| bloom.add("user#{i}@test.com") }

# Check stats
bloom.count        # => 100002
bloom.num_layers   # => 8 (grew automatically!)

# Batch operations
emails = ["user1@test.com", "user2@test.com", "user3@test.com"]
bloom.add_all(emails)

# Count possible matches
bloom.count_possible_matches(["user1@test.com", "unknown@test.com"])  # => 1 or 2

# Clear all items
bloom.clear

Helper Methods

# For email deduplication (0.1% false positive rate)
bloom = FastBloomFilter.for_emails(error_rate: 0.001)

# For URL tracking (1% false positive rate)
bloom = FastBloomFilter.for_urls(error_rate: 0.01)

# With initial capacity hint
bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)

Merge Filters

bloom1 = FastBloomFilter::Filter.new(error_rate: 0.01)
bloom2 = FastBloomFilter::Filter.new(error_rate: 0.01)

bloom1.add("item1")
bloom2.add("item2")

bloom1.merge!(bloom2)  # bloom1 now contains both items
# Merges all layers from bloom2 into bloom1

Statistics

bloom = FastBloomFilter::Filter.new(error_rate: 0.01)
1000.times { |i| bloom.add("item#{i}") }

stats = bloom.stats

# => {
#   total_count: 1000,
#   num_layers: 2,
#   total_bytes: 2500,
#   total_bits: 20000,
#   total_bits_set: 6543,
#   fill_ratio: 0.32715,
#   error_rate: 0.01,
#   layers: [
#     {
#       layer: 0,
#       capacity: 1024,
#       count: 1024,
#       size_bytes: 1229,
#       num_hashes: 7,
#       bits_set: 5234,
#       total_bits: 9832,
#       fill_ratio: 0.532,
#       error_rate: 0.0015
#     },
#     # ... more layers
#   ]
# }

puts bloom.inspect
# => #<FastBloomFilter::Filter v2 layers=2 count=1000 size=2.44KB fill=32.72%>

How Scalable Bloom Filters Work

Traditional Bloom Filters require you to specify capacity upfront. Scalable Bloom Filters solve this by:

  1. Starting Small: Begin with a small initial capacity (default: 1024 elements)
  2. Adding Layers: When a layer fills up, add a new layer with larger capacity
  3. Tightening Error Rates: Each new layer has a tighter error rate to maintain overall FPR
  4. Smart Growth: Growth factor decreases over time (2x β†’ 1.75x β†’ 1.5x β†’ 1.25x)

Error Rate Distribution

Each layer i gets error rate: total_error_rate Γ— (1 - r) Γ— r^i

Where r is the tightening factor (default: 0.85). This ensures the sum of all layer error rates converges to your target error rate.

Example Growth Pattern

Layer 0: capacity=1,024   error_rate=0.0015  (initial)
Layer 1: capacity=2,048   error_rate=0.0013  (2x growth)
Layer 2: capacity=3,584   error_rate=0.0011  (1.75x growth)
Layer 3: capacity=5,376   error_rate=0.0009  (1.5x growth)
Layer 4: capacity=6,720   error_rate=0.0008  (1.25x growth)
...

Performance

Benchmarks on MacBook Pro M1 (100K elements):

Operation Bloom Filter v2 Ruby Set Speedup
Add 48ms 120ms 2.5x
Check 9ms 15ms 1.7x
Memory 145KB 2000KB 13.8x

Run benchmarks yourself:

ruby demo.rb

Use Cases

Rails: Prevent Duplicate Email Signups (No Capacity Guessing!)

class User < ApplicationRecord
  # No need to guess how many users you'll have!
  SIGNUP_BLOOM = FastBloomFilter.for_emails(error_rate: 0.001)
  
  before_validation :check_duplicate_signup
  
  private
  
  def check_duplicate_signup
    if SIGNUP_BLOOM.include?(email)
      errors.add(:email, "may already be registered")
      return false
    end
    SIGNUP_BLOOM.add(email)
  end
end

Track Visited URLs (Scales to Millions)

class WebCrawler
  def initialize
    # Starts small, grows as needed
    @visited = FastBloomFilter.for_urls(error_rate: 0.01)
  end
  
  def crawl(url)
    return if @visited.include?(url)
    
    @visited.add(url)
    # ... crawl logic
    
    # Check growth
    if @visited.count % 10_000 == 0
      puts "Crawled #{@visited.count} URLs, #{@visited.num_layers} layers"
    end
  end
end

Cache Key Deduplication

class CacheWarmer
  def initialize
    @warmed = FastBloomFilter::Filter.new(error_rate: 0.001)
  end
  
  def warm(key)
    return if @warmed.include?(key)
    
    Rails.cache.fetch(key) { expensive_operation(key) }
    @warmed.add(key)
  end
end

Migration from v1.x

v1.x (Fixed Capacity):

bloom = FastBloomFilter::Filter.new(10_000, 0.01)
bloom = FastBloomFilter.for_emails(100_000)

v2.x (Scalable):

# Recommended: Let it scale automatically
bloom = FastBloomFilter::Filter.new(error_rate: 0.01)

# Or with initial capacity hint
bloom = FastBloomFilter::Filter.new(error_rate: 0.01, initial_capacity: 1000)

# Helper methods also changed
bloom = FastBloomFilter.for_emails(error_rate: 0.001, initial_capacity: 10_000)

Development

# Clone the repository
git clone https://github.com/roman-haidarov/fast_bloom_filter.git
cd fast_bloom_filter

# Install dependencies
bundle install

# Compile the C extension
bundle exec rake compile

# Run tests
bundle exec rake test

# Build the gem
gem build fast_bloom_filter.gemspec

# Install locally
gem install ./fast_bloom_filter-2.0.0.gem

Quick Build Script

./build.sh

Requirements

  • Ruby >= 2.7.0
  • C compiler (gcc, clang, etc.)
  • Make

Technical Details

  • Hash Function: MurmurHash3 (32-bit)
  • Bit Array: Dynamic allocation per layer
  • Growth Strategy: Adaptive (2x β†’ 1.75x β†’ 1.5x β†’ 1.25x)
  • Tightening Factor: 0.85 (configurable)
  • Memory Management: Ruby GC integration with proper cleanup
  • Thread Safety: Safe for concurrent reads (writes need external synchronization)

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b feature/my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin feature/my-new-feature)
  5. Create a new Pull Request

License

The gem is available as open source under the terms of the MIT License.

Credits

  • Scalable Bloom Filters algorithm: Almeida, Baquero, PreguiΓ§a, Hutchison (2007)
  • MurmurHash3 implementation: Austin Appleby
  • Original Bloom Filter: Burton Howard Bloom (1970)

Support

Changelog

See CHANGELOG.md for version history.