Sidekiq Assured Jobs
Reliable job execution guarantee for Sidekiq with automatic orphan recovery.
Overview
Sidekiq Assured Jobs ensures that your critical Sidekiq jobs are never lost due to worker crashes, pod restarts, or unexpected shutdowns. It provides a robust tracking system that monitors in-flight jobs and automatically recovers any work that was interrupted.
Perfect for:
- Critical business processes that cannot be lost
- Financial transactions and payment processing
- Data synchronization and ETL operations
- Email delivery and notification systems
- Any job where reliability is paramount
Key Features
- 🛡️ Job Assurance: Guarantees that tracked jobs will complete or be automatically retried
- 🔄 Automatic Recovery: Detects and re-enqueues orphaned jobs from crashed workers
- ⏰ Delayed Recovery: Configurable additional recovery passes for enhanced reliability
- 🖥️ Web Dashboard: Monitor and manage orphaned jobs through Sidekiq's web interface
- ⚡ Zero Configuration: Works out of the box with sensible defaults
- 🏗️ Production Ready: Designed for high-throughput production environments
- 🔗 Sidekiq Integration: Uses Sidekiq's existing Redis connection pool
- 🔒 Distributed Locking: Prevents duplicate recovery operations
- 📊 Minimal Overhead: Lightweight tracking with configurable heartbeat intervals
The Problem
When Sidekiq workers crash or are forcefully terminated (SIGKILL), jobs that were being processed are lost forever:
sequenceDiagram
    participant Client
    participant Queue as Sidekiq Queue
    participant Worker as Worker Process
    participant Redis
    Client->>Queue: Enqueue Critical Job
    Queue->>Worker: Fetch Job
    Worker->>Redis: Job starts processing
    Note over Worker: Worker crashes (SIGKILL)
    Worker--xRedis: Job lost forever
    Note over Queue: No retry, no error handling
    Note over Client: Critical work never completed
The Solution
Sidekiq Assured Jobs tracks in-flight jobs and automatically recovers them:
graph TB
    subgraph Cluster["Production Environment"]
        subgraph W1["Worker Instance 1"]
            SW1[Sidekiq Worker]
            HB1[Heartbeat]
            MW1[Tracking Middleware]
        end
        subgraph W2["Worker Instance 2"]
            SW2[Sidekiq Worker]
            HB2[Heartbeat]
            MW2[Tracking Middleware]
        end
    end
    subgraph Redis["Redis Storage"]
        HK["Heartbeats<br/>instance:worker-1<br/>instance:worker-2"]
        JT["Job Tracking<br/>jobs:worker-1<br/>jobs:worker-2"]
        JP["Job Payloads<br/>job:abc123<br/>job:def456"]
        RL["Recovery Lock"]
    end
    Queue[Sidekiq Queue]
    HB1 -->|Every 15s| HK
    HB2 -->|Every 15s| HK
    MW1 -->|Track Start/End| JT
    MW1 -->|Store Payload| JP
    MW2 -->|Track Start/End| JT
    MW2 -->|Store Payload| JP
    SW2 -->|On Startup| RL
    SW2 -->|Detect Orphans| JT
    SW2 -->|Re-enqueue| Queue
    style HK fill:#e8f5e8
    style JT fill:#fff3e0
    style JP fill:#e3f2fd
    style RL fill:#ffebee
Installation
Add this line to your application's Gemfile:
gem 'sidekiq-assured-jobs'
And then execute:
bundle install
Quick Start
1. Basic Setup
The gem auto-configures itself when required:
# In your application (e.g., config/application.rb or config/initializers/sidekiq.rb)
require 'sidekiq-assured-jobs'
2. Enable Job Tracking
Include the AssuredJobs::Worker module in workers you want to track:
class PaymentProcessor
  include Sidekiq::Worker
  include Sidekiq::AssuredJobs::Worker # Enables job assurance

  def perform(payment_id, amount)
    # This job will be tracked and recovered if the worker crashes
    process_payment(payment_id, amount)
  end
end

class LogCleanupWorker
  include Sidekiq::Worker
  # No AssuredJobs::Worker - not tracked (fine for non-critical work)

  def perform
    # This job won't be tracked
    cleanup_old_logs
  end
end
3. That's It!
Your critical jobs are now protected. If a worker crashes while processing a tracked job, another worker will automatically detect and re-enqueue it.
Web Interface
Sidekiq Assured Jobs includes a web dashboard that integrates seamlessly with Sidekiq's existing web interface. The dashboard allows you to monitor and manage orphaned jobs in real-time.
Setup
The web interface is automatically available when you mount Sidekiq::Web in your application:
# config/routes.rb (Rails)
require 'sidekiq/web'
mount Sidekiq::Web => '/sidekiq'
Or for standalone applications:
# config.ru
require 'sidekiq/web'
run Sidekiq::Web
Features
The Orphaned Jobs tab provides:
- 📊 Real-time Dashboard: View all orphaned jobs with key information
- 🔍 Job Details: Detailed view of individual orphaned jobs including arguments and error information
- 🔄 Manual Recovery: Retry orphaned jobs individually or in bulk
- 🗑️ Job Management: Delete orphaned jobs that are no longer needed
- 📈 Instance Monitoring: Track the status of worker instances (alive/dead)
- ⏱️ Auto-refresh: Dashboard automatically updates every 30 seconds
- 🎯 Bulk Operations: Select multiple jobs for batch retry or delete operations
Dashboard Information
For each orphaned job, the dashboard displays:
- Job ID: Unique identifier for the job
- Class: The worker class name
- Queue: The queue the job was running in
- Instance: The worker instance that was processing the job
- Orphaned Time: When the job became orphaned
- Duration: How long the job has been orphaned
- Arguments: The job's input parameters
- Error Information: Any error details if the job failed
Actions Available
- Retry: Re-enqueue the job for processing
- Delete: Remove the job from tracking (cannot be undone)
- Bulk Retry: Retry multiple selected jobs at once
- Bulk Delete: Delete multiple selected jobs at once
The web interface provides a user-friendly way to monitor your job reliability and take action when needed, complementing the automatic recovery system.
Demo
To see the web interface in action, run the included demo:
ruby examples/web_demo.rb
Then visit http://localhost:4567/orphaned-jobs to explore the dashboard with sample orphaned jobs.
Configuration
The gem works with zero configuration but provides extensive customization options. See the Complete Configuration Reference below for all available options.
Complete Configuration Reference
Core Configuration Options
| Option | Environment Variable | Default | Description |
|---|---|---|---|
| instance_id | ASSURED_JOBS_INSTANCE_ID | Auto-generated | Unique identifier for this worker instance |
| namespace | ASSURED_JOBS_NS | sidekiq_assured_jobs | Redis namespace for all keys |
| heartbeat_interval | ASSURED_JOBS_HEARTBEAT_INTERVAL | 15 | Seconds between heartbeat updates |
| heartbeat_ttl | ASSURED_JOBS_HEARTBEAT_TTL | 45 | Seconds before instance considered dead |
| recovery_lock_ttl | ASSURED_JOBS_RECOVERY_LOCK_TTL | 300 | Seconds to hold recovery lock |
| delayed_recovery_interval | ASSURED_JOBS_DELAYED_RECOVERY_INTERVAL | 300 | Seconds between delayed recovery passes |
| delayed_recovery_count | ASSURED_JOBS_DELAYED_RECOVERY_COUNT | 1 | Number of delayed recovery passes to run |
Configuration Methods
Environment Variables (Recommended for Production)
export ASSURED_JOBS_INSTANCE_ID="worker-pod-1"
export ASSURED_JOBS_NS="myapp_assured_jobs"
export ASSURED_JOBS_HEARTBEAT_INTERVAL="30"
export ASSURED_JOBS_HEARTBEAT_TTL="90"
export ASSURED_JOBS_RECOVERY_LOCK_TTL="600"
export ASSURED_JOBS_DELAYED_RECOVERY_INTERVAL="600"
export ASSURED_JOBS_DELAYED_RECOVERY_COUNT="2"
Programmatic Configuration
Sidekiq::AssuredJobs.configure do |config|
  config.namespace = "myapp_assured_jobs"
  config.heartbeat_interval = 30
  config.heartbeat_ttl = 90
  config.recovery_lock_ttl = 600
  config.delayed_recovery_interval = 600
  config.delayed_recovery_count = 2

  # Advanced: Custom Redis configuration
  config.redis_options = {
    url: ENV['ASSURED_JOBS_REDIS_URL'],
    db: 2,
    timeout: 5
  }
end
Configuration Guidelines
Heartbeat Settings
- heartbeat_interval: How often workers send "I'm alive" signals
  - Lower values = faster orphan detection, higher Redis load
  - Recommended: 15-30 seconds for production
- heartbeat_ttl: How long to wait before considering an instance dead
  - Should be 2-3x the heartbeat interval
  - Accounts for network delays and Redis latency
Recovery Settings
- recovery_lock_ttl: How long one instance holds the recovery lock
  - Prevents multiple instances from recovering the same jobs
  - Should be longer than the expected recovery time
- delayed_recovery_interval: Time between additional recovery passes
  - Provides a safety net for missed orphans
  - Recommended: 5-10 minutes for most applications
- delayed_recovery_count: Number of additional recovery attempts
  - Balance between reliability and resource usage
  - Recommended: 1-3 passes for most applications
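Applying the 2-3x guideline above, the TTL can be derived from the interval instead of being set independently. A minimal sketch using the configure block and the ASSURED_JOBS_HEARTBEAT_INTERVAL variable documented earlier; the 3x multiplier is simply the recommendation applied, not a requirement of the gem:
Sidekiq::AssuredJobs.configure do |config|
  # Read the interval from the documented env var, falling back to the default of 15s
  interval = Integer(ENV.fetch("ASSURED_JOBS_HEARTBEAT_INTERVAL", 15))

  config.heartbeat_interval = interval
  config.heartbeat_ttl = interval * 3 # keep TTL at 3x the interval, per the guideline above
end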
Production Recommendations
High-Availability Setup
Sidekiq::AssuredJobs.configure do |config|
  config.namespace = "#{Rails.application.class.module_parent_name.downcase}_assured_jobs"
  config.heartbeat_interval = 30          # Balanced load vs detection speed
  config.heartbeat_ttl = 90               # 3x heartbeat interval
  config.recovery_lock_ttl = 900          # 15 minutes for large recovery operations
  config.delayed_recovery_interval = 600  # 10 minutes between passes
  config.delayed_recovery_count = 2       # Two additional safety passes
end
Resource-Constrained Environment
Sidekiq::AssuredJobs.configure do |config|
  config.heartbeat_interval = 60    # Reduce Redis load
  config.heartbeat_ttl = 180        # 3x heartbeat interval
  config.delayed_recovery_count = 1 # Single delayed pass
end
Critical Systems (Maximum Reliability)
Sidekiq::AssuredJobs.configure do |config|
  config.heartbeat_interval = 15          # Fast orphan detection
  config.heartbeat_ttl = 45               # Quick failure detection
  config.delayed_recovery_interval = 300  # 5 minutes between passes
  config.delayed_recovery_count = 3       # Three additional passes
end
Delayed Recovery System
In addition to immediate orphan recovery on startup, the gem provides a configurable delayed recovery system for enhanced reliability:
Sidekiq::AssuredJobs.configure do |config|
  # Run 2 additional recovery passes, 10 minutes apart
  config.delayed_recovery_count = 2
  config.delayed_recovery_interval = 600 # 10 minutes
end
How Delayed Recovery Works:
- Immediate Recovery: On startup, each worker instance performs immediate orphan recovery
- Delayed Passes: After startup, a background thread runs additional recovery passes
- Configurable Timing: Control both the interval between passes and total number of passes
- Error Resilience: Each delayed recovery pass is wrapped in error handling to prevent thread crashes
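In spirit, the delayed passes amount to a background thread that sleeps, runs another recovery pass, and swallows errors so a failed pass cannot kill the thread. A rough sketch of that loop; recover_orphaned_jobs! is a hypothetical stand-in for whatever performs a single pass, and this is not the gem's actual implementation:
require "sidekiq"

# Illustrative only: run `count` extra recovery passes, `interval` seconds apart.
def start_delayed_recovery(count:, interval:)
  Thread.new do
    count.times do
      sleep interval
      begin
        recover_orphaned_jobs! # hypothetical: one full orphan-recovery pass
      rescue StandardError => e
        Sidekiq.logger.warn("Delayed recovery pass failed: #{e.message}")
      end
    end
  end
end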
Benefits:
- Enhanced Reliability: Catches jobs that might be missed during startup recovery
- Network Partition Recovery: Handles cases where Redis connectivity issues cause temporary orphaning
- Race Condition Mitigation: Provides additional safety net for edge cases
- Zero Application Impact: Runs in background threads without affecting job processing
Use Cases:
- High-Availability Systems: Where maximum job recovery reliability is critical
- Network-Unstable Environments: Where Redis connectivity might be intermittent
- Large-Scale Deployments: Where startup recovery might miss some edge cases
Advanced Features
Redis Integration
The gem provides flexible Redis integration options:
Default Configuration (Recommended)
By default, the gem uses Sidekiq's existing Redis connection pool:
# Uses Sidekiq's Redis configuration automatically
Sidekiq::AssuredJobs.configure do |config|
  config.namespace = "my_app_assured_jobs"
end
Custom Redis Configuration (Advanced)
For advanced use cases requiring Redis isolation:
Sidekiq::AssuredJobs.configure do |config|
  config.namespace = "my_app_assured_jobs"
  config.redis_options = {
    url: ENV['ASSURED_JOBS_REDIS_URL'],
    db: 2,
    timeout: 5
  }
end
Benefits
- Connection Efficiency: Reuses Sidekiq's connection pool by default
- Custom Namespacing: Efficient key prefixing without external dependencies
- Configuration Consistency: Inherits Sidekiq's Redis settings
- Flexible Options: Support for custom Redis when needed
SidekiqUniqueJobs Integration
The gem automatically integrates with sidekiq-unique-jobs to ensure orphaned unique jobs can be recovered immediately:
class UniquePaymentProcessor
  include Sidekiq::Worker
  include Sidekiq::AssuredJobs::Worker

  sidekiq_options unique: :until_executed

  def perform(payment_id)
    # This job will be tracked and can be recovered even with unique constraints
    process_payment(payment_id)
  end
end
Benefits:
- Immediate Recovery: Orphaned unique jobs are re-enqueued immediately (no waiting period)
- Automatic Detection: Works seamlessly whether SidekiqUniqueJobs is present or not
- Surgical Precision: Only clears locks for confirmed orphaned jobs
- Error Resilience: Continues operation even if lock clearing fails
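Conceptually, recovering a unique job means clearing its lock digest before the payload is pushed back; otherwise the re-enqueue would be rejected as a duplicate. A hedged sketch of that idea, assuming a sidekiq-unique-jobs 7.x style API (SidekiqUniqueJobs::Digests#delete_by_digest) and a lock_digest / unique_digest key in the stored payload; the gem's real integration may differ:
# Sketch only: clear a unique-jobs lock for an orphaned job before re-enqueueing it.
# Assumes sidekiq-unique-jobs 7.x; key names and APIs vary between versions.
def clear_unique_lock(job_payload)
  return unless defined?(SidekiqUniqueJobs)

  digest = job_payload["lock_digest"] || job_payload["unique_digest"]
  return unless digest

  SidekiqUniqueJobs::Digests.new.delete_by_digest(digest)
rescue StandardError => e
  Sidekiq.logger.warn("Could not clear unique lock #{digest}: #{e.message}")
end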
Production Deployment
Kubernetes Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-workers
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sidekiq-worker
  template:
    metadata:
      labels:
        app: sidekiq-worker
    spec:
      containers:
        - name: worker
          image: myapp:latest
          env:
            - name: ASSURED_JOBS_INSTANCE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name # Use pod name as instance ID
            - name: ASSURED_JOBS_NS
              value: "myapp_assured_jobs"
            - name: ASSURED_JOBS_HEARTBEAT_INTERVAL
              value: "30"
            - name: ASSURED_JOBS_HEARTBEAT_TTL
              value: "90"
            - name: ASSURED_JOBS_RECOVERY_LOCK_TTL
              value: "600"
            - name: ASSURED_JOBS_DELAYED_RECOVERY_INTERVAL
              value: "600" # 10 minutes
            - name: ASSURED_JOBS_DELAYED_RECOVERY_COUNT
              value: "2"
Docker Compose Example
version: '3.8'
services:
  worker:
    image: myapp:latest
    environment:
      - ASSURED_JOBS_INSTANCE_ID=${HOSTNAME}
      - ASSURED_JOBS_NS=myapp_assured_jobs
      - ASSURED_JOBS_HEARTBEAT_INTERVAL=30
      - ASSURED_JOBS_HEARTBEAT_TTL=90
      - ASSURED_JOBS_RECOVERY_LOCK_TTL=600
      - ASSURED_JOBS_DELAYED_RECOVERY_INTERVAL=600
      - ASSURED_JOBS_DELAYED_RECOVERY_COUNT=2
    deploy:
      replicas: 3
How It Works
- Instance Registration: Each worker instance generates a unique ID and sends periodic heartbeats to Redis
- Job Tracking: When a tracked job starts, the middleware records the job ID and payload in Redis
- Job Cleanup: When a job completes (success or failure), tracking data is removed
- Immediate Recovery: On startup, workers check for jobs tracked by dead instances (no recent heartbeat)
- Safe Recovery: Using distributed locking, one worker re-enqueues orphaned jobs back to Sidekiq
- Delayed Recovery: Background threads run additional recovery passes at configurable intervals
- Cleanup: Orphaned tracking data is removed after successful re-enqueuing
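To make this flow concrete, here is a compressed sketch of the tracking and recovery steps expressed as plain Redis operations. The key names mirror the diagram earlier in this README, but the class, constants, and helper below are illustrative only, not the gem's internals:
require "json"
require "securerandom"
require "sidekiq"

NS          = "sidekiq_assured_jobs"
INSTANCE_ID = ENV.fetch("ASSURED_JOBS_INSTANCE_ID") { SecureRandom.hex(6) }

# Server middleware: wraps every tracked job (Job Tracking / Job Cleanup).
class TrackingMiddleware
  def call(_worker, job, _queue)
    jid = job["jid"]
    Sidekiq.redis do |r|
      r.sadd("#{NS}:jobs:#{INSTANCE_ID}", jid)  # record the in-flight job id
      r.set("#{NS}:job:#{jid}", JSON.dump(job)) # keep the payload for a possible re-enqueue
    end
    yield
  ensure
    Sidekiq.redis do |r|                        # remove tracking on success or failure
      r.srem("#{NS}:jobs:#{INSTANCE_ID}", jid)
      r.del("#{NS}:job:#{jid}")
    end
  end
end

# Recovery: re-enqueue jobs still tracked by an instance whose heartbeat has expired
# (Immediate/Safe Recovery), then remove the stale tracking data (Cleanup).
def recover_orphans(dead_instance)
  Sidekiq.redis do |r|
    r.smembers("#{NS}:jobs:#{dead_instance}").each do |jid|
      payload = r.get("#{NS}:job:#{jid}")
      Sidekiq::Client.push(JSON.parse(payload)) if payload # back onto its original queue
      r.srem("#{NS}:jobs:#{dead_instance}", jid)
      r.del("#{NS}:job:#{jid}")
    end
  end
end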
Use Cases
Financial Services
class PaymentProcessor
  include Sidekiq::Worker
  include Sidekiq::AssuredJobs::Worker

  def perform(payment_id, amount)
    # Critical: Payment must be processed
    process_payment(payment_id, amount)
  end
end
Data Synchronization
class DataSyncWorker
  include Sidekiq::Worker
  include Sidekiq::AssuredJobs::Worker

  def perform(sync_batch_id)
    # Important: Data consistency depends on completion
    sync_data_batch(sync_batch_id)
  end
end
Email Delivery
class CriticalEmailWorker
  include Sidekiq::Worker
  include Sidekiq::AssuredJobs::Worker

  def perform(email_id)
    # Must deliver: Password resets, order confirmations, etc.
    deliver_critical_email(email_id)
  end
end
Testing
Run the test suite:
bundle exec rspec
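Workers in your own application that include AssuredJobs::Worker can still be exercised with Sidekiq's standard testing harness, since including the module does not change the perform interface. A minimal RSpec sketch reusing the PaymentProcessor example from the Quick Start:
require "sidekiq/testing"

RSpec.describe PaymentProcessor do
  before do
    Sidekiq::Testing.fake!   # queue jobs in an in-memory array instead of Redis
    Sidekiq::Worker.clear_all
  end

  it "enqueues the payment job" do
    expect {
      described_class.perform_async("pmt_123", 49.99)
    }.to change { described_class.jobs.size }.by(1)
  end

  # described_class.drain would execute the queued jobs inline,
  # assuming PaymentProcessor#process_payment is implemented.
end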
Dependencies
Runtime Dependencies
- sidekiq (>= 6.0, < 7)
- redis (~> 4.0)
Development Dependencies
- rspec (~> 3.0)
- bundler (~> 2.0)
- rubocop (~> 1.0)
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/example/sidekiq-assured-jobs.
License
The gem is available as open source under the terms of the MIT License.