DataShifter: backfills and one-off fixes as rake tasks. Dry run by default, auto rollback, progress bars, consistent summaries.

DataShifter

Rake-backed data migrations ("shifts") for Rails apps, with dry run by default, progress output, and a consistent summary. Define shift classes in lib/data_shifts/*.rb; run them as rake data:shift:<task_name>.

Installation

# Gemfile
gem "data_shifter"

bundle install

No extra setup in a Rails app: the railtie registers the generator and defines rake tasks by scanning lib/data_shifts/*.rb.

Quickstart

Generate a shift (optionally scoped to a model):

bin/rails generate data_shift backfill_foo
bin/rails generate data_shift backfill_users --model User

Add your logic to the generated file in lib/data_shifts/.

Run it:

# Dry run (the default): DB changes are rolled back
rake data:shift:backfill_foo
# Commit: DB changes are persisted
COMMIT=1 rake data:shift:backfill_foo

Defining a shift

Typical shifts implement:

  • collection: an ActiveRecord::Relation (uses find_each) or an Array/Enumerable
  • process_record(record): applies the change for one record

module DataShifts
  class BackfillCanceledById < DataShifter::Shift
    description "Backfill canceled_by_id"

    def collection
      Bar.where(canceled_by_id: nil).where.not(canceled_at: nil)
    end

    def process_record(bar)
      bar.update!(canceled_by_id: bar.company.primary_contact_id)
    end
  end
end

Dry run vs commit

Shifts run in dry run mode by default. In dry run, DB changes are rolled back automatically in every transaction mode (transaction :single / true, transaction :per_record, and transaction false / :none).

  • Dry run (default): rake data:shift:backfill_foo
  • Commit: COMMIT=1 rake data:shift:backfill_foo
    • (COMMIT=true or DRY_RUN=false also commit)
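The flag handling above can be pictured as one small decision function. This is a minimal sketch, not DataShifter's code: commit_requested? is a hypothetical helper, and the case-insensitive matching and whitespace stripping are assumptions.

```ruby
# Hypothetical helper (not part of DataShifter): decides whether a run
# should commit, based on the environment variables described above.
def commit_requested?(env = ENV)
  commit  = env["COMMIT"].to_s.strip.downcase
  dry_run = env["DRY_RUN"].to_s.strip.downcase

  # Assumption: COMMIT=1, COMMIT=true, or DRY_RUN=false switch to commit mode;
  # anything else leaves the run in dry run mode.
  %w[1 true].include?(commit) || dry_run == "false"
end

puts commit_requested?({})                       # => false
puts commit_requested?({ "COMMIT" => "1" })      # => true
puts commit_requested?({ "DRY_RUN" => "false" }) # => true
```

Defaulting to false when nothing is set mirrors the "dry run unless told otherwise" behavior the README describes.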

Automatic side-effect guards (dry run)

In dry run mode, DataShifter automatically blocks or fakes these side effects so unguarded code is less likely to hit the network or send mail/jobs:

| Service | Behavior in dry run |
| --- | --- |
| HTTP | Blocked via WebMock (disable_net_connect!). Allow specific hosts with allow_external_requests [...] or DataShifter.config.allow_external_requests. |
| ActionMailer | perform_deliveries = false (restored after run). |
| ActiveJob | Queue adapter set to :test (restored after run). |
| Sidekiq | Sidekiq::Testing.fake! (restored with disable! after run). Only applied if Sidekiq::Testing is already loaded. |
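
The "restored after run" behavior follows the usual save, override, restore-in-ensure pattern. The sketch below illustrates that pattern only. FakeMailerConfig and with_deliveries_disabled are hypothetical stand-ins so the example runs without Rails; they are not DataShifter or ActionMailer APIs.

```ruby
# Stand-in for a framework setting such as ActionMailer's perform_deliveries.
class FakeMailerConfig
  attr_accessor :perform_deliveries

  def initialize
    @perform_deliveries = true
  end
end

# Hypothetical guard: turns deliveries off for the block and restores the
# previous value afterwards, even if the block raises.
def with_deliveries_disabled(config)
  original = config.perform_deliveries
  config.perform_deliveries = false
  yield
ensure
  config.perform_deliveries = original
end

config = FakeMailerConfig.new
begin
  with_deliveries_disabled(config) do
    puts "inside: #{config.perform_deliveries}" # => inside: false
    raise "simulated failure mid-run"
  end
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts "after: #{config.perform_deliveries}"      # => after: true
```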

Guarding other side effects: For anything not covered above (another service, or an allowed HTTP request that mutates data), add an explicit guard such as return if dry_run? in your shift. DB changes are always rolled back in dry run; only non-DB side effects need this.

It is your responsibility to make only read-only requests while in dry run mode. To allow HTTP to specific hosts during dry run (e.g. a migration that must call an API to compute values), use the per-shift DSL or the global config:

# Per shift
module DataShifts
  class BackfillFromApi < DataShifter::Shift
    allow_external_requests ["api.readonly.example.com", %r{\.internal\.company\z}]
    # ...
  end
end

# Global (e.g. in config/initializers/data_shifter.rb)
DataShifter.configure do |config|
  config.allow_external_requests = ["api.readonly.example.com"]
end

Allowed hosts are combined (per-shift + global). The original WebMock, mail, and job settings are restored in an ensure block, so later code and other specs are unaffected.
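
To make "combined" concrete, here is a rough sketch of checking a host against the union of both allowlists. host_allowed? is a hypothetical helper and status.example.com is a made-up host. Exact string equality is an assumption; the gem delegates the real matching to WebMock, which may treat strings differently.

```ruby
# Hypothetical helper (not DataShifter's code): a host is allowed if any rule
# from either list matches. Strings match exactly; regexps use match?.
def host_allowed?(host, per_shift: [], global: [])
  (per_shift + global).any? do |rule|
    rule.is_a?(Regexp) ? rule.match?(host) : rule == host
  end
end

per_shift = ["api.readonly.example.com", %r{\.internal\.company\z}]
global    = ["status.example.com"] # made-up host for illustration

puts host_allowed?("billing.internal.company", per_shift: per_shift, global: global) # => true
puts host_allowed?("status.example.com", per_shift: per_shift, global: global)       # => true
puts host_allowed?("api.write.example.com", per_shift: per_shift, global: global)    # => false
```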

Transaction modes

Set the transaction mode at the class level:

  • transaction :single / transaction true (default): one DB transaction for the entire run; dry run rolls back at the end; a record error aborts the run.
  • transaction :per_record: in commit mode, each record runs in its own transaction (errors are collected and the run continues); in dry run, the run is wrapped in a single rollback transaction.
  • transaction false / transaction :none: no automatic transaction in commit mode. In dry run, the run is still wrapped in a single rollback transaction, so DB changes are never committed. Use when you have external side effects or your own transaction strategy in commit mode.

module DataShifts
  class BackfillLegacyId < DataShifter::Shift
    description "Per-record so one failure doesn't roll back all"
    transaction :per_record

    def collection = Item.where(legacy_id: nil)
    def process_record(item)
      item.update!(legacy_id: LegacyIdService.fetch(item))
    end
  end
end

module DataShifts
  class SyncToExternal < DataShifter::Shift
    description "Side effects outside DB"
    transaction false

    def process_record(record)
      return if dry_run?

      record.update!(synced_at: Time.current)
      ExternalAPI.notify(record)
    end
  end
end
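
The difference in error handling between :single and :per_record can be modeled without a database. The following is a simplified sketch: run_single and run_per_record are illustrative names, not DataShifter methods, and real runs would also wrap the work in ActiveRecord transactions.

```ruby
# :single style: any error propagates and aborts the whole run. In the real
# mode the surrounding transaction would also roll back earlier records.
def run_single(records)
  processed = []
  records.each do |record|
    yield record
    processed << record
  end
  processed
end

# :per_record style (commit mode): errors are collected per record and the
# run continues with the next one.
def run_per_record(records)
  succeeded = []
  errors = {}
  records.each do |record|
    begin
      yield record
      succeeded << record
    rescue StandardError => e
      errors[record] = e.message
    end
  end
  [succeeded, errors]
end

work = ->(n) { raise ArgumentError, "bad record #{n}" if n == 2 }

succeeded, errors = run_per_record([1, 2, 3], &work)
p succeeded   # => [1, 3]
p errors.keys # => [2]

begin
  run_single([1, 2, 3], &work)
rescue ArgumentError => e
  puts "run aborted: #{e.message}" # => run aborted: bad record 2
end
```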

Progress, status, and output

  • Progress bar: enabled by default (requires ruby-progressbar), and only shown for collections with at least 5 records.
  • Header: prints mode (DRY RUN vs LIVE), record count, transaction mode, and available status triggers.
  • Live status (without aborting):
    • STATUS_INTERVAL=60 prints a status block periodically (checked between records)
    • macOS/BSD: Ctrl+T (SIGINFO)
    • Any OS: kill -USR1 <pid> (SIGUSR1)
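
"Checked between records" means the interval is only evaluated after each record finishes, so a slow record delays the next status block. The sketch below assumes that reading. each_with_status and the injectable clock are hypothetical and exist only so the example runs instantly.

```ruby
# Illustrative loop (not DataShifter internals): after each record, emit a
# status line if at least `interval` seconds passed since the last one.
def each_with_status(records, interval:,
                     clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  last_status = clock.call
  status_lines = []
  records.each_with_index do |record, index|
    yield record
    now = clock.call
    next unless interval && now - last_status >= interval

    status_lines << "processed #{index + 1}/#{records.size}"
    last_status = now
  end
  status_lines
end

# Fake clock that advances 25 "seconds" on every call.
ticks = 0
fake_clock = -> { ticks += 25 }

lines = each_with_status(%w[a b c d], interval: 60, clock: fake_clock) { |r| r }
puts lines # => processed 3/4
```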

Resuming a partial run (CONTINUE_FROM)

If your collection is an ActiveRecord::Relation, you can resume by filtering the primary key:

CONTINUE_FROM=123 COMMIT=1 rake data:shift:backfill_foo

Notes:

  • Only supported for ActiveRecord::Relation collections. Array-based collections (like those from find_exactly!) cannot be resumed.
  • The filter is primary_key > CONTINUE_FROM, so it is only useful with monotonically increasing primary keys that are processed in ascending order (find_each's default behavior).
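
Because the filter is a strict "greater than", the value to pass is the last primary key that finished successfully; that record is not processed again. Here is a small sketch of the same filter on plain objects. Row and resume_from are hypothetical, and integer keys are assumed.

```ruby
Row = Struct.new(:id, :name)

# Hypothetical stand-in for the `primary_key > CONTINUE_FROM` filter, applied
# to plain objects instead of an ActiveRecord::Relation.
def resume_from(rows, continue_from)
  return rows if continue_from.nil? || continue_from.to_s.empty?

  threshold = Integer(continue_from) # assumption: integer primary keys
  rows.select { |row| row.id > threshold }.sort_by(&:id)
end

rows = [Row.new(121, "a"), Row.new(123, "b"), Row.new(124, "c"), Row.new(130, "d")]

p resume_from(rows, "123").map(&:id) # => [124, 130]
p resume_from(rows, nil).map(&:id)   # => [121, 123, 124, 130]
```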

How shift files map to rake tasks

DataShifter defines one rake task per file in lib/data_shifts/*.rb.

  • Task name: derived from the filename with any leading digits removed.
    • 20260201120000_backfill_foo.rb → data:shift:backfill_foo (leading <digits>_ prefix is stripped)
    • backfill_foo.rb → data:shift:backfill_foo
  • Class name: task name camelized, inside the DataShifts module.
    • backfill_foo → DataShifts::BackfillFoo
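
The naming rules can be reproduced in a few lines of plain Ruby. This is an illustrative re-implementation, not the gem's code. task_name_for and class_name_for are hypothetical names, and the simple capitalize-based camelization ignores ActiveSupport's acronym inflections.

```ruby
# Strips the directory, the .rb extension, and a leading "<digits>_" prefix.
def task_name_for(path)
  File.basename(path, ".rb").sub(/\A\d+_/, "")
end

# Camelizes the task name and places it inside the DataShifts module.
def class_name_for(task_name)
  camelized = task_name.split("_").map(&:capitalize).join
  "DataShifts::#{camelized}"
end

%w[
  lib/data_shifts/20260201120000_backfill_foo.rb
  lib/data_shifts/backfill_foo.rb
].each do |path|
  task = task_name_for(path)
  puts "#{File.basename(path)} -> data:shift:#{task} -> #{class_name_for(task)}"
end
```

Both files resolve to the same task name, which is why the generator refuses to create a second shift with a colliding name.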

Shift files are required only when the task runs (tasks are defined up front; classes load lazily). The description "..." line is extracted from the file and used for rake -T output without loading the shift class.
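
Reading the description without loading the class comes down to scanning the file's text. A minimal sketch of that idea follows. DESCRIPTION_PATTERN and extract_description are hypothetical, and the exact pattern the gem uses is not documented here.

```ruby
# Hypothetical extractor: finds the first `description "..."` line in source
# text without evaluating any Ruby code from the file.
DESCRIPTION_PATTERN = /^\s*description\s+(["'])(.*?)\1/

def extract_description(source)
  match = source.match(DESCRIPTION_PATTERN)
  match && match[2]
end

source = <<~SHIFT
  module DataShifts
    class BackfillCanceledById < DataShifter::Shift
      description "Backfill canceled_by_id"
    end
  end
SHIFT

puts extract_description(source) # => Backfill canceled_by_id
```

In a real task definition the source would come from File.read on the shift file; a heredoc is used here so the example is self-contained.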

Configuration

Configure DataShifter globally in an initializer:

# config/initializers/data_shifter.rb
DataShifter.configure do |config|
  # Hosts allowed for HTTP during dry run only (no effect in commit mode)
  config.allow_external_requests = ["api.readonly.example.com"]

  # Suppress repeated log messages during a shift run (default: true)
  config.suppress_repeated_logs = true

  # Max unique messages to track for deduplication (default: 1000)
  config.repeated_log_cap = 1000

  # Global default for progress bar visibility (default: true)
  config.progress_enabled = true

  # Default status print interval in seconds when ENV STATUS_INTERVAL is not set (default: nil)
  config.status_interval_seconds = nil
end
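
For intuition about how suppress_repeated_logs and repeated_log_cap interact, the sketch below models a capped de-duplication filter. RepeatedLogFilter is hypothetical. In particular, the choice to let untracked messages through once the cap is reached is an assumption, not documented gem behavior.

```ruby
require "set"

# Hypothetical model of capped log de-duplication (not DataShifter's code).
class RepeatedLogFilter
  def initialize(cap: 1000)
    @cap = cap
    @seen = Set.new
  end

  # Returns true when the message should be emitted.
  def emit?(message)
    return false if @seen.include?(message)

    # Assumption: once the cap is reached, new messages are no longer
    # tracked, so they are emitted every time they occur.
    @seen << message if @seen.size < @cap
    true
  end
end

filter = RepeatedLogFilter.new(cap: 2)
p %w[a a b c c].map { |message| filter.emit?(message) }
# => [true, false, true, true, true]
```

The cap bounds memory use on long runs at the cost of letting some repeats through.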

Per-shift overrides:

class MyShift < DataShifter::Shift
  progress false                # Disable progress bar for this shift
  suppress_repeated_logs false  # Disable log deduplication for this shift
end

Operational tips

Safety checklist (recommended)

  • Start with a dry run: run the task once with no environment variables set, confirm logs and summary look right, then re-run with COMMIT=1.
  • Make shifts idempotent: structure process_record so re-running is safe (for example, update only when the target column is NULL, or compute the same derived value deterministically).
  • Guard side effects we don't auto-block: use return if dry_run? for any side effect not covered by Automatic side-effect guards (see above).

Choosing a transaction mode (behavior + guidance)

  • transaction :single (default):
    • Behavior: the first raised error aborts the run (all-or-nothing).
    • Use when: partial success is worse than failure, or you want a clean rollback on any unexpected error.
  • transaction :per_record:
    • Behavior: in commit mode, records are committed one-by-one; errors are collected and the run continues; the overall run fails at the end if any record failed.
    • Use when: you want maximum progress and are OK investigating/fixing a subset of failures.
  • transaction false / :none:
    • Behavior: in commit mode, no automatic transaction; in dry run, the run is still wrapped in a rollback transaction so DB changes are not committed.
    • Use when: you have intentional external side effects or your own transaction/locking strategy in commit mode.

Performance and operability (recommended)

  • Prefer returning an ActiveRecord::Relation from collection for large datasets (DataShifter iterates relations with find_each).
  • Be aware that count runs up front on relations, to print the header and size the progress bar. On very large or expensive relations, that extra query may be non-trivial.
  • Use status output for long runs: set STATUS_INTERVAL in environments where signals are awkward (for example, some process managers).

Utilities for building shifts

find_exactly! (fail fast for ID lists)

Use find_exactly!(Model, ids) to fetch a fixed list and raise if any are missing:

def collection
  ids = ENV.fetch("BUYBACK_IDS").split(",").map(&:strip)
  find_exactly!(Buyback, ids)
end

def process_record(buyback)
  buyback.recompute!
end
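
The fail-fast idea can be shown against an in-memory store. This is a sketch of the described behavior only. fetch_exactly! and MissingRecordsError are hypothetical names, and the error class that find_exactly! actually raises is not stated in this README.

```ruby
# Hypothetical error class for the sketch.
class MissingRecordsError < StandardError; end

# Hypothetical fail-fast lookup: returns the records for all ids, or raises
# before returning anything if at least one id is missing.
def fetch_exactly!(store, ids)
  ids = ids.map(&:to_s).uniq
  missing = ids.reject { |id| store.key?(id) }
  raise MissingRecordsError, "missing ids: #{missing.join(', ')}" unless missing.empty?

  ids.map { |id| store.fetch(id) }
end

store = { "1" => { id: "1" }, "2" => { id: "2" } }

p fetch_exactly!(store, %w[1 2]).size # => 2

begin
  fetch_exactly!(store, %w[1 3])
rescue MissingRecordsError => e
  puts e.message # => missing ids: 3
end
```

Failing before any record is processed avoids a half-applied fix caused by a mistyped ID.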

skip! (count but don't update)

Mark a record as skipped. Calling skip! terminates the current process_record immediately (no return needed). The record is counted as "Skipped" in the summary.

def process_record(record)
  skip!("already done") if record.foo.present?
  record.update!(foo: value)  # not executed if skipped
end

Skip reasons are grouped: the summary shows the top 10 reasons by count (e.g. "already done" (42), "not eligible" (3)) instead of logging each skip inline. This keeps the progress bar clean.
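
One way to get "terminates immediately, no return needed" plus grouped reasons is Ruby's throw/catch with a tally. The sketch below assumes that mechanism. SkipTracker is hypothetical, and the gem may implement skip! differently (for example with an exception).

```ruby
# Illustrative model of skip! (not DataShifter's code).
class SkipTracker
  attr_reader :reasons, :processed

  def initialize
    @reasons = Hash.new(0)
    @processed = 0
  end

  # Unwinds out of the current record's block, carrying the reason.
  def skip!(reason)
    throw :skip_record, reason
  end

  def run(records)
    records.each do |record|
      reason = catch(:skip_record) do
        yield record, self
        nil # reached only when the block did not skip
      end
      if reason
        @reasons[reason] += 1
      else
        @processed += 1
      end
    end
  end

  # Most frequent skip reasons first, limited like the summary's top 10.
  def top_reasons(limit = 10)
    reasons.sort_by { |_reason, count| -count }.first(limit)
  end
end

tracker = SkipTracker.new
tracker.run([1, 2, 3, 4, 5]) do |n, t|
  t.skip!("already done") if n.even?
  t.skip!("not eligible") if n == 5
end

p tracker.processed   # => 2
p tracker.top_reasons # => [["already done", 2], ["not eligible", 1]]
```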

Throttling and disabling the progress bar

class SomeShift < DataShifter::Shift
  throttle 0.1       # sleep seconds between records
  progress false    # disable progress bar rendering
end

Generator

| Command | Generates |
| --- | --- |
| bin/rails generate data_shift backfill_foo | lib/data_shifts/<timestamp>_backfill_foo.rb with a DataShifts::BackfillFoo class |
| bin/rails generate data_shift backfill_users --model User | Same, with User.all in collection and process_record(user) |
| bin/rails generate data_shift backfill_users --spec | Also generates spec/lib/data_shifts/backfill_users_spec.rb when RSpec is enabled |

The generator refuses to create a second shift if it would produce a duplicate rake task name.

Testing shifts (RSpec)

This gem ships a small helper module for running shifts in tests. Require it and include DataShifter::SpecHelper in specs or in RSpec.configure for type: :data_shift.

Helpers:

  • run_data_shift(shift_class, dry_run: true, commit: false) — Runs the shift; returns an Axn::Result. Use commit: true to run in commit mode.
  • silence_data_shift_output — Suppresses STDOUT for the block (e.g. progress bar).
  • capture_data_shift_output — Runs the block and returns [result, output_string] for asserting on printed output.

Use expect { ... }.not_to change(...) and expect { ... }.to change(...) to assert that data stays unchanged in dry run and changes when committed:

require "data_shifter/spec_helper"

RSpec.describe DataShifts::BackfillFoo do
  include DataShifter::SpecHelper

  before { allow($stdout).to receive(:puts) }

  it "does not persist changes in dry run" do
    expect do
      result = run_data_shift(described_class, dry_run: true)
      expect(result).to be_ok
    end.not_to change(Foo, :count)
  end

  it "persists changes when committed" do
    expect do
      result = run_data_shift(described_class, commit: true)
      expect(result).to be_ok
    end.to change(Foo, :count).by(1)
    # Or for in-place updates: .to change { record.reload.bar }.from(nil).to("baz")
  end
end
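
For readers curious how a helper like capture_data_shift_output can return both a result and the printed text, the usual technique is to swap $stdout for a StringIO. This is a generic sketch of that technique, not the gem's implementation. capture_output is a hypothetical name.

```ruby
require "stringio"

# Hypothetical equivalent of an output-capturing helper: redirects $stdout
# into a buffer while the block runs and restores it afterwards.
def capture_output
  original = $stdout
  buffer = StringIO.new
  $stdout = buffer
  result = yield
  [result, buffer.string]
ensure
  $stdout = original
end

result, output = capture_output do
  puts "DRY RUN: 3 records"
  :ok
end

p result # => :ok
p output # => "DRY RUN: 3 records\n"
```

In a spec, the captured string can then be matched with expect(output).to include("DRY RUN").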

Requirements

  • Ruby ≥ 3.2.1
  • Rails (ActiveRecord, ActiveSupport, Railties) ≥ 7.0
  • axn (Shift classes include Axn)
  • ruby-progressbar (for progress bars)
  • webmock (for dry-run HTTP blocking; optional allowlist via allow_external_requests [...] / DataShifter.config.allow_external_requests)