DataShifter
Rake-backed data migrations ("shifts") for Rails apps, with dry run by default, progress output, and a consistent summary. Define shift classes in lib/data_shifts/*.rb; run them as rake data:shift:<task_name>.
Installation
# Gemfile
gem "data_shifter"
bundle install
No extra setup in a Rails app: the railtie registers the generator and defines rake tasks by scanning lib/data_shifts/*.rb.
Quickstart
Generate a shift (optionally scoped to a model):
bin/rails generate data_shift backfill_foo
bin/rails generate data_shift backfill_users --model User
Add your logic to the generated file in lib/data_shifts/.
Run it:
rake data:shift:backfill_foo
COMMIT=1 rake data:shift:backfill_foo
Defining a shift
Typical shifts implement:
- collection: an ActiveRecord::Relation (uses find_each) or an Array/Enumerable
- process_record(record): applies the change for one record
module DataShifts
class BackfillCanceledById < DataShifter::Shift
description "Backfill canceled_by_id"
def collection
Bar.where(canceled_by_id: nil).where.not(canceled_at: nil)
end
def process_record(bar)
bar.update!(canceled_by_id: bar.company.primary_contact_id)
end
end
end
Dry run vs commit
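Shifts run in dry run mode by default. DB changes made during a dry run are rolled back automatically in every transaction mode: transaction :single / true and transaction :per_record roll back as part of the run, and transaction false / :none is still wrapped in a rollback transaction while in dry run.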
- Dry run (default): rake data:shift:backfill_foo
- Commit: COMMIT=1 rake data:shift:backfill_foo
  - (COMMIT=true or DRY_RUN=false also commit)
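The environment switches above can be pictured as a small predicate. This is a minimal sketch, not DataShifter's actual implementation: the helper name commit_mode? is hypothetical, treating DRY_RUN=0 as "off" is an assumption, and the source does not say what happens when COMMIT and DRY_RUN disagree (here COMMIT is simply checked first).

```ruby
# Hypothetical helper: decides whether a run should commit, based on the
# environment switches described above. Not DataShifter's real code.
TRUTHY = %w[1 true].freeze   # COMMIT=1 and COMMIT=true are documented
FALSY  = %w[0 false].freeze  # DRY_RUN=false is documented; "0" is an assumption

def commit_mode?(env)
  commit  = env["COMMIT"].to_s.strip.downcase
  dry_run = env["DRY_RUN"].to_s.strip.downcase

  return true if TRUTHY.include?(commit)  # explicit opt-in to commit
  return true if FALSY.include?(dry_run)  # dry run explicitly turned off

  false                                   # default: dry run
end

puts commit_mode?({})                        # nothing set, so dry run
puts commit_mode?({ "COMMIT" => "1" })
puts commit_mode?({ "DRY_RUN" => "false" })
```

The helper takes any hash-like object, so passing ENV itself would work the same way as the plain hashes used here.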
Automatic side-effect guards (dry run)
In dry run mode, DataShifter automatically blocks or fakes these side effects so unguarded code is less likely to hit the network or send mail/jobs:
| Service | Behavior in dry run |
|---|---|
| HTTP | Blocked via WebMock (disable_net_connect!). Allow specific hosts with allow_external_requests [...] or DataShifter.config.allow_external_requests. |
| ActionMailer | perform_deliveries = false (restored after run). |
| ActiveJob | Queue adapter set to :test (restored after run). |
| Sidekiq | Sidekiq::Testing.fake! (restored with disable! after run). Only applied if Sidekiq::Testing is already loaded. |
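The "restored after run" behavior in the table follows a familiar override-and-restore pattern. The sketch below illustrates that pattern in plain Ruby under stated assumptions: FakeMailerConfig and with_deliveries_disabled are hypothetical names standing in for a real framework setting, and this is not DataShifter's own code.

```ruby
# Hypothetical stand-in for a framework setting such as a mailer's
# perform_deliveries flag.
class FakeMailerConfig
  attr_accessor :perform_deliveries

  def initialize
    @perform_deliveries = true
  end
end

# Override the setting, run the block, and put the previous value back in
# `ensure` so it is restored even when the block raises.
def with_deliveries_disabled(config)
  previous = config.perform_deliveries
  config.perform_deliveries = false
  yield
ensure
  config.perform_deliveries = previous
end

config = FakeMailerConfig.new
begin
  with_deliveries_disabled(config) do
    puts "inside: #{config.perform_deliveries}"
    raise "shift failed"
  end
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts "after: #{config.perform_deliveries}"
```

Saving the previous value, rather than hard-coding true, keeps the guard safe to use in environments where deliveries were already disabled.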
Guarding other side effects: For anything we don't cover (e.g. another service, or allowed HTTP that mutates), use e.g. return if dry_run? in your shift. DB changes are always rolled back in dry run; only non-DB side effects need this.
To allow HTTP to specific hosts during dry run (e.g. a migration that must call an API to compute values), use the per-shift DSL or global config. Note: it is your responsibility to make only read-only requests during a dry run:
# Per shift
module DataShifts
class BackfillFromApi < DataShifter::Shift
allow_external_requests ["api.readonly.example.com", %r{\.internal\.company\z}]
# ...
end
end
# Global (e.g. in config/initializers/data_shifter.rb)
DataShifter.configure do |config|
config.allow_external_requests = ["api.readonly.example.com"]
end
Allowed hosts are combined (per-shift + global). The overridden settings (WebMock, mail, jobs) are restored in an ensure block, so later code and other specs are unaffected.
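To make "combined (per-shift + global)" concrete, here is a small sketch of how a merged allowlist containing both strings and regexes could be checked. The helper host_allowed? and the host status.example.com are hypothetical, and exact-match semantics for string entries is an assumption; the real matching is delegated to WebMock and may differ.

```ruby
# Hypothetical helper: true when `host` matches any entry from either list.
# Assumption: strings must equal the host exactly; regexes are matched
# against the host name.
def host_allowed?(host, per_shift: [], global: [])
  (per_shift + global).any? do |entry|
    entry.is_a?(Regexp) ? entry.match?(host) : entry == host
  end
end

per_shift = ["api.readonly.example.com", %r{\.internal\.company\z}]
global    = ["status.example.com"] # hypothetical global entry

puts host_allowed?("svc.internal.company", per_shift: per_shift, global: global)
puts host_allowed?("status.example.com",   per_shift: per_shift, global: global)
puts host_allowed?("evil.example.com",     per_shift: per_shift, global: global)
```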
Transaction modes
Set the transaction mode at the class level:
- transaction :single / transaction true (default): one DB transaction for the entire run; dry run rolls back at the end; a record error aborts the run.
- transaction :per_record: in commit mode, each record runs in its own transaction (errors are collected and the run continues); in dry run, the run is wrapped in a single rollback transaction.
- transaction false / transaction :none: no automatic transaction in commit mode. In dry run, the run is still wrapped in a single rollback transaction so DB changes are never committed. Use when you have external side effects or your own transaction strategy in commit mode.
module DataShifts
class BackfillLegacyId < DataShifter::Shift
description "Per-record so one failure doesn't roll back all"
transaction :per_record
def collection = Item.where(legacy_id: nil)
def process_record(item)
item.update!(legacy_id: LegacyIdService.fetch(item))
end
end
end
module DataShifts
class SyncToExternal < DataShifter::Shift
description "Side effects outside DB"
transaction false
def process_record(record)
return if dry_run?
record.update!(synced_at: Time.current)
ExternalAPI.notify(record)
end
end
end
Progress, status, and output
- Progress bar: enabled by default (requires ruby-progressbar), and only shown for collections with at least 5 records.
- Header: prints mode (DRY RUN vs LIVE), record count, transaction mode, and available status triggers.
- Live status (without aborting):
  - STATUS_INTERVAL=60 prints a status block periodically (checked between records)
  - macOS/BSD: Ctrl+T (SIGINFO)
  - Any OS: kill -USR1 <pid> (SIGUSR1)
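The phrase "checked between records" suggests a trigger that is polled after each record rather than interrupting work mid-record. The following is a minimal sketch of that idea, assuming a hypothetical StatusTicker class with an injectable clock; a signal handler would only need to call request!. It is not DataShifter's implementation.

```ruby
# Hypothetical sketch of a status trigger that is polled between records.
class StatusTicker
  def initialize(interval_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval_seconds # nil means "no periodic status"
    @clock = clock
    @last_printed_at = clock.call
    @requested = false
  end

  # A signal handler (for example one installed for SIGUSR1) could call this.
  def request!
    @requested = true
  end

  # Call after each record; returns true when a status block is due.
  def due?
    now = @clock.call
    interval_elapsed = @interval && (now - @last_printed_at) >= @interval
    return false unless @requested || interval_elapsed

    @requested = false
    @last_printed_at = now
    true
  end
end

fake_time = 0
ticker = StatusTicker.new(interval_seconds: 60, clock: -> { fake_time })
printed_after = []
(1..5).each do |record_number|
  fake_time += 25 # pretend each record takes 25 seconds
  printed_after << record_number if ticker.due?
end
p printed_after
```

Injecting the clock keeps the example deterministic; with the default clock the same class would measure real elapsed time.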
Resuming a partial run (CONTINUE_FROM)
If your collection is an ActiveRecord::Relation, you can resume by filtering the primary key:
CONTINUE_FROM=123 COMMIT=1 rake data:shift:backfill_foo
Notes:
- Only supported for ActiveRecord::Relation collections (Array-based collections, like those from find_exactly!, cannot be resumed).
- The filter is primary_key > CONTINUE_FROM, so it's only useful when records are processed in ascending order of a monotonically increasing primary key (e.g. find_each's default behavior).
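Because the filter is a strict greater-than, the value to pass is the last primary key that was already processed. The database-free sketch below illustrates this; Record and each_from are hypothetical stand-ins for a relation iterated in ascending primary-key order, not part of DataShifter.

```ruby
Record = Struct.new(:id)

# Hypothetical stand-in for a relation walked in ascending primary-key
# order, with the documented `primary_key > CONTINUE_FROM` filter applied.
def each_from(records, continue_from: nil)
  sorted = records.sort_by(&:id)
  sorted = sorted.select { |record| record.id > continue_from } if continue_from
  sorted.each { |record| yield record }
end

records = [5, 1, 4, 2, 3].map { |id| Record.new(id) }

# Suppose a previous run stopped after finishing the record with id 2.
processed = []
each_from(records, continue_from: 2) { |record| processed << record.id }
p processed
```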
How shift files map to rake tasks
DataShifter defines one rake task per file in lib/data_shifts/*.rb.
- Task name: derived from the filename with any leading digits removed.
  - 20260201120000_backfill_foo.rb → data:shift:backfill_foo (leading <digits>_ prefix is stripped)
  - backfill_foo.rb → data:shift:backfill_foo
- Class name: task name camelized, inside the DataShifts module.
  - backfill_foo → DataShifts::BackfillFoo
Shift files are required only when the task runs (tasks are defined up front; classes load lazily).
The description "..." line is extracted from the file and used for rake -T output without loading the shift class.
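The naming rules and the description extraction can be approximated in a few lines of plain Ruby. This is a sketch under assumptions: the helper names are hypothetical, the camelization here ignores any acronym rules a Rails inflector might apply, and the regex assumes a double-quoted description on a single line.

```ruby
# Hypothetical helpers mirroring the naming rules described above.

# "lib/data_shifts/20260201120000_backfill_foo.rb" -> "backfill_foo"
def task_name_for(path)
  File.basename(path, ".rb").sub(/\A\d+_/, "")
end

# "backfill_foo" -> "DataShifts::BackfillFoo"
def class_name_for(task_name)
  camelized = task_name.split("_").map(&:capitalize).join
  "DataShifts::#{camelized}"
end

# Reads the first `description "..."` line from source text without
# evaluating the file. Assumption: double quotes, single line.
def description_from(source)
  source[/^\s*description\s+"([^"]*)"/, 1]
end

source = <<~RUBY
  module DataShifts
    class BackfillFoo < DataShifter::Shift
      description "Backfill foo"
    end
  end
RUBY

task = task_name_for("lib/data_shifts/20260201120000_backfill_foo.rb")
puts "data:shift:#{task}"
puts class_name_for(task)
puts description_from(source)
```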
Configuration
Configure DataShifter globally in an initializer:
# config/initializers/data_shifter.rb
DataShifter.configure do |config|
# Hosts allowed for HTTP during dry run only (no effect in commit mode)
config.allow_external_requests = ["api.readonly.example.com"]
# Suppress repeated log messages during a shift run (default: true)
config.suppress_repeated_logs = true
# Max unique messages to track for deduplication (default: 1000)
config.repeated_log_cap = 1000
# Global default for progress bar visibility (default: true)
config.progress_enabled = true
# Default status print interval in seconds when ENV STATUS_INTERVAL is not set (default: nil)
config.status_interval_seconds = nil
end
Per-shift overrides:
class MyShift < DataShifter::Shift
progress false # Disable progress bar for this shift
suppress_repeated_logs false # Disable log deduplication for this shift
end
Operational tips
Safety checklist (recommended)
- Start with a dry run: run the task once with no environment variables set, confirm logs and summary look right, then re-run with COMMIT=1.
- Make shifts idempotent: structure process_record so re-running is safe (for example, update only when the target column is NULL, or compute the same derived value deterministically).
- Guard side effects we don't auto-block: use return if dry_run? for any side effect not covered by Automatic side-effect guards (see above).
Choosing a transaction mode (behavior + guidance)
- transaction :single (default):
  - Behavior: the first raised error aborts the run (all-or-nothing).
  - Use when: partial success is worse than failure, or you want a clean rollback on any unexpected error.
- transaction :per_record:
  - Behavior: in commit mode, records are committed one-by-one; errors are collected and the run continues; the overall run fails at the end if any record failed.
  - Use when: you want maximum progress and are OK investigating/fixing a subset of failures.
- transaction false / :none:
  - Behavior: in commit mode, no automatic transaction; in dry run, the run is still wrapped in a rollback transaction so DB changes are not committed.
  - Use when: you have intentional external side effects or your own transaction/locking strategy in commit mode.
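The difference in error handling between :single and :per_record can be modeled without a database. The sketch below is only an illustration of the control flow described above (abort on first error versus collect and continue); run_records is a hypothetical function and no real transactions are involved.

```ruby
# Hypothetical, database-free model of the documented error handling.
# Returns [processed_records, error_messages].
def run_records(records, mode:)
  processed = []
  errors = []
  records.each do |record|
    raise ArgumentError, "bad record #{record}" if record.negative?

    processed << record
  rescue ArgumentError => e
    raise if mode == :single # first error aborts the whole run
    errors << e.message      # :per_record collects the error and continues
  end
  [processed, errors]
end

records = [1, -2, 3]

processed, errors = run_records(records, mode: :per_record)
p processed
p errors

begin
  run_records(records, mode: :single)
rescue ArgumentError => e
  puts "aborted: #{e.message}"
end
```

In the real :single mode the abort also rolls back earlier records, and in :per_record the run is reported as failed at the end when errors were collected; this sketch leaves both of those out.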
Performance and operability (recommended)
- Prefer returning an ActiveRecord::Relation from collection for large datasets (DataShifter iterates relations with find_each).
- Be aware that count runs up front for relations, to print the header and size the progress bar. On very large or expensive relations, that extra query may be non-trivial.
- Use status output for long runs: set STATUS_INTERVAL in environments where signals are awkward (for example, some process managers).
Utilities for building shifts
find_exactly! (fail fast for ID lists)
Use find_exactly!(Model, ids) to fetch a fixed list and raise if any are missing:
def collection
ids = ENV.fetch("BUYBACK_IDS").split(",").map(&:strip)
find_exactly!(Buyback, ids)
end
def process_record(buyback)
buyback.recompute!
end
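The "fetch exactly these IDs or raise" contract can be sketched against an in-memory store. Everything here is illustrative: STORE and find_exactly_sketch are hypothetical, and KeyError is a placeholder because the source does not say which error class find_exactly! raises.

```ruby
# Hypothetical in-memory stand-in for a model lookup.
STORE = { 1 => "buyback-1", 2 => "buyback-2", 3 => "buyback-3" }.freeze

# Returns the records for `ids` as an Array, or raises if any ID is missing.
# KeyError is a placeholder error class chosen for this sketch.
def find_exactly_sketch(store, ids)
  wanted = ids.map { |id| Integer(id) }
  missing = wanted - store.keys
  raise KeyError, "missing IDs: #{missing.join(', ')}" unless missing.empty?

  wanted.map { |id| store.fetch(id) }
end

p find_exactly_sketch(STORE, "1, 3".split(",").map(&:strip))

begin
  find_exactly_sketch(STORE, %w[2 9])
rescue KeyError => e
  puts e.message
end
```

Failing before any record is processed is the point of the helper: a typo in the ID list stops the run instead of silently updating a subset.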
skip! (count but don't update)
Mark a record as skipped. Calling skip! terminates the current process_record immediately (no return needed). The record is counted as "Skipped" in the summary.
def process_record(record)
skip!("already done") if record.foo.present?
record.update!(foo: value) # not executed if skipped
end
Skip reasons are grouped: the summary shows the top 10 reasons by count (e.g. "already done" (42), "not eligible" (3)) instead of logging each skip inline. This keeps the progress bar clean.
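One way a helper like skip! can end process_record without an explicit return is Ruby's throw/catch. The sketch below shows that mechanism together with grouped skip reasons; SkipSketch is a hypothetical class, and DataShifter may implement skip! differently.

```ruby
# Hypothetical runner: skip! unwinds out of process_record via throw/catch
# and tallies reasons for a grouped summary.
class SkipSketch
  attr_reader :skip_reasons, :updated

  def initialize
    @skip_reasons = Hash.new(0)
    @updated = []
  end

  def run(records)
    records.each do |record|
      catch(:skip_record) { process_record(record) }
    end
  end

  def skip!(reason)
    @skip_reasons[reason] += 1
    throw :skip_record # nothing after the skip! call runs for this record
  end

  def process_record(record)
    skip!("already done") if record[:foo]
    skip!("not eligible") unless record[:eligible]
    @updated << record[:id] # not reached when skipped
  end

  # Most frequent reasons first, limited to `limit` entries.
  def top_skip_reasons(limit = 10)
    @skip_reasons.sort_by { |_reason, count| -count }.first(limit)
  end
end

records = [
  { id: 1, foo: "x", eligible: true },
  { id: 2, foo: nil, eligible: true },
  { id: 3, foo: "y", eligible: true },
  { id: 4, foo: nil, eligible: false }
]
runner = SkipSketch.new
runner.run(records)
p runner.updated
p runner.top_skip_reasons
```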
Throttling and disabling the progress bar
class SomeShift < DataShifter::Shift
throttle 0.1 # sleep seconds between records
progress false # disable progress bar rendering
end
Generator
| Command | Generates |
|---|---|
| bin/rails generate data_shift backfill_foo | lib/data_shifts/<timestamp>_backfill_foo.rb with a DataShifts::BackfillFoo class |
| bin/rails generate data_shift backfill_users --model User | Same, with User.all in collection and process_record(user) |
| bin/rails generate data_shift backfill_users --spec | Also generates spec/lib/data_shifts/backfill_users_spec.rb when RSpec is enabled |
The generator refuses to create a second shift if it would produce a duplicate rake task name.
Testing shifts (RSpec)
This gem ships a small helper module for running shifts in tests. Require it and include DataShifter::SpecHelper in specs or in RSpec.configure for type: :data_shift.
Helpers:
- run_data_shift(shift_class, dry_run: true, commit: false): runs the shift; returns an Axn::Result. Use commit: true to run in commit mode.
- silence_data_shift_output: suppresses STDOUT for the block (e.g. progress bar).
- capture_data_shift_output: runs the block and returns [result, output_string] for asserting on printed output.
Use expect { ... }.not_to change(...) and expect { ... }.to change(...) to assert that data stays unchanged in dry run and changes when committed:
require "data_shifter/spec_helper"
RSpec.describe DataShifts::BackfillFoo do
include DataShifter::SpecHelper
before { allow($stdout).to receive(:puts) }
it "does not persist changes in dry run" do
expect do
result = run_data_shift(described_class, dry_run: true)
expect(result).to be_ok
end.not_to change(Foo, :count)
end
it "persists changes when committed" do
expect do
result = run_data_shift(described_class, commit: true)
expect(result).to be_ok
end.to change(Foo, :count).by(1)
# Or for in-place updates: .to change { record.reload.bar }.from(nil).to("baz")
end
end
Requirements
- Ruby ≥ 3.2.1
- Rails (ActiveRecord, ActiveSupport, Railties) ≥ 7.0
- axn (Shift classes include Axn)
- ruby-progressbar (for progress bars)
- webmock (for dry-run HTTP blocking; optional allowlist via allow_external_requests [...] / DataShifter.config.allow_external_requests)