DuckLake Ruby
🦆 DuckLake for Ruby
Run your own data lake with a SQL database and file/object storage
DuckLake::Client.new(
  catalog_url: "postgres://user:pass@host:5432/dbname",
  storage_url: "s3://my-bucket/"
)
Note: DuckLake is not considered production-ready at the moment
Installation
First, install libduckdb. For Homebrew, use:
brew install duckdb
Then add this line to your application’s Gemfile:
gem "ducklake"
Getting Started
Create a client - this one stores everything locally
ducklake = DuckLake::Client.new(
  catalog_url: "sqlite:///ducklake.sqlite",
  storage_url: "data_files/",
  create_if_not_exists: true
)
Create a table
ducklake.sql("CREATE TABLE events (id bigint, name text)")
Load data from a file
ducklake.sql("COPY events FROM 'data.csv'")
Confirm a new Parquet file was added to the data lake
ducklake.list_files("events")
Query the data
ducklake.sql("SELECT COUNT(*) FROM events").to_a
Catalog Database
Catalog information can be stored in:
- Postgres: postgres://user:pass@host:5432/dbname
- SQLite: sqlite:///path/to/dbname.sqlite
- DuckDB: duckdb:///path/to/dbname.duckdb
Note: MySQL and MariaDB are not currently supported due to duckdb/ducklake#70 and duckdb/ducklake#210
There are two ways to set up the schema:
- Run this script
- Configure the client to do it
DuckLake::Client.new(create_if_not_exists: true, ...)
Data Storage
Data can be stored in:
- Local files: data_files/
- Amazon S3: s3://my-bucket/path/
- Other providers: todo
Amazon S3
Credentials are detected in the standard AWS SDK locations, or you can pass them manually
DuckLake::Client.new(
  storage_options: {
    aws_access_key_id: "...",
    aws_secret_access_key: "...",
    region: "us-east-1"
  },
  ...
)
IAM permissions
- Read: s3:ListBucket, s3:GetObject
- Write: s3:ListBucket, s3:PutObject
- Maintenance: s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject
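The action lists above can be turned into an IAM policy document. A minimal sketch for the read case, built as a Ruby hash and serialized to JSON (the bucket name my-bucket is a placeholder; swap in the write or maintenance actions as needed):

```ruby
require "json"

# Illustrative only: a read-only policy for a hypothetical bucket.
# s3:ListBucket applies to the bucket ARN; object-level actions
# like s3:GetObject apply to the objects under it.
bucket = "my-bucket"

policy = {
  "Version" => "2012-10-17",
  "Statement" => [
    {
      "Effect" => "Allow",
      "Action" => ["s3:ListBucket"],
      "Resource" => "arn:aws:s3:::#{bucket}"
    },
    {
      "Effect" => "Allow",
      "Action" => ["s3:GetObject"],
      "Resource" => "arn:aws:s3:::#{bucket}/*"
    }
  ]
}

puts JSON.pretty_generate(policy)
```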
Operations
Create an empty table
ducklake.sql("CREATE TABLE events (id bigint, name text)")
Or a table from a file
ducklake.sql("CREATE TABLE events AS FROM 'data.csv'")
Load data from a file
ducklake.sql("COPY events FROM 'data.csv'")
You can also load data directly from other data sources
ducklake.attach("blog", "postgres://localhost:5432/blog")
ducklake.sql("INSERT INTO events SELECT * FROM blog.ahoy_events")
Or register existing data files
ducklake.add_data_files("events", "data.parquet")
Note: This transfers ownership to the data lake, so the file may be deleted as part of maintenance
Update data
ducklake.sql("UPDATE events SET name = ? WHERE id = ?", ["Test", 1])
Delete data
ducklake.sql("DELETE FROM events WHERE id = ?", [1])
Update the schema
ducklake.sql("ALTER TABLE events ADD COLUMN active BOOLEAN")
Snapshots
Get snapshots
ducklake.snapshots
Query the data at a specific snapshot version or time
ducklake.sql("SELECT * FROM events AT (VERSION => ?)", [3])
# or
ducklake.sql("SELECT * FROM events AT (TIMESTAMP => ?)", [Date.today - 7])
You can also specify a snapshot when creating the client
DuckLake::Client.new(snapshot_version: 3, ...)
# or
DuckLake::Client.new(snapshot_time: Date.today - 7, ...)
Maintenance
Merge files
ducklake.merge_adjacent_files
Expire snapshots
ducklake.expire_snapshots(older_than: Date.today - 7)
Clean up old files
ducklake.cleanup_old_files(older_than: Date.today - 7)
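The three calls above can be combined into a routine maintenance task. A minimal sketch with a weekly retention window (the method and constant names here are illustrative, not part of the gem; only the three client calls come from this section):

```ruby
require "date"

RETENTION_DAYS = 7  # illustrative retention window

def maintenance_cutoff(today = Date.today)
  today - RETENTION_DAYS
end

# Runs maintenance in order: compact small files first, then expire
# snapshots past the retention window, then delete data files that
# no remaining snapshot references.
def run_maintenance(ducklake, today: Date.today)
  cutoff = maintenance_cutoff(today)
  ducklake.merge_adjacent_files
  ducklake.expire_snapshots(older_than: cutoff)
  ducklake.cleanup_old_files(older_than: cutoff)
end
```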
Configuration
Get options
ducklake.options
Set an option globally
ducklake.set_option("parquet_compression", "zstd")
Or for a specific table
ducklake.set_option("parquet_compression", "zstd", table_name: "events")
Read-Only Mode
Note: This feature is experimental and does not prevent the DuckDB engine from writing files via sql
Attach the catalog in read-only mode
DuckLake::Client.new(read_only: true, ...)
Use read-only credentials for the catalog database and the storage provider, and disable external access
You should also consider disabling community extensions and locking the configuration
ducklake.sql("SET allow_community_extensions = false")
ducklake.sql("SET lock_configuration = true")
External Access
Restrict external access to the DuckDB engine
ducklake.disable_external_access
Allow specific directories and paths
ducklake.disable_external_access(
  allowed_directories: ["/path/to/directory"],
  allowed_paths: ["/path/to/file.txt"]
)
The storage URL is automatically included in allowed_directories
SQL Safety
Use parameterized queries when possible
ducklake.sql("SELECT * FROM events WHERE id = ?", [1])
For places that do not support parameters, use quote or quote_identifier
quoted_table = ducklake.quote_identifier("events")
quoted_file = ducklake.quote("path/to/data.csv")
ducklake.sql("COPY #{quoted_table} FROM #{quoted_file}")
Polars
Note: This feature is experimental and does not work on tables with schema changes
Query the data with Ruby Polars
ducklake.polars("events")
Specify a snapshot
ducklake.polars("events", snapshot_version: 3)
# or
ducklake.polars("events", snapshot_time: Date.today - 7)
Reference
Get table info
ducklake.table_info
Get column info
ducklake.column_info("events")
Drop a table
ducklake.drop_table("events")
# or
ducklake.drop_table("events", if_exists: true)
List files
ducklake.list_files("events")
List files at a specific snapshot version or time
ducklake.list_files("events", snapshot_version: 3)
# or
ducklake.list_files("events", snapshot_time: Date.today - 7)
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/ducklake-ruby.git
cd ducklake-ruby
bundle install
# Postgres
createdb ducklake_ruby_test
bundle exec rake test:postgres
# MySQL and MariaDB
mysqladmin create ducklake_ruby_test
bundle exec rake test:mysql
# SQLite
bundle exec rake test:sqlite
# DuckDB
bundle exec rake test:duckdb