Project

ducklake

0.0
No release in over 3 years
DuckLake for Ruby
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
 Dependencies

Runtime

>= 0
 Project Readme

DuckLake Ruby

🦆 DuckLake for Ruby

Run your own data lake with a SQL database and file/object storage

DuckLake::Client.new(
  catalog_url: "postgres://user:pass@host:5432/dbname",
  storage_url: "s3://my-bucket/"
)

Learn more

Note: DuckLake is not considered production-ready at the moment

Build Status

Installation

First, install libduckdb. For Homebrew, use:

brew install duckdb

Then add this line to your application’s Gemfile:

gem "ducklake"

Getting Started

Create a client - this one stores everything locally

ducklake =
  DuckLake::Client.new(
    catalog_url: "sqlite:///ducklake.sqlite",
    storage_url: "data_files/",
    create_if_not_exists: true
  )

Create a table

ducklake.sql("CREATE TABLE events (id bigint, name text)")

Load data from a file

ducklake.sql("COPY events FROM 'data.csv'")

Confirm a new Parquet file was added to the data lake

ducklake.list_files("events")

Query the data

ducklake.sql("SELECT COUNT(*) FROM events").to_a

Catalog Database

Catalog information can be stored in:

  • Postgres: postgres://user@pass@host:5432/dbname
  • SQLite: sqlite:///path/to/dbname.sqlite
  • DuckDB: duckdb:///path/to/dbname.duckdb

Note: MySQL and MariaDB are not currently supported due to duckdb/ducklake#70 and duckdb/ducklake#210

There are two ways to set up the schema:

  1. Run this script
  2. Configure the client to do it
DuckLake::Client.new(create_if_not_exists: true, ...)

Data Storage

Data can be stored in:

  • Local files: data_files/
  • Amazon S3: s3://my-bucket/path/
  • Other providers: todo

Amazon S3

Credentials are detected in the standard AWS SDK locations, or you can pass them manually

DuckLake::Client.new(
  storage_options: {
    aws_access_key_id: "...",
    aws_secret_access_key: "...",
    region: "us-east-1"
  },
  ...
)

IAM permissions

  • Read: s3::ListBucket, s3::GetObject
  • Write: s3::ListBucket, s3::PutObject
  • Maintenance: s3::ListBucket, s3::GetObject, s3::PutObject, s3::DeleteObject

Operations

Create an empty table

ducklake.sql("CREATE TABLE events (id bigint, name text)")

Or a table from a file

ducklake.sql("CREATE TABLE events AS FROM 'data.csv'")

Load data from a file

ducklake.sql("COPY events FROM 'data.csv'")

You can also load data directly from other data sources

ducklake.attach("blog", "postgres://localhost:5432/blog")
ducklake.sql("INSERT INTO events SELECT * FROM blog.ahoy_events")

Or register existing data files

ducklake.add_data_files("events", "data.parquet")

Note: This transfers ownership to the data lake, so the file may be deleted as part of maintenance

Update data

ducklake.sql("UPDATE events SET name = ? WHERE id = 1", ["Test", 1])

Delete data

ducklake.sql("DELETE * FROM events WHERE id = ?", [1])

Update the schema

ducklake.sql("ALTER TABLE events ADD COLUMN active BOOLEAN")

Snapshots

Get snapshots

ducklake.snapshots

Query the data at a specific snapshot version or time

ducklake.sql("SELECT * FROM events AT (VERSION => ?)", [3])
#
ducklake.sql("SELECT * FROM events AT (TIMESTAMP => ?)", [Date.today - 7])

You can also specify a snapshot when creating the client

DuckLake::Client.new(snapshot_version: 3, ...)
# or
DuckLake::Client.new(snapshot_time: Date.today - 7, ...)

Maintenance

Merge files

ducklake.merge_adjacent_files

Expire snapshots

ducklake.expire_snapshots(older_than: Date.today - 7)

Clean up old files

ducklake.cleanup_old_files(older_than: Date.today - 7)

Configuration

Get options

ducklake.options

Set an option globally

ducklake.set_option("parquet_compression", "zstd")

Or for a specific table

ducklake.set_option("parquet_compression", "zstd", table_name: "events")

Read-Only Mode

Note: This feature is experimental and does not prevent the DuckDB engine from writing files via sql

Attach the catalog in read-only mode

DuckLake::Client.new(read_only: true, ...)

Use read-only credentials for catalog database and storage provider and disable external access

You should also consider disabling community extensions

ducklake.sql("SET allow_community_extensions = false")

And locking the configuration

ducklake.sql("SET lock_configuration = true")

External Access

Restrict external access to the DuckDB engine

ducklake.disable_external_access

Allow specific directories and paths

ducklake.disable_external_access(
  allowed_directories: ["/path/to/directory"],
  allowed_paths: ["/path/to/file.txt"]
)

The storage URL is automatically included in allowed_directories

SQL Safety

Use parameterized queries when possible

ducklake.sql("SELECT * FROM events WHERE id = ?", [1])

For places that do not support parameters, use quote or quote_identifier

quoted_table = ducklake.quote_identifier("events")
quoted_file = ducklake.quote("path/to/data.csv")
ducklake.sql("COPY #{quoted_table} FROM #{quoted_file}")

Polars

Note: This feature is experimental and does not work on tables with schema changes

Query the data with Ruby Polars

ducklake.polars("events")

Specify a snapshot

ducklake.polars("events", snapshot_version: 3)
# or
ducklake.polars("events", snapshot_time: Date.today - 7)

Reference

Get table info

ducklake.table_info

Get column info

ducklake.column_info("events")

Drop a table

ducklake.drop_table("events")
# or
ducklake.drop_table("events", if_exists: true)

List files

ducklake.list_files("events")

List files at a specific snapshot version or time

ducklake.list_files("events", snapshot_version: 3)
# or
ducklake.list_files("events", snapshot_time: Date.today - 7)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/ankane/ducklake-ruby.git
cd ducklake-ruby
bundle install

# Postgres
createdb ducklake_ruby_test
bundle exec rake test:postgres

# MySQL and MariaDB
mysqladmin create ducklake_ruby_test
bundle exec rake test:mysql

# SQLite
bundle exec rake test:sqlite

# DuckDB
bundle exec rake test:duckdb