Repository is archived
No commit activity in last 3 years
No release in over 3 years
Simple Data Warehouse toolkit for ruby
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

>= 0
>= 0
>= 0
~> 2.0
>= 0

Runtime

 Project Readme

Chicago

Chicago is a small-scale data warehouse library written in Ruby.

This library focuses on the following:

  • Defining a model that represents a Star Schema.
  • Creation of migrations to manipulate one or more concrete database schemas to reflect the model.
  • Querying data in a Star Schema.

It heavily uses and is influenced by the sequel library. This library itself provides no ETL functionality; there is a highly unfinished/experimental library for this at chicago-etl

Installation

gem install chicagowarehouse

Star Schema Features

If you are new to star schemas in general, I recommend reading The Data Warehouse Toolkit by Ralph Kimball & Margy Ross. Briefly though, a star schema is a sem-denormalised style of database design optimized for reporting-style queries. In a star schema, there are two types of tables:

  • Fact tables store measures that can be summed or averaged, together with keys to Dimensions.
  • Dimension tables store denormalised data with which you may want to group or filter facts.

Generally speaking, the only links are between 1 fact table and several dimension tables: facts are not joined to facts, dimensions are not joined to dimensions - this gives the star schema its name, as you can see in the picture below:

Star Schema

Not all standard star schema features are implemented yet - probably the most important ommission is that Type II and Type III dimensions are not supported (dimensions that retain historic information).

Features that are supported include:

  • Fact & Type I Dimensions tables
  • Measures, additive & semi-additive
  • Degenerate Dimensions
  • Null Records
  • Dimensions with predetermined values
  • Key tables for ETL processes, supporting both single integer keys and hash keys for dimensions without an original primary key
  • Calculated columns

Example Usage

Defining models:

require 'chicago'

SCHEMA = Chicago::StarSchema.new

SCHEMA.define_dimension(:product) do
  columns do
    integer :original_id
    string :name, :default => "Unknown Product"
    money :price
    # ...
  end

  identified_by :name
  natural_key :original_id
  null_record :original_id => 0
end

# ...

SCHEMA.define_fact(:sales) do
  dimensions :product, :customer

  degenerate_dimensions do
    integer :order_id
  end

  measures do
    money :total
    integer :number_of_items
  end
end

Once you have a schema defined, you can use the inbuilt tasks to create the underlying database schema in your Rakefile:

DB = Sequel.connect() # with your db connection parameters
Chicago::RakeTasks.new(SCHEMA, :staging_db => DB)

Which gives the following tasks:

  • db:write_migrations - creates migration files in a migrations directory - use the standard Sequel migration tool to run them;
  • db:create_null_records - creates null records in the database.

After you have some data, you can query your schema like so:

query = {
  :table_name => "sales",
  :query_type => "fact",
  :columns => ["sales.product.name",
               {:column => "sales.total", :op => "sum"}]
}

sequel_dataset = Chicago::Query.new(SCHEMA, query).dataset

# Do whatever you like with this dataset, for example just returning
# all records:
sequel_dataset.all

The query hash is intended to be easy to represent in JSON. Notice how unlike just using base Sequel, you didn't have to define any JOIN or GROUP BY parts to the query - these are inferred from the model.

Contributing

To get specs passing, you'll need to create a test myqsl database, and copy spec/database.yml.dist to spec/database.yml and populate it appropriately.