ActiveDataFrame

ActiveDataFrame allows efficient writing, reading, and analytical queries on large tables of numerical data. You can think of it as a persistent NumPy or NArray with good support for slicing and aggregates without needing the entire dataset in memory.

The library depends on ActiveRecord and currently supports the following relational databases:

PostgreSQL
MySQL
SQLite

Installation

Add this line to your application's Gemfile:

gem 'active_data_frame'

And then execute:

$ bundle

Or install it yourself as:

$ gem install active_data_frame

Examples

Using the generator

# Generate a new data frame named Statistic, with a datapoint type of double, and a block size of 100
rails generate active_data_frame:install Statistic double 100

# Then run migrations to create the underlying table
rake db:migrate

Usage

Generator

The easiest way to get started is to use the in-built generator to generate a new ActiveDataFrame. This will generate the required migrations for the data frame and generate a new module that you can include inside an ActiveRecord model to give it access to the frame.

# Generate a new MeterReading data frame type, with a block type of
# double and a block size of 48 data points

rails generate active_data_frame:install MeterReading double 48

# Generate a new Dimension data frame type, with a block type of
# float and a block size of 10 data points.
# Inject the data-type for use into the Iris model

rails generate active_data_frame:install Dimension float 10 Iris

#
# Generate a new status data frame type with an integer block type
#
rails generate active_data_frame:install Status integer

Writing to a data frame

When you include a data frame in an ActiveRecord model, each instance of the model corresponds to a single row in the data frame. The columns are a series of points that stretch towards infinity in each direction.

By default columns are indexed by integers, but you can set a static or dynamic column map so that you can easily have columns indexed by time, enum columns or use any other data type that serves as a useful index.

You can write any number of data points to a row in the dataframe using #[]=

#E.g.
# Write to the row called readings from index 0. Here Sensor is the ActiveRecord model, readings is the name of the row
Sensor.first.readings[0] = 1,2,3

# Write to the row called readings from an offset at 1_000_000
Sensor.first.readings[1_000_000] = -10, -9, -8

#Writing to a row which has a column mapping applied, mapping times on integer indexes
MeterChannel.first.readings['2001-01-01'] = [1.3, 3.4]

#If you have enum columns you can use the #[enum_name]= setter instead.
Iris.first.dimensions.sepal_length = 5.3
Iris.first.dimensions.petal_width  = 4.3

# You can set data for multiple rows at once, by using the frame accessor on the model's class instead of an instance.

E.g.
# This sets the reading at index 1 to 5 for ALL sensors
Sensor.readings[1] = 5

# You can use AR queries to refine which set of rows you are updating at once.
# E.g.
MeterChannel.where("created_at < ?", "2001-01-01").readings['2001-01-01'] = [5,6,7]

ActiveDataFrame supports very quick writing of 1000's of values for a single row at a time. Don't be afraid to write large arrays of data like this.

Reading from a data frame

Reading from a data frame is similar to writing and uses the #[] method. You can read individual values, a range of values, and sparse selections of columns.

#E.g.
# Read a single value
Sensor.first.readings[0] # => Matrix(1x1)[...]

# Read a range of 3 values values
Sensor.first.readings[0...3] # => Matrix(1x3)[...]

# Read some non contiguous values and ranges
Sensor.first.readings[5, 10, 4..7, 9..10] = Matrix(1x8)[...]

#Reading from a row which has a column mapping that uses times
MeterChannel.first.readings['2001-01-01'...'2002-01-01'] = Matrix(1xM)[....]

#If you have enum columns you can use the #[enum_name] getter for single columns
Iris.first.dimensions.sepal_length
Iris.first.dimensions.petal_width

# And use symbols as column indices (this assumes a specific ordering of enum columns)
Iris.first.dimensions[:sepal_length...:petal_width]

Similar to when writing data, you can also read data from multiple rows at once. Just use the active data frame accessor on the model class instead of a model instance. E.g.

Sensor.readings[0..5] # => Matrix(Nx5)

Deleting

You can use #clear(range_or_indices) to delete data.


Deleting data is equivalent to setting all data points to zero.
So the operation row[index] = [0, 0, 0, 0.....0] is equivalent
to the operation row.clear(index...end_index). ActiveDataFrame
will automatically trim empty blocks.

Batching

If performing many small reads and writes from a data frame in a single atomic operation it makes sense to do this in a single transaction. Active Data Frame provides the ActiveDataFrame::Database.batch do ... end method. This method will not only ensure your operations occur in a single transaction, but also that they are sent to the underlying database adapter as a single command.

Analytical Queries

Any read of a dataframe returns an RMatrix instance. An RMatrix supports a large number of statistical methods and list methods. (See the RMatrix readme for more details). E.g.

cpu_loads = CPU.first.loads['2001-01-01'..'2005-01-01']
puts cpu_loads.avg
puts cpu_loads.stddev
puts cpu_loads.max
# ... and many more

However in some cases you are dealing with so much data it is not possible, or too slow to retreive all the data at once and manipulate in-memory. ActiveDataFrame supports performing a number of aggregate methods directly in the database. These are #avg, #min, #max and #sum. The syntax for this is almost identical to an ordinary read.

CPU.loads.avg['2001-01-01'...'2005-01-01'] # The average CPU load per period over all CPUS

CPU.where(manufacturer: :intel).loads.min['2001-01-01'...'2005-01-01'] # The minimum CPU load per period over all intel CPUS

Categorical data

ActiveDataFrame provides a very basic abstraction for storing categorical data. This is done by storing categories as an integer data frame, and providing a map from integers to categories. The library will then allow you to use the category names in place of the raw underlying integers. E.g.

module HasStatus
  include ActiveDataFrame::HasDataFrame('status', Blocks::StatusBlock, value_map: {
    actual: 2,
    estimated: 1,
    unknown: 0
  })
end

class CPU < ApplicationRecord
  include HasStatus
end

The CPU model above includes a dataframe with a status mapping. We can now do things like

CPU.first.status[0]    # => :unknown
CPU.first.status[0..5] # => [:unknown,:unknown,:unknown,:unknown,:unknown]

CPU.first.status[0] = :actual, :estimated
CPU.first.status[0..5] # => [:actual,:estimated,:unknown,:unknown,:unknown]

Time-series data

We can use any datatype we like to index into a dataframe, so long as we can map it to an integer index. This makes active dataframes very well suited to storing large streams of interval data over time.

For example we might define a mapping such that every half hour period in time corresponds to a colum in our dataframe. In the below example we might be counting the number of arrivals at an airport every half-hour.

module HasArrivals
  include ActiveDataFrame::HasDataFrame('arrivals', Blocks::ArrivalBlock)
  module ColumnMaps
    def self.included(base)
      base.arrivals_column_map Hash.new{|hash, time| ((time.to_time - Time.at(0)) / 1.hour).to_i rescue time.to_i }
    end
  end
end

class Airport < ApplicationRecord
  include HasArrivals::ColumnMaps, HasArrivals
end

Now we can use any value that implements #to_time to index into our dataframe. This supports both single indexes and ranges (...). E.g.

Airport.first.arrivals['2001-01-01'...'2002-01-01'] = Matrix(1xM)[....]

Column Mappings

We can use any datatype we like to index into a dataframe, so long as we can map it to an integer index. See the section on Time-series data for one example of this. Columns can also be aliases to categories. An example of this is using ActiveDataFrame to model the classic Iris dataset.

class Iris < ApplicationRecord
  include HasDimensions
  dimension_column_names %i(sepal_length sepal_width petal_length petal_width)
end

Here we have mapped the first four columns of our data frame to sepal_length, sepal_width, petal_length and petal_width.

When using symbols as column names ActiveDataFrame provides some syntactic sugar for easily slicing and dicing frames.

We can do things like:

Extract a slice of data:

iris_results = Iris.where(species: :setosa).dimension[:sepal_width..:petal_length]
Extract an entire column from a data-set using the column name:

iris_results.sepal_width => V[[...]]
Extract an entire column from a data-set using the column name:

iris_results.sepal_width => V[[...]]
Extract a single value from an instance:

Iris.first.dimension.sepal_width.to_f
Set one or more values for an instance or row at once:

Iris.first.dimension.sepal_width = 13 Iris.all.dimension.petal_length = 5.2,6.3,5.4,1.1

Configuration

ActiveDataFrame supports project-wide configuration using

ActiveDataFrame.config do |config|
  config.[config_option_name] = [config_value]
end

Currently the following configuration options are supported:

suppress_logs The queries generated by ActiveDataFrame are quite verbose. If you would like to supress ActiveRecord logging for these queries, set this option to true

Development

After checking out the repo, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Testing

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/active_data_frame. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.