parqueteur

Convert data to Parquet
Parqueteur

Parqueteur enables you to generate Apache Parquet files from raw data.

Dependencies

Since I have only tested Parqueteur on Ubuntu, I don't have install scripts for other operating systems.

Debian/Ubuntu packages

  • libgirepository1.0-dev
  • libarrow-dev
  • libarrow-glib-dev
  • libparquet-dev
  • libparquet-glib-dev

See the scripts/apache-arrow-ubuntu-install.sh script for a quick way to install all of them.

Installation

Add this line to your application's Gemfile:

gem 'parqueteur', '~> 1.0'

(Optional) If you don't want Parqueteur to be required automatically, add require: false to the Gemfile entry:

gem 'parqueteur', '~> 1.0', require: false

And then execute:

$ bundle install

Or install it yourself as:

$ gem install parqueteur

Usage

Parqueteur provides an elegant way to generate Apache Parquet files from a defined schema.

Converters accept any object that implements Enumerable as a data source.
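Because any Enumerable is accepted, the data source does not have to be an Array. The sketch below shows a custom source class; OrderSource and OrderConverter are hypothetical names used only for illustration, and the row shape (hashes keyed by column name as strings) is assumed from the examples in this README.

```ruby
# A hypothetical data source: any class that defines #each and includes
# Enumerable can stand in for an Array of rows.
class OrderSource
  include Enumerable

  def initialize(count)
    @count = count
  end

  # Yield one row hash at a time, keyed by column name.
  def each
    return enum_for(:each) unless block_given?

    1.upto(@count) do |i|
      yield({ 'id' => i, 'reference' => "order-#{i}" })
    end
  end
end

source = OrderSource.new(3)
rows = source.to_a

# With the gem installed, the source would be passed in place of an Array
# (OrderConverter is a hypothetical Parqueteur::Converter subclass):
#   OrderConverter.new(source, compression: :gzip).write('orders.parquet')
```

A source like this can produce rows on demand (for example from a database cursor) instead of building the whole Array up front.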

Working example

require 'parqueteur'

class FooParquetConverter < Parqueteur::Converter
  column :id, :bigint
  column :reference, :string
  column :datetime, :timestamp
end

data = [
  { 'id' => 1, 'reference' => 'hello world 1', 'datetime' => Time.now },
  { 'id' => 2, 'reference' => 'hello world 2', 'datetime' => Time.now },
  { 'id' => 3, 'reference' => 'hello world 3', 'datetime' => Time.now }
]

# initialize Converter with Parquet GZIP compression mode
converter = FooParquetConverter.new(data, compression: :gzip)

# write result to file
converter.write('hello_world.parquet')

# in-memory result (StringIO)
converter.to_io

# write to temporary file (Tempfile)
# don't forget to `close` / `unlink` it after usage
converter.to_tmpfile

# convert to Arrow::Table
pp converter.to_arrow_table
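The close / unlink reminder for to_tmpfile can be handled with an ensure block. This is a minimal sketch that uses a plain Tempfile from the standard library as a stand-in for the object returned by converter.to_tmpfile, so it runs without the Arrow libraries installed.

```ruby
require 'tempfile'

# Stand-in for `converter.to_tmpfile`, which the README says returns a Tempfile.
tmpfile = Tempfile.new(['hello_world', '.parquet'])
path = tmpfile.path

begin
  # Use the file while it exists, e.g. upload it or copy it elsewhere.
  size = File.size(path)
ensure
  # Release the file descriptor, then delete the file from disk.
  tmpfile.close
  tmpfile.unlink
end

still_there = File.exist?(path)
```

Tempfile#close! performs both steps in a single call if you prefer a shorter form.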

Using transformers

You can use a transformer to convert each data item before it is written, for example to turn a domain object into a hash keyed by column name.

From examples/cars.rb:

require 'parqueteur'

class Car
  attr_reader :name, :production_year

  def initialize(name, production_year)
    @name = name
    @production_year = production_year
  end
end

class CarParquetConverter < Parqueteur::Converter
  column :name, :string
  column :production_year, :integer

  transform do |car|
    {
      'name' => car.name,
      'production_year' => car.production_year
    }
  end
end

cars = [
  Car.new('Alfa Romeo 75', 1985),
  Car.new('Alfa Romeo 33', 1983),
  Car.new('Audi A3', 1996),
  Car.new('Audi A4', 1994),
  Car.new('BMW 503', 1956),
  Car.new('BMW X5', 1999)
]

# initialize Converter with Parquet GZIP compression mode
converter = CarParquetConverter.new(cars, compression: :gzip)

# print the result as an Arrow::Table
pp converter.to_arrow_table

Output:

#<Arrow::Table:0x7fc1fb24b958 ptr=0x7fc1faedd910>
#     name           production_year
0     Alfa Romeo 75  1985
1     Alfa Romeo 33  1983
2     Audi A3        1996
3     Audi A4        1994
4     BMW 503        1956
5     BMW X5         1999
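To check what a transform block produces for each item without installing the Arrow libraries, the same mapping can be run as a plain lambda. This is only a sketch of the block body from the example above; car_to_row is a hypothetical name and is not part of Parqueteur.

```ruby
class Car
  attr_reader :name, :production_year

  def initialize(name, production_year)
    @name = name
    @production_year = production_year
  end
end

# The body of the `transform` block, extracted into a lambda (hypothetical name).
car_to_row = lambda do |car|
  {
    'name' => car.name,
    'production_year' => car.production_year
  }
end

cars = [Car.new('Audi A3', 1996), Car.new('BMW X5', 1999)]

# Each Car becomes one hash whose keys match the declared column names.
rows = cars.map(&car_to_row)
```

Each resulting hash uses string keys that match the column names declared in the converter, which is the same shape as the rows in the first working example.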

Available Types

Name (Symbol)   Apache Parquet Type
:array          Array
:bigdecimal     Decimal256
:bigint         Int64 (UInt64 with the unsigned: true option)
:boolean        Boolean
:date           Date32
:date32         Date32
:date64         Date64
:decimal        Decimal128
:decimal128     Decimal128
:decimal256     Decimal256
:int32          Int32 (UInt32 with the unsigned: true option)
:int64          Int64 (UInt64 with the unsigned: true option)
:integer        Int32 (UInt32 with the unsigned: true option)
:map            Map
:string         String
:struct         Struct
:time           Time32
:time32         Time32
:time64         Time64
:timestamp      Timestamp
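When choosing between the integer types above, the deciding factor is the value range of each width. The helper below is a hypothetical sketch (integer_column_for is not part of Parqueteur) that picks the narrowest fit for a non-empty list of integers. How the options hash is passed to column, for example column :id, :bigint, unsigned: true, is an assumption based on the table's wording.

```ruby
# Value ranges of the 32-bit and 64-bit integer types.
INT32_RANGE  = (-2**31..2**31 - 1)
UINT32_RANGE = (0..2**32 - 1)
INT64_RANGE  = (-2**63..2**63 - 1)
UINT64_RANGE = (0..2**64 - 1)

# Hypothetical helper: returns [type_symbol, options] for the narrowest
# integer type that holds every value in a non-empty list.
def integer_column_for(values)
  min, max = values.minmax
  fits = ->(range) { range.cover?(min) && range.cover?(max) }

  if fits.call(INT32_RANGE)
    [:integer, {}]
  elsif fits.call(UINT32_RANGE)
    [:integer, { unsigned: true }]
  elsif fits.call(INT64_RANGE)
    [:bigint, {}]
  elsif fits.call(UINT64_RANGE)
    [:bigint, { unsigned: true }]
  else
    raise ArgumentError, 'values do not fit in 64 bits'
  end
end

small    = integer_column_for([1956, 1999])
unsigned = integer_column_for([0, 3_000_000_000])
wide     = integer_column_for([-1, 3_000_000_000])
```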

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/pocketsizesun/parqueteur-ruby.