
Humboldt

Humboldt provides a tool-set on top of Rubydoop to run Hadoop jobs effortlessly both locally and on Amazon EMR. There is also some sugar added on top of the Rubydoop DSL.

Sugar

Type converters

Humboldt adds a number of type converters:

  • binary: Ruby String, stored as Hadoop::Io::BytesWritable
  • encoded: Ruby String, stored as a MessagePack encoded Hadoop::Io::BytesWritable
  • text: Ruby String, stored as Hadoop::Io::Text
  • json: Ruby Hash, stored as Hadoop::Io::Text
  • long: Ruby Integer, stored as Hadoop::Io::LongWritable
  • none: nil, stored as Hadoop::Io::NullWritable

Use them like so:

class Mapper < Humboldt::Mapper
  input :long, :json
  # ...
end
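
For a slightly fuller picture, here is a hypothetical word-count style mapper that pairs the text and long converters. The map block and emit helper are shown as assumptions about the DSL; check the API documentation for the exact form:

class WordCountMapper < Humboldt::Mapper
  # keys are byte offsets (long), values are lines of text
  input :long, :text
  # emit each word (text) with a count (long)
  output :text, :long

  map do |offset, line|
    line.split.each do |word|
      emit(word, 1)
    end
  end
end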

Combine input files

Hadoop does not perform well with many small input files, since by default each file is handled by its own map task. Humboldt bundles an input format that combines files. Due to a bug in Hadoop 1.0.3 (and other versions), Hadoop 2.2.0 is required to use it with input files on S3; see this bug.

Example usage:

Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    input input_paths, format: :combined_text
    set 'mapreduce.input.fileinputformat.split.maxsize', 32 * 1024 * 1024

    # ...
  end
end

mapreduce.input.fileinputformat.split.maxsize controls the maximum size of an input split; the 32 * 1024 * 1024 in the example caps each split at 32 MB.

Secondary sort

A common mapreduce pattern when you need to count uniques is secondary sort, which can be quite a pain to implement. Humboldt makes it really easy: all you need to do is say which indexes of the key to partition and group by.

Example usage:

Rubydoop.configure do |input_paths, output_path|
  job 'my job' do
    # ...

    secondary_sort 0, 10

    # ...
  end
end
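
To make the indexes concrete, the mapper feeding such a job could emit a composite text key where the leading characters hold the grouping value and the rest holds a sort value. The field names, key layout, and the emit call below are illustrative assumptions, not part of the Humboldt API:

class SessionMapper < Humboldt::Mapper
  input :long, :json
  output :text, :text

  map do |offset, event|
    # The slice covered by secondary_sort 0, 10 holds the (padded) user id,
    # so records are partitioned and grouped per user; the timestamp that
    # follows only influences the sort order within each group.
    key = format('%-10s%s', event['user_id'], event['timestamp'])
    emit(key, event['url'].to_s)
  end
end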

See the API documentation for Rubydoop::JobDefinition#secondary_sort for more information on how to use it.

Development setup

Download Hadoop and set up the classpath using

$ rake setup

The default is Hadoop 1.0.3. Specify a different Hadoop release by setting $HADOOP_RELEASE, e.g.

$ HADOOP_RELEASE=hadoop-2.2.0/hadoop-2.2.0 rake setup

Run the tests with

$ rake spec

Release a new gem

Bump the version number in lib/humboldt/version.rb, then run rake gem:release.
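
The version constant follows the usual gem convention; a minimal sketch of lib/humboldt/version.rb (the version string here is only an example) would be:

module Humboldt
  VERSION = '1.1.0'
end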

Copyright

© 2014 Burt AB, see LICENSE.txt (BSD 3-Clause).