Elasticsearch package for the metacrunch ETL toolkit.

metacrunch-elasticsearch


This is the official Elasticsearch package for the metacrunch ETL toolkit.

NOTE: metacrunch-elasticsearch 5.x requires Elasticsearch 7.x. For older versions of Elasticsearch, try metacrunch-elasticsearch 4.x.

Installation

Include the gem in your Gemfile

gem "metacrunch-elasticsearch", "~> 5.0.0"

and run $ bundle install to install it.

Or install it manually

$ gem install metacrunch-elasticsearch

Usage

Note: For working examples of how to use this package, check out our demo repository.

Metacrunch::Elasticsearch::Source

This class provides a metacrunch source implementation that can be used to read data from Elasticsearch into a metacrunch job.

# my_job.metacrunch

# Create an Elasticsearch connection
elasticsearch = Elasticsearch::Client.new(...)

# Set the source
source Metacrunch::Elasticsearch::Source.new(elasticsearch, OPTIONS)

Options

  • :search_options: A hash with search options (including your query) as described in the Elasticsearch search API documentation. The following defaults are set: size: 100, scroll: "1m", sort: ["_doc"]. Depending on your use case, you may need to adjust :size and :scroll for optimal performance.
  • :total_hits_callback: A Proc that gets called with the total number of hits your query will match. You can use this callback to set up a progress bar, for example. Defaults to nil.
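
As an illustration of how the two options fit together, here is a minimal sketch of an OPTIONS hash. Only the option names :search_options and :total_hits_callback come from this README; the index name, query, sizes, and the PROGRESS hash are hypothetical, and the sketch assumes the callback receives the hit count as a plain integer.

```ruby
# Hypothetical place to keep the hit count, e.g. for a progress bar.
PROGRESS = { total: nil }

SOURCE_OPTIONS = {
  search_options: {
    index: "my-index",                   # assumed index name
    body: { query: { match_all: {} } },  # your query goes here
    size: 500,                           # overrides the default of 100
    scroll: "5m"                         # overrides the default of "1m"
  },
  # Assumption: called once with the total number of hits as an integer.
  total_hits_callback: ->(total) { PROGRESS[:total] = total }
}

# In a metacrunch job this hash would be passed as OPTIONS:
#   source Metacrunch::Elasticsearch::Source.new(elasticsearch, SOURCE_OPTIONS)

# Simulate the callback being invoked for a query that matches 1234 documents.
SOURCE_OPTIONS[:total_hits_callback].call(1234)
puts "expecting #{PROGRESS[:total]} documents"
```

Larger :size values mean fewer round trips per scroll, while a longer :scroll keeps the search context alive when processing each batch takes a while.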

Metacrunch::Elasticsearch::Destination

This class provides a metacrunch destination implementation that can be used to write data from a metacrunch job to Elasticsearch.

The data passed to the destination must be in the proper format. You can use a transformation to convert your data before it reaches the destination.

Because Metacrunch::Elasticsearch::Destination uses the Elasticsearch bulk API, the data must match one of the formats accepted by the body parameter, as described in the Elasticsearch Ruby client documentation. Note that the bulk API is not limited to indexing records. You can update or delete records as well.

# my_job.metacrunch

# Transform data into a format that the destination can understand.
# In this example `data` is some hash.
transformation ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id),
      data: data
    }
  }
end
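
To see what this transformation produces, the same lambda can be called directly in plain Ruby, outside a metacrunch job. This is only a sketch: the constant name TO_BULK_ACTION and the sample record are hypothetical.

```ruby
# The same lambda as above, assigned to a constant so it can be called directly.
TO_BULK_ACTION = ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id),  # removes :id from data and uses it as the document id
      data: data              # the remaining fields become the document body
    }
  }
end

RECORD = { id: 42, title: "Moby Dick", year: 1851 }  # hypothetical input record
ACTION = TO_BULK_ACTION.call(RECORD)

puts ACTION[:index][:_id]          # the id that was taken from the record
puts ACTION[:index][:data].inspect # the record without its :id key
```

Note that data.delete(:id) mutates the incoming hash, so the :id key is no longer part of the indexed document body.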

Calling Elasticsearch for every single record is inefficient. Instead, we can use a transformation with a buffer to collect records into bulks. This example uses a buffer size of 10. In production environments, and depending on your data, larger buffers may be useful.

# my_job.metacrunch

transformation ->(data) { data }, buffer: 10
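
The effect of buffering can be sketched in plain Ruby with each_slice. This is an analogy rather than metacrunch's internal implementation; it assumes the buffer passes records on in arrays of up to 10, with a final smaller array for the remainder. The record count is hypothetical.

```ruby
# 25 hypothetical records flowing through the job.
RECORDS = (1..25).map { |i| { id: i } }

# Group the records the way a buffer of size 10 would.
BATCHES = RECORDS.each_slice(10).to_a

# One bulk request per batch instead of one request per record.
puts BATCHES.map(&:size).inspect  # prints [10, 10, 5]
puts "#{BATCHES.size} bulk requests instead of #{RECORDS.size} single requests"
```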

With these transformations in place, you can use the Metacrunch::Elasticsearch::Destination class as a destination.

# my_job.metacrunch

# Write data into Elasticsearch (OPTIONS is optional)
destination Metacrunch::Elasticsearch::Destination.new(elasticsearch, OPTIONS)

Options

  • :result_callback: A Proc that gets called with the result of the bulk operation. Defaults to nil.
  • :bulk_options: A hash of options for the Elasticsearch bulk API, as described in the Elasticsearch Ruby client documentation. A body set here is ignored. Defaults to {}.
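
A minimal sketch of these options, using :result_callback to collect the ids of failed items. It assumes the callback receives the bulk response as a hash with string keys ("errors", "items"), as the Elasticsearch bulk API returns it. FAILED_IDS, the refresh setting, and the simulated response are hypothetical.

```ruby
# Hypothetical collector for the ids of documents that failed.
FAILED_IDS = []

DESTINATION_OPTIONS = {
  result_callback: ->(result) do
    # The bulk response sets "errors" to true if at least one item failed.
    next unless result["errors"]

    result["items"].each do |item|
      # Each item is a one-entry hash such as { "index" => { "_id" => ..., ... } }.
      _action, info = item.first
      FAILED_IDS << info["_id"] if info["error"]
    end
  end,
  bulk_options: { refresh: true }  # example only; any bulk API option except body
}

# In a metacrunch job this hash would be passed as OPTIONS:
#   destination Metacrunch::Elasticsearch::Destination.new(elasticsearch, DESTINATION_OPTIONS)

# Simulated bulk response with one success and one failure.
fake_result = {
  "errors" => true,
  "items" => [
    { "index" => { "_id" => "1", "status" => 201 } },
    { "index" => { "_id" => "2", "status" => 400, "error" => { "type" => "mapper_parsing_exception" } } }
  ]
}
DESTINATION_OPTIONS[:result_callback].call(fake_result)
puts FAILED_IDS.inspect  # prints ["2"]
```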

License

metacrunch-elasticsearch is available on GitHub under the MIT license.