
Elasticsearch::Extensions::Documents

A service wrapper to manage Elasticsearch index documents. Built on the elasticsearch-ruby gem.

Installation

Add this line to your application's Gemfile:

gem 'elasticsearch-documents'

And then execute:

$ bundle

Or install it yourself as:

$ gem install elasticsearch-documents

Configuration

Before making any calls to Elasticsearch, you need to configure the Documents extension. Configuration options namespaced under 'client' are passed through to Elasticsearch::Client.

ES_MAPPINGS = {
  user: {
    _all: { analyzer: "snowball" },
    properties: {
      id:   { type: "integer", index: :not_analyzed },
      name: { type: "string", analyzer: "snowball" },
      bio:  { type: "string", analyzer: "snowball" },
      updated_at:   { type: "date", include_in_all: false }
    }
  }
}

ES_SETTINGS = {
  index: {
    number_of_shards: 3,
    number_of_replicas: 2,
  }
}

Elasticsearch::Extensions::Documents.configure do |config|
  config.index_name = 'test_index'              # the name of your index
  config.mappings   = ES_MAPPINGS               # a hash containing your index mappings
  config.settings   = ES_SETTINGS               # a hash containing your index settings
  config.client.url = 'http://example.com:9200' # your elasticsearch endpoint
  config.client.logger = Rails.logger           # the logger to use. (defaults to Logger.new(STDERR))
end

If you are using this extension with a Rails application, this configuration could live in an initializer such as config/initializers/elasticsearch.rb.

Usage

The Documents extension builds on the elasticsearch-ruby gem, adding conventions and helper classes to aid in the serialization and flow of data between your application code and the elasticsearch-ruby client. To accomplish this, application data models are serialized into instances of Document classes. These Document instances are then indexed and searched with wrappers around the elasticsearch-ruby client.

Saving a Document

Assume your application has a User model. To index User records, define a Document subclass that maps each User record to the search index mapping.

class UserDocument < Elasticsearch::Extensions::Documents::Document
  indexes_as_type :user

  def as_hash
    {
      name:   object.name,
      title:  object.title,
      bio:    object.bio,
    }
  end

end

user = User.new  # could be a PORO or an ActiveRecord model
user_doc = UserDocument.new(user)

index = Elasticsearch::Extensions::Documents::Index.new
index.index(user_doc)

Deleting a Document

Deleting a document is just as easy:

user_doc = UserDocument.new(user)
index.delete(user_doc)

Searching for Documents

Create classes that include Elasticsearch::Extensions::Documents::Queryable, then implement an #as_hash method to define the JSON structure of an Elasticsearch query using the Query DSL. The hash should be formatted so it can be passed directly to the Elasticsearch::Transport::Client#search method.

class GeneralSiteSearchQuery
  include Elasticsearch::Extensions::Documents::Queryable

  def as_hash
    {
      index: 'test_index',
      body: {
        query: {
          query_string: {
            analyzer: "snowball",
            query:    "something to search for",
          }
        }
      }
    }
  end
end

You could elaborate on this class with a constructor that takes the search term and other options specific to your use case as arguments, as sketched below. The important part is to define the #as_hash method.
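
For instance, a parameterized variant might look like the following sketch. The term and analyzer parameters (and their defaults) are purely illustrative; the library only requires that #as_hash be defined.

class GeneralSiteSearchQuery
  include Elasticsearch::Extensions::Documents::Queryable

  # The default term keeps the no-argument usage below working;
  # both parameters are illustrative, not required by the library.
  def initialize(term = "something to search for", analyzer: "snowball")
    @term     = term
    @analyzer = analyzer
  end

  def as_hash
    {
      index: 'test_index',
      body: {
        query: {
          query_string: {
            analyzer: @analyzer,
            query:    @term,
          }
        }
      }
    }
  end
end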

You can then call the #execute method to run the query. The full Elasticsearch JSON response is returned wrapped in a Hashie::Mash instance, allowing the results to be accessed with object notation instead of hash notation.

query = GeneralSiteSearchQuery.new
results = query.execute
results.hits.total
results.hits.max_score
results.hits.hits.each { |hit| puts hit._source }

You can also easily define a custom result format by overriding the #parse_results method in your Queryable class.

class GeneralSiteSearchQuery
  include Elasticsearch::Extensions::Documents::Queryable

  def as_hash
    # your query structure here
  end

  def parse_results(raw_results)
    CustomQueryResults.new(raw_results)
  end
end

Here CustomQueryResults is passed the Hashie::Mash results object and can parse and coerce that data into whatever structure is most useful for your application.
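
A minimal sketch of what such a class might look like (CustomQueryResults is hypothetical and entirely application-defined):

class CustomQueryResults
  include Enumerable

  attr_reader :total, :max_score

  # raw_results is the Hashie::Mash wrapping the raw Elasticsearch response
  def initialize(raw_results)
    @total     = raw_results.hits.total
    @max_score = raw_results.hits.max_score
    @records   = raw_results.hits.hits.map(&:_source)
  end

  def each(&block)
    @records.each(&block)
  end
end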

Index Management

The Indexer uses the Elasticsearch::Extensions::Documents.configuration to create the index with the configured #index_name, #mappings, and #settings.

indexer = Elasticsearch::Extensions::Documents::Indexer.new
indexer.create_index
indexer.drop_index

The Indexer can #bulk_index documents, sending multiple documents to Elasticsearch in a single request. This is typically much more efficient than issuing one request per document when programmatically re-indexing entire sets of documents.

user_documents = users.collect { |user| UserDocument.new(user) }
indexer.bulk_index(user_documents)

The Indexer's #reindex method accepts a block and encapsulates the process of dropping the old index, creating a new index with the latest configured mappings and settings, and bulk indexing a set of documents into the newly created index. The block should contain the code that builds your documents in batches and passes them to the Indexer's #bulk_index method.

indexer.reindex do |indexer|

  # For ActiveRecord you may want to find_in_batches
  User.find_in_batches(batch_size: 500) do |batch|
    documents = batch.map { |user| UserDocument.new(user) }
    indexer.bulk_index(documents)
  end

  # Otherwise you can add whatever logic you need to bulk index your documents
  documents = users.map { |model| UserDocument.new(model) }
  indexer.bulk_index(documents)
end

By default, the call to #reindex creates the index if it does not yet exist. If the index already exists it is left in place, and the documents provided are added or updated as needed. You can force the index to be dropped and recreated during the reindex by passing the force_create: true option:

indexer.reindex(force_create: true) do |indexer|
  # bulk index those documents into a fresh index
end

Different reindexing strategies may be added in the future to allow "zero downtime reindexing". This could be accomplished with timestamped index names and index aliases.
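
As a rough illustration of that idea (not a feature of this gem), an alias swap using the underlying elasticsearch-ruby client might look like the following. All index and alias names here are illustrative; in this scheme the concrete indices carry timestamps while applications only ever query the alias.

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://example.com:9200')

# Hypothetical sketch: build a fresh timestamped index, bulk index into it,
# then atomically repoint the alias that applications actually query.
new_index = "test_index_#{Time.now.strftime('%Y%m%d%H%M%S')}"
old_index = 'test_index_20140101000000' # the previous index (illustrative)

client.indices.create(index: new_index)
# ... bulk index documents into new_index here ...

client.indices.update_aliases(body: {
  actions: [
    { remove: { index: old_index, alias: 'test_index' } },
    { add:    { index: new_index, alias: 'test_index' } }
  ]
})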

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request