Schema Tools for OpenSearch and Elasticsearch
Features
- Specify index settings and mappings in simple .json files.
- Migrate and reindex to a new index with zero downtime using aliases.
- Download schemas from existing aliases or indices.
- Create new aliases with sample schemas.
- Manage painless scripts independently from schema migrations.
Quick start
Install this Ruby gem.
gem install schema-tools
Add (or edit) a file called Rakefile and add this line:
require 'schema_tools'
Configuration
Set the connection URL for your OpenSearch or Elasticsearch instance:
export OPENSEARCH_URL=http://localhost:9200
# or
export ELASTICSEARCH_URL=https://your-cluster.com
For authenticated instances, optionally set username and password:
export ELASTICSEARCH_USERNAME=your_username
export ELASTICSEARCH_PASSWORD=your_password
# or
export OPENSEARCH_USERNAME=your_username
export OPENSEARCH_PASSWORD=your_password
View available rake tasks
rake -T | grep " schema:"
Available schema tasks:
- schema:migrate[alias_name] - Migrate to a specific alias schema or migrate all schemas
- schema:new - Create a new alias with sample schema
- schema:close[name] - Close an index or alias
- schema:delete[name] - Hard delete an index (only works on closed indexes) or delete an alias
- schema:drop[alias_name] - Delete an alias (does not delete the index)
- schema:download - Download schema from an existing alias or index
- schema:alias - Create an alias for an existing index
- schema:seed - Seed data to a live index
- schema:diff - Compare all schemas to their corresponding downloaded alias settings and mappings
Download an existing schema
Run rake schema:download
to download a schema from an existing alias or index:
$ rake schema:download
# Aliases pointing to 1 index:
# 1. products -> products-20241201120000
# 2. users -> users-20241201120000
# Indexes not part of any aliases:
# 1. old-index
# 2. temp-index
# Please choose an alias or index to download:
# Enter 'alias:<name>' for an alias or 'index:<name>' for an index:
The task will generate schema definition files in a folder layout like this:
schemas/products # Folder name matches the alias name
settings.json # OpenSearch/Elasticsearch index settings
mappings.json # OpenSearch/Elasticsearch index mappings
reindex.painless # Optional reindexing data transformation logic
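As a rough illustration of what the two JSON files in this layout might hold, the sketch below writes a minimal settings.json and mappings.json for an alias. The field names (name, sku, created_at), the shard and replica counts, and the helper name write_sample_schema are all hypothetical, and the exact shape the gem expects may differ; rake schema:new generates the real sample files.

```ruby
require 'json'
require 'fileutils'

# Hypothetical example content; values are illustrative only.
SAMPLE_SETTINGS = {
  'index' => { 'number_of_shards' => 1, 'number_of_replicas' => 1 }
}.freeze

SAMPLE_MAPPINGS = {
  'properties' => {
    'name' => { 'type' => 'text' },
    'sku' => { 'type' => 'keyword' },
    'created_at' => { 'type' => 'date' }
  }
}.freeze

# Writes settings.json and mappings.json under <root>/<alias_name>
# and returns the folder path.
def write_sample_schema(root, alias_name)
  dir = File.join(root, alias_name)
  FileUtils.mkdir_p(dir)
  File.write(File.join(dir, 'settings.json'), JSON.pretty_generate(SAMPLE_SETTINGS))
  File.write(File.join(dir, 'mappings.json'), JSON.pretty_generate(SAMPLE_MAPPINGS))
  dir
end
```

Calling write_sample_schema('schemas', 'products') would produce a schemas/products folder matching the layout above, without the optional reindex.painless file.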
Migrate schemas
To migrate your OpenSearch/Elasticsearch indexes to the latest versions defined in the schemas/
folder:
rake schema:migrate
To migrate a specific alias:
rake 'schema:migrate[products]'
Create a new alias
Run rake schema:new
to create a new alias with an index and a sample schema:
$ rake schema:new
# Enter a new alias name:
# products
# ✓ Created index 'products-20241201120000' with alias 'products'
# ✓ Sample schema created at schemas/products
# - settings.json
# - mappings.json
Sample App
A sample app that uses schema-tools is available at: https://github.com/richkuz/schema-tools-sample-app
Directory structure reference
Example directory structure with multiple aliases:
schemas/products
settings.json
mappings.json
reindex.painless # Optional reindexing data transformation logic
schemas/users
settings.json
mappings.json
Each schema folder name matches the name of an alias.
Other settings and tasks
Use rake schema:seed
to seed an index with sample documents that conform to your schema.
Use DRYRUN
to simulate but not apply any POST/PUT/DELETE operations to your index:
DRYRUN=true rake schema:migrate
Use INTERACTIVE
to prompt to proceed before applying any POST/PUT/DELETE operations to your index:
INTERACTIVE=true rake schema:migrate
Use REINDEX_BATCH_SIZE
to control the batch size for reindexing operations (default: 1000):
REINDEX_BATCH_SIZE=500 rake schema:migrate
Use REINDEX_REQUESTS_PER_SECOND
to throttle reindexing operations (default: -1, no throttling):
REINDEX_REQUESTS_PER_SECOND=100 rake schema:migrate
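The defaults above match the native _reindex API, where source.size sets the number of documents per batch and requests_per_second is a URL query parameter. The sketch below shows how the two variables could translate into a request; it is an assumption about the mapping, not the gem's actual code, and the helper name reindex_request is hypothetical.

```ruby
# Builds the path and body for a batched, optionally throttled _reindex call.
# env is any object that responds to fetch(key, default), such as ENV or a Hash.
def reindex_request(source_index, dest_index, env = ENV)
  batch_size = Integer(env.fetch('REINDEX_BATCH_SIZE', '1000'))
  requests_per_second = env.fetch('REINDEX_REQUESTS_PER_SECOND', '-1')
  {
    # -1 disables throttling in the native API.
    path: "/_reindex?requests_per_second=#{requests_per_second}",
    body: {
      'source' => { 'index' => source_index, 'size' => batch_size },
      'dest' => { 'index' => dest_index },
      'conflicts' => 'proceed'
    }
  }
end
```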
Client responsibilities during breaking migrations
Clients MUST retry failed creates/updates/deletes for up to about 1 minute.
Writes will be temporarily disabled for a few seconds during the procedure to prevent data loss.
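One way a client could meet the retry requirement is a small wrapper that retries a write with capped exponential backoff until roughly a minute has elapsed. This is a minimal sketch: the helper name with_retries, the delay values, and the injectable sleeper and clock (included only to make the helper testable) are all assumptions, not part of schema-tools.

```ruby
# Retries the given block until it succeeds or max_elapsed seconds have passed.
# Delays double on each attempt, capped at max_delay.
def with_retries(max_elapsed: 60.0, base_delay: 0.5, max_delay: 5.0,
                 sleeper: ->(seconds) { sleep(seconds) },
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started = clock.call
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up and re-raise the last error once the time budget is spent.
    raise if clock.call - started >= max_elapsed

    sleeper.call([base_delay * (2**(attempt - 1)), max_delay].min)
    retry
  end
end
```

A client would wrap each create, update, or delete call in with_retries { ... } and raise from the block whenever the cluster rejects the write.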
Clients MUST read and write to an alias. Clients must NOT write directly to an index.
To prevent downtime, the migration procedure only operates on aliased indexes.
Run rake schema:alias
to create a new alias pointed at an index.
Hard-deletes during reindexing will NOT affect the migrated index.
Clients can mitigate the lack of hard-delete support in two ways:
- (Recommended) Implement soft-deletes (e.g. set deleted_at) with a recurring hard-delete job. Run the hard-delete job after reindexing.
- Use RBAC to deny all DELETE operations during reindexing and implement continuous retries on failed DELETE operations to ensure eventual consistency.
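The recommended soft-delete approach could look roughly like the sketch below, which only builds request bodies. It assumes a deleted_at date field in the mapping; the helper names are hypothetical. The soft delete rewrites the full document rather than sending a partial update, because a partial update needs the document to already exist in the index that receives the write.

```ruby
require 'time'

# Full document to index again (e.g. PUT /<alias>/_doc/<id>) with deleted_at set.
def soft_deleted_document(document, now = Time.now)
  document.merge('deleted_at' => now.utc.iso8601)
end

# Wraps a query so soft-deleted documents are excluded from search results.
def exclude_soft_deleted(query)
  {
    'bool' => {
      'must' => [query],
      'must_not' => [{ 'exists' => { 'field' => 'deleted_at' } }]
    }
  }
end

# Body for POST /<alias>/_delete_by_query, for the recurring hard-delete job.
# Removes documents soft-deleted at or before the cutoff time.
def hard_delete_body(cutoff)
  { 'query' => { 'range' => { 'deleted_at' => { 'lte' => cutoff.utc.iso8601 } } } }
end
```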
During reindexing, searches will return duplicate results for updated documents.
After reindexing, only the latest update will appear in search results.
Clients can mitigate seeing duplicate documents in two ways:
- (Recommended) Clients may hide duplicate documents by implementing collapse on all searches. collapse incurs a small performance cost to each query. Clients may choose to collapse only when the alias is configured to read from multiple indices. For a reference implementation of conditionally de-duping using a collapse query while reindexing, see: https://github.com/richkuz/schema-tools-sample-app/blob/fc60718f5784e52d55b0c009e863f8b1c8303662/demo_script.rb#L255
- Use RBAC to deny all UPDATE operations during reindexing and implement continuous retries on failed UPDATE operations to ensure eventual consistency. This approach is suitable only for clients that can tolerate not seeing documents updated during reindexing.
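A conditional collapse could be sketched as follows. The response of GET /_alias/<alias_name> is a hash keyed by index name, so counting its keys shows whether the alias currently spans more than one index. The collapse field doc_id is a hypothetical keyword field holding a stable document identifier, and the helper names are assumptions; this is not the sample app's implementation. Which duplicate is returned depends on the sort order of the search.

```ruby
# alias_response: parsed JSON from GET /_alias/<alias_name>,
# a hash whose keys are the names of the indexes in the alias.
def reads_multiple_indices?(alias_response)
  alias_response.keys.size > 1
end

# Adds collapse to the search body only while the alias spans several indexes.
def search_body(query, alias_response, collapse_field: 'doc_id')
  body = { 'query' => query }
  body['collapse'] = { 'field' => collapse_field } if reads_multiple_indices?(alias_response)
  body
end
```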
Why there are duplicate updated documents during reindexing:
- The migration task configures an alias to read from both the original index and a catchup index, and write to the catchup index.
- UPDATE operations produce an additional document in the catchup index.
- When clients _search the alias for an updated document, they will see two results: one result from the original index, and one result from the catchup index.
Theoretical Alternatives for UPDATE and DELETE
In theory, the migrate task could support alternative reindexing modes when constrained by native Elasticsearch/OpenSearch capabilities.
- Preserve Hard-Deletes and Show All Duplicates
The migrate task could support clients that require hard-deletes during reindexing by adding the new index into the alias during migration. Clients would have to use _refresh and delete_by_query when deleting documents to ensure documents are deleted from all indexes in the alias during reindexing. If using DELETE to delete a single document from an alias, clients might delete from the wrong index and receive a successful response containing "result: not_found". The new index would not reflect such a deletion. With this approach, clients would see duplicate documents in search results for all documents during reindexing, not just updated documents. Clients could hide duplicate documents by implementing collapse on all searches.
- Ignore Hard-Deletes and Hide All Duplicates
Some clients might not be able to filter out duplicate documents during reindexing. The migrate task could support such clients by not returning any INSERTED or UPDATED documents until after the reindexing completes. This approach would not support hard-deletes. To support re-updating the same document during reindexing, clients would have to find documents to upsert based on a consistent ID, not based on a changing field.
Diagnosing a failed or aborted migration
If a migration fails or aborts, check status logs in the index named #{alias_name}-#{timestamp}-migration-log
Transform data during migration
Change the data when migrating to a new schema via the reindex.painless
script. For example, when renaming a field, the reindex.painless
script can specify how to modify data when migrating.
reindex.painless
runs one time when reindexing into a new index.
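For the field-rename case, a reindex.painless script might look like the Painless source embedded below, which moves a hypothetical name field to title. The Ruby wrapper shows one way such a script could be attached to a native _reindex request body; how the gem actually passes the script is an assumption here, and the helper name is hypothetical.

```ruby
# Hypothetical contents for schemas/products/reindex.painless:
# rename the "name" field to "title" when it is present.
RENAME_SCRIPT = <<~PAINLESS
  if (ctx._source.containsKey('name')) {
    ctx._source.title = ctx._source.remove('name');
  }
PAINLESS

# Builds a _reindex body that runs the script on every copied document.
def reindex_body_with_script(source_index, dest_index, script_source)
  {
    'source' => { 'index' => source_index },
    'dest' => { 'index' => dest_index },
    'conflicts' => 'proceed',
    'script' => { 'lang' => 'painless', 'source' => script_source }
  }
end
```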
Manage painless scripts
- Download, edit, and upload centrally managed painless scripts.
- Version control painless scripts alongside code
- Manage scripts independently from schema migrations
- Easily sync scripts between different environments
- Track changes to scripts over time
Download painless scripts from cluster
To download all painless scripts from a cluster and store them in the painless_scripts/
directory (configurable via PAINLESS_SCRIPTS_PATH
environment variable):
rake painless_scripts:download
Upload painless scripts to cluster
To upload all *.painless script files from the local painless_scripts directory into the cluster:
rake painless_scripts:upload
Delete a painless script from cluster
To delete a specific painless script from the cluster:
rake 'painless_scripts:delete[script_name]'
This will:
- Delete the specified script from the cluster
- Accept script names with or without the .painless extension
- Handle cases where the script doesn't exist gracefully
Apply a schema change to Staging and Production
Run GitHub Actions for your branch to prepare a given environment. The actions use the schema:migrate
task underneath.
GitHub Actions:
- OpenSearch Staging Migrate
- OpenSearch Production Migrate
Delete an index
Run rake 'schema:close[index_name]'
to close an index. This will prevent reads and writes to the index. Verify that the application can operate with the index in a closed state before deleting it.
Run rake 'schema:delete[index_name]'
to hard-delete an index. For safety, this task only hard-deletes indexes that are closed.
Run rake 'schema:close[alias_name]'
to close all indexes in an alias.
Run rake 'schema:delete[alias_name]'
to delete an alias and leave its indexes untouched.
Run rake 'schema:drop[alias_name]'
to delete an alias (does not delete the underlying index).
GitHub Actions:
- OpenSearch Staging Close Index
- OpenSearch Production Close Index
- OpenSearch Staging Delete Index
- OpenSearch Production Delete Index
How migrations work
When possible, rake schema:migrate
will update settings and mappings in-place on an aliased index, without reindexing. Only breaking changes require a reindex.
Migrating breaking changes requires careful orchestration of reads and writes to ensure documents that are created/updated during the migration are not lost.
Hard-delete operations are not preserved during a breaking migration. See "Client responsibilities" above for how to mitigate this.
Use case:
- I have an alias products pointing at index products-20250301000000.
- I have heavy reads and writes with 100M+ documents in the index.
- I want to reindex products-20250301000000 into a new index and update the products alias to reference it, without losing any creates/updates during the process.
The rake schema:migrate task solves this use case through the following procedure.
First, some terms:
- alias_name: Alias containing the index to migrate (e.g. products)
- current_index: First and only index in the alias (e.g. products-20250301000000)
- new_index: Final canonical index into which to migrate current_index (e.g. products-20250601000000)
- catchup1_index: Temp index to preserve writes during reindex (e.g. products-20250601000000-catchup-1)
- catchup2_index: Temp index to preserve writes while flushing catchup1_index (e.g. products-20250601000000-catchup-2)
- log_index: Index to log the migration state, not stored with alias_name (e.g. products-20250601000000-migration-log)
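The naming pattern in these examples can be expressed as a small helper. This is a sketch inferred from the example names only: the use of UTC for the timestamp, the struct, and the function name are assumptions.

```ruby
MigrationNames = Struct.new(
  :alias_name, :current_index, :new_index,
  :catchup1_index, :catchup2_index, :log_index,
  keyword_init: true
)

# Derives every index name used by a migration from the alias name
# and a timestamp formatted as YYYYMMDDHHMMSS.
def migration_names(alias_name, current_index, now = Time.now.utc)
  new_index = "#{alias_name}-#{now.strftime('%Y%m%d%H%M%S')}"
  MigrationNames.new(
    alias_name: alias_name,
    current_index: current_index,
    new_index: new_index,
    catchup1_index: "#{new_index}-catchup-1",
    catchup2_index: "#{new_index}-catchup-2",
    log_index: "#{new_index}-migration-log"
  )
end
```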
SETUP
Create log_index
to log the migration state.
- The migration logs when it starts and completes a step along with a description.
STEP 1
Attempt to reindex 1 document to a throwaway index to catch obvious configuration errors and abort early if possible.
STEP 2
Create catchup1_index
using the new schema.
- This index will preserve writes during the reindex.
STEP 3
Configure alias_name to only write to catchup1_index and read from current_index and catchup1_index.
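In terms of the native alias API, this step corresponds to a single POST /_aliases request in which exactly one index carries is_write_index: true. The sketch below builds such a body; the helper name is hypothetical and the gem's actual request may differ.

```ruby
# Builds a POST /_aliases body that attaches every listed index to the alias
# and marks only write_index as the write target.
def alias_update_body(alias_name, indexes, write_index)
  actions = indexes.map do |index|
    {
      'add' => {
        'index' => index,
        'alias' => alias_name,
        'is_write_index' => index == write_index
      }
    }
  end
  { 'actions' => actions }
end
```

For the example names used here, alias_update_body('products', ['products-20250301000000', 'products-20250601000000-catchup-1'], 'products-20250601000000-catchup-1') keeps both indexes readable while routing writes to the catchup index.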
STEP 4
Create new_index using the new schema.
Reindex current_index into new_index:
POST _reindex
{
"source": { "index": "#{current_index}" },
"dest": { "index": "#{new_index}" },
"conflicts": "proceed",
"refresh": false
}
STEP 5
Create catchup2_index
using the new schema.
- This index ensures a place for ongoing writes while flushing catchup1_index.
STEP 6
Configure alias_name to only write to catchup2_index and continue reading from current_index and catchup1_index.
STEP 7
Reindex catchup1_index into new_index.
- Merge the first catchup index into the new canonical index.
STEP 8
Configure alias_name so there are NO write indexes.
- This guarantees that no writes can sneak into an obsolete catchup index during the second (quick) merge.
- Any write operations will fail during this time with:
"reason": "Alias [FOO] has more than one index associated with it [...], can't execute a single index op"
- Clients must retry any failed writes.
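A client could recognize this specific failure and treat it as retryable with a check like the one below. It assumes the usual JSON error shape with the message under error.reason and matches on the text quoted above; the constant and helper names are hypothetical, and matching on message text is brittle across versions.

```ruby
require 'json'

NO_WRITE_INDEX_MARKER = "can't execute a single index op"

# Returns true when a response body looks like the "no write index" error,
# which clients should retry rather than treat as a permanent failure.
def no_write_index_error?(response_body)
  parsed = JSON.parse(response_body)
  error = parsed.is_a?(Hash) ? parsed['error'] : nil
  reason = error.is_a?(Hash) ? error['reason'].to_s : error.to_s
  reason.include?(NO_WRITE_INDEX_MARKER)
rescue JSON::ParserError
  false
end
```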
STEP 9
Reindex catchup2_index into new_index.
- Final sync to merge the second catchup index into the new canonical index.
STEP 10
Configure alias_name to write to and read from new_index only.
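With the native alias API, the final swap can be sent as one POST /_aliases request that removes the old indexes and adds the new one; the actions in a single request are applied atomically, so readers never see the alias without an index. The helper below is a sketch with a hypothetical name, not the gem's code.

```ruby
# Builds a POST /_aliases body that detaches old_indexes from the alias
# and makes new_index the only read and write target.
def final_swap_body(alias_name, new_index, old_indexes)
  removes = old_indexes.map do |index|
    { 'remove' => { 'index' => index, 'alias' => alias_name } }
  end
  add = { 'add' => { 'index' => new_index, 'alias' => alias_name, 'is_write_index' => true } }
  { 'actions' => removes + [add] }
end
```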
- Writes resume to the single new index. All data and deletes are consistent.
STEP 11
Close unused indexes to avoid accidental writes.
- Close catchup1_index
- Close catchup2_index
- Close current_index
Operation complete.
Users can safely delete closed indexes anytime after they are closed.