An ETL Ecosystem for Derivative Processing.

Table of Contents generated with DocToc

  • DerivativeRodeo
    • Process Life Cycle
    • Concepts
      • Common Storage
      • Related Files
      • Sequence Diagram
    • Installation
    • Usage
    • Technical Overview of the DerivativeRodeo
      • Generators
        • Interface(s)
        • Supported Generators
        • Registered Generators
      • Storage Locations
        • Supported Storage Locations
        • Templates
    • Development
      • Logging in Test Environment
    • Contributing

DerivativeRodeo

“This ain’t my first rodeo.” (American slang for “I’m prepared for what comes next.”)

The DerivativeRodeo "moves" files from one storage location (e.g. input) to one or more storage locations (e.g. output) via a generator.

Process Life Cycle

In the case of an input storage location (e.g. input_location), we expect that the underlying file pointed at by the input storage location exists. After all, we can't move what we don't have.

In the case of an output storage location (e.g. output_location), we expect that the underlying file will exist after the generator has completed. The file at the output storage location could already exist, or we might need to generate it.

There is also the concept of the pre_processed storage location. When the pre_processed storage location exists for the given input, we copy that pre_processed file to the output location and skip running the derivative generator on the input storage location. In other words, if we've already done the derivation elsewhere, use that.
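The decision described above can be sketched in plain Ruby. This is a minimal illustration and not the gem's implementation: ensure_output and its keyword arguments are hypothetical names, local paths stand in for storage locations, and the order of the checks (existing output first, then pre_processed, then generation) is an assumption.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of the gem's API): decides how the output
# file comes to exist. The block stands in for a real generator.
def ensure_output(input_path:, output_path:, pre_processed_path:)
  # The input must exist; we can't move what we don't have.
  raise ArgumentError, "missing input #{input_path}" unless File.exist?(input_path)

  # Assumption: an existing output means there is nothing left to do.
  return :already_exists if File.exist?(output_path)

  FileUtils.mkdir_p(File.dirname(output_path))
  if File.exist?(pre_processed_path)
    # The derivation was already done elsewhere; reuse it.
    FileUtils.cp(pre_processed_path, output_path)
    :copied_from_pre_processed
  else
    # Run the stand-in generator against the input's content.
    File.write(output_path, yield(File.read(input_path)))
    :generated
  end
end
```

Calling ensure_output with a block such as { |body| body.upcase } returns a symbol naming the branch that was taken, which makes the three cases easy to tell apart.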

During the generator's process, we need to have a working copy of both the input and output file. This is done by creating a temporary file.

In the case of the input, the creation of that temporary file involves getting the file from the input storage location. In the case of the output, we create a temporary file that the output storage location then knows how to move to the resulting place.

Storage Lifecycle

The above Storage Lifecycle diagram is as follows: input location to input tmp file to generator to output tmp file to output location.

Note: We've designed and implemented the data life cycle to automatically clean up the temporary files as the generator completes. In this way we use the smallest working space possible, a design decision that helps run DerivativeRodeo within distributed clusters (e.g. AWS Serverless).
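The flow of input location to input tmp file to generator to output tmp file to output location, with automatic clean-up, might look roughly like the following. It is a sketch under assumptions: with_working_copies is a hypothetical name, local paths stand in for storage locations, and Ruby's standard Tempfile.create (block form) does the clean-up.

```ruby
require "tempfile"
require "fileutils"
require "tmpdir"

# Hypothetical sketch: copy the input into a temporary working file, let the
# given block (the "generator") write a temporary output file, then copy that
# output to its final place. Both temporary files are removed when the
# Tempfile.create blocks exit.
def with_working_copies(input_path, output_path)
  Tempfile.create("input") do |input_tmp|
    Tempfile.create("output") do |output_tmp|
      input_tmp.binmode
      input_tmp.write(File.binread(input_path))
      input_tmp.flush

      # The generator only ever sees the two temporary paths.
      yield(input_tmp.path, output_tmp.path)

      FileUtils.mkdir_p(File.dirname(output_path))
      FileUtils.cp(output_tmp.path, output_path)
    end
  end
  output_path
end
```

A caller passes a block that reads the first path and writes the second, for example one that writes File.read(in_tmp).reverse to out_tmp. Once the method returns, only the file at output_path remains.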

Concepts

Overview

The PlantUML Text for the Overview Diagram
@startuml
!theme amiga

cloud "Source 1" as S1
cloud "Source 2" as S2
cloud "Source 3" as S3

storage "IMAGEs" as IMAGEs
storage "HOCRs" as HOCRs
storage "TXTs" as TXTs

control Preprocess as G1

S1 -down-> G1
S2 -down-> G1
S3 -down-> G1

G1 -down-> IMAGEs
G1 -down-> HOCRs
G1 -down-> TXTs

control Import as I1

IMAGEs -down-> I1
HOCRs -down-> I1
TXTs -down-> I1

package FileSet as FileSet1 {
	file Image1
	file Hocr1
	file Txt1
}
package FileSet as FileSet2 {
	file Image2
	file Hocr2
	file Txt2
}

I1 -down-> FileSet1
I1 -down-> FileSet2

@enduml

Common Storage

In this case, common storage could mean the storage where we're writing all pre-processing of files. Or it could mean the storage where we're writing for application access (e.g. Fedora Commons for a Hyrax application).

In other words, the DerivativeRodeo is part of moving files from one location to another, and ensuring that at each step we have all of the expected files we want.

Related Files

This is not strictly related to Hyrax's FileSet (that is, a set of files in which one is considered the original and all others are derivatives of the original).

However, it is helpful to think in those terms: files that have a significant relation to each other, one derived from the other. For example, an original PDF and its extracted text would be two significantly related files.

Sequence Diagram

The PlantUML Text for the Sequence Diagram
@startuml
!theme amiga

actor Instigator
database S3
control AWS
queue SQS
control SpaceStone
control DerivativeRodeo
collections From
collections To
Instigator -> S3 : "Upload bucket\nof files associated\n with FileSet"
S3 -> AWS : "AWS enqueues\nthe bucket"
AWS -> SQS : "AWS adds to SQS"
SQS -> SpaceStone : "SQS invokes\nSpaceStone method"
SpaceStone -> DerivativeRodeo : "SpaceStone calls\n DerivativeRodeo"
DerivativeRodeo --> S3 : "Request file for\ntemporary processing"
S3 --> From : "Write requested\n file to\ntemporary storage"
DerivativeRodeo <-- From
DerivativeRodeo -> To : "Generate derivative\n writing to local\n processing storage."
To --> S3 : "Write file\n to S3 Bucket"
DerivativeRodeo <-- To : "Return to DerivativeRodeo\n with generated URIs"
SpaceStone <- DerivativeRodeo : "Return generated\n URIs"
SpaceStone -> SQS : "Optionally enqueue\nfurther work"
@enduml

Given a single original file in a previous home, we copy that original file (and its derivatives) through the following steps:

  • Copy the original from its previous home to S3.
  • Copy the file from S3 to local temporary storage (for processing).
  • Create a temporary derivative file based on the existing file.
  • Copy the temporary derivative file to S3.

Installation

Add this line to your application's Gemfile:

gem 'derivative-rodeo'

For historical reasons the gem name is derivative-rodeo even though the repository is derivative_rodeo. Any of the following require statements will work:

  • require 'derivative_rodeo'
  • require 'derivative-rodeo'
  • require 'derivative/rodeo'

And then execute: $ bundle install

Be aware that you need the pdfinfo command-line tool installed to run this gem's specs or to use its PDF functionality.

Usage

TODO

Technical Overview of the DerivativeRodeo

Generators

Generators are responsible for ensuring that we have the file associated with the generator. For example, the HocrGenerator is responsible for ensuring that we have the .hocr file in the expected storage location.

Interface(s)

Generators must provide an initializer and methods that build the derivatives:

  • .new(array_of_file_urls, output_location_template, preprocessed_location_template)
  • #generated_files :: executes the generator's actions and returns an array of files
  • #generated_uris :: executes the generator's actions and returns an array of output URIs
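To make the interface above concrete, here is a toy class with the same three entry points. It is only a sketch: UpcaseGenerator is a hypothetical class that does not inherit from anything in the gem, the "templates" are simplified to plain directory paths, and the derivative is just an upper-cased copy of each input.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical generator (not the gem's base class) that satisfies the
# interface listed above.
class UpcaseGenerator
  def initialize(array_of_file_urls, output_location_template, preprocessed_location_template = nil)
    @input_paths = array_of_file_urls
    # Simplification: both "templates" are plain directories here.
    @output_dir = output_location_template
    @preprocessed_dir = preprocessed_location_template
  end

  # Runs the generator (once) and returns the local paths of the derivatives.
  def generated_files
    @generated_files ||= @input_paths.map { |path| build(path) }
  end

  # Runs the generator (once) and returns one file:// URI per derivative.
  def generated_uris
    generated_files.map { |path| "file://#{path}" }
  end

  private

  def build(path)
    FileUtils.mkdir_p(@output_dir)
    name = "#{File.basename(path)}.upcase"
    destination = File.expand_path(File.join(@output_dir, name))
    preprocessed = @preprocessed_dir && File.join(@preprocessed_dir, name)
    if preprocessed && File.exist?(preprocessed)
      # Reuse an already derived file instead of regenerating it.
      FileUtils.cp(preprocessed, destination)
    else
      File.write(destination, File.read(path).upcase)
    end
    destination
  end
end
```

Because generated_files memoizes its result, calling generated_uris afterwards does not run the derivation a second time.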

Supported Generators

Below is the current list of generators.

  • HocrGenerator :: generates Tesseract files (.hocr) from images; also creates monochrome files as a pre-step
  • MonochromeGenerator :: converts images to monochrome
  • CopyGenerator :: sends a set of URIs to another location, for example from S3 to SQS or from the file system to S3
  • PdfSplitGenerator :: splits a PDF into one image per page
  • WordCoordinatesGenerator :: creates a JSON file representing the words and coordinates (derived from the .hocr file)

Registered Generators

TODO: We want to expose a list of registered generators

Storage Locations

Storage locations are where we put things. Each location has a specific implementation but is expected to inherit from the DerivativeRodeo::StorageLocation::BaseLocation.

The DerivativeRodeo::StorageLocation::BaseLocation.locations method tracks the registered locations.

The location represents where the file should be.
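One common Ruby pattern for a base class that tracks its registered subclasses is the inherited hook. The sketch below illustrates that pattern only; the Sketch* class names are hypothetical and the gem's actual bookkeeping may differ.

```ruby
# Hypothetical classes illustrating one way a base class can track its
# registered subclasses.
class SketchBaseLocation
  @locations = []

  class << self
    # Registered location names, always read from the base class.
    def locations
      SketchBaseLocation.instance_variable_get(:@locations)
    end

    # Ruby calls this hook each time a subclass is defined.
    def inherited(subclass)
      super
      # "SketchFileLocation" becomes "file"
      name = subclass.name.sub(/\ASketch/, "").sub(/Location\z/, "").downcase
      SketchBaseLocation.locations << name
    end
  end
end

class SketchFileLocation < SketchBaseLocation; end
class SketchS3Location < SketchBaseLocation; end

puts SketchBaseLocation.locations.inspect
```

Defining a new subclass is enough to register it; no separate registration call is needed.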

Supported Storage Locations

Storage locations follow a URI pattern:

  • file:// :: “local” file system storage
  • s3:// :: AWS’s S3 storage system
  • sqs:// :: AWS’s SQS
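Since locations are identified by URI scheme, the standard library's URI module is enough to tell them apart. The following is a sketch: SKETCH_SCHEMES and storage_scheme_for are hypothetical names, and the gem's own lookup may work differently.

```ruby
require "uri"

# Hypothetical lookup table; the descriptions mirror the list above.
SKETCH_SCHEMES = {
  "file" => "local file system storage",
  "s3" => "AWS S3 storage",
  "sqs" => "AWS SQS"
}.freeze

# Returns the scheme of a location URI, raising for unsupported schemes.
def storage_scheme_for(location_uri)
  scheme = URI.parse(location_uri).scheme
  return scheme if SKETCH_SCHEMES.key?(scheme)

  raise ArgumentError, "unsupported storage location: #{location_uri}"
end

puts storage_scheme_for("file:///tmp/GUID/file.jpg")
puts storage_scheme_for("s3://some-bucket/GUID/file.jpg")
```

The bucket name some-bucket is made up for the example. A URI with any other scheme, or a bare path with no scheme at all, raises ArgumentError.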

Templates

Throughout the code you'll see reference to the following concepts:

  • input_location_template
  • output_location_template
  • preprocessed_location_template

In Process Life Cycle we discussed the input_location, output_location, and preprocessed_location. A template provides flexibility in mapping one location to another.

Examples of mapping one file path to another are:

  • I want to copy https://hello.com/world/GUID/file.jpg to file:///tmp/GUID/file.jpg.
  • I want to transform file:///tmp/GUID/file.jpg to file:///tmp/GUID/file.hocr; that is run OCR on an image and write a .hocr file.
  • I want to use the file:///tmp/GUID/file.hocr to generate a file:///tmp/GUID/file.coordinates.json; that is convert the HOCR file to a coordinates.json file.

See DerivativeRodeo::Service::ConvertUriViaTemplateService for more details.
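The three mappings listed above can be imitated with a small template expander. This is a sketch under loud assumptions: the {{dir}}, {{basename}}, and {{extension}} placeholders and the convert_uri_via_sketch_template method are invented for illustration and are not the syntax or API of the gem's service.

```ruby
require "uri"

# Hypothetical placeholder syntax (NOT the gem's): {{dir}} is the name of the
# file's parent directory, {{basename}} the file name without its extension,
# and {{extension}} the extension including its leading dot.
def convert_uri_via_sketch_template(from_uri, template)
  path = URI.parse(from_uri).path
  extension = File.extname(path)
  values = {
    "dir" => File.basename(File.dirname(path)),
    "basename" => File.basename(path, extension),
    "extension" => extension
  }
  # Replace each {{name}} with its value; unknown names raise KeyError.
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { values.fetch(Regexp.last_match(1)) }
end

puts convert_uri_via_sketch_template(
  "https://hello.com/world/GUID/file.jpg",
  "file:///tmp/{{dir}}/{{basename}}{{extension}}"
)
puts convert_uri_via_sketch_template(
  "file:///tmp/GUID/file.jpg",
  "file:///tmp/{{dir}}/{{basename}}.hocr"
)
puts convert_uri_via_sketch_template(
  "file:///tmp/GUID/file.hocr",
  "file:///tmp/{{dir}}/{{basename}}.coordinates.json"
)
```

The point of a template is that the same generator code can be aimed at a different destination by changing only a string.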

Development

  • Check out the repository: git clone https://github.com/scientist-softserv/derivative_rodeo
  • Install dependencies: cd derivative_rodeo; bundle install
  • Install git hooks: rake install_hooks
  • Install binaries:
    • pdfinfo: provided by poppler (e.g. brew install poppler)
    • Ghostscript (i.e. gs): run brew install gs

Then go about writing your code and documentation.

The git hooks call rake default which will:

Logging in Test Environment

Throughout the DerivativeRodeo we log some activity. In a typical test run those logs would be overly chatty. If you want the more verbose logs, run the following: DEBUG=t rspec.
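An environment-variable switch like DEBUG is commonly wired to a logger's level. The sketch below shows that general pattern with Ruby's standard Logger; sketch_logger is a hypothetical helper, and how the gem actually configures its logger is not shown here.

```ruby
require "logger"
require "stringio"

# Hypothetical helper: verbose logging when DEBUG is set, near silence
# otherwise. The env argument defaults to the process environment.
def sketch_logger(io, env = ENV)
  logger = Logger.new(io)
  logger.level = env["DEBUG"] ? Logger::DEBUG : Logger::FATAL
  logger
end

quiet = StringIO.new
sketch_logger(quiet, {}).debug("generator started")

chatty = StringIO.new
sketch_logger(chatty, { "DEBUG" => "t" }).debug("generator started")

puts quiet.string.empty?
puts chatty.string.include?("generator started")
```

Passing the environment in as an argument keeps the helper easy to exercise in a spec without mutating ENV.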

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative_rodeo.