Scout Gear
Scout Gear is the core, higher-level module set of the Scout framework. It bundles rich, production-grade data and workflow tooling built on top of the lower-level primitives in scout-essentials, and adds domain abstractions such as TSV processing, workflows, knowledge bases, entity typing, parallel work queues, and more.
Layering:
- scout-essentials: foundational utilities used everywhere (Path, Open, CMD, IndiferentHash, Persist, Resource, etc.)
- scout-gear (this repo): TSV, Workflow, KnowledgeBase, Entity/Association, WorkQueue, Semaphore, and glue code
- Additional packages:
- scout-camp: remote servers, cloud deployments, web interfaces, cross-site operations
- scout-ai: model training and chat agents
- scout-rig: connect with other languages (e.g., Python)
Related ecosystem:
- Rbbt (Ruby bioinformatics): Many of Scout’s ideas and utilities originated in Rbbt. It still provides a broad set of bioinformatics workflows and tools. See the Rbbt-Workflows organization for many real-world examples and usage patterns:
For module-specific guides, see doc/*.md in this repository (linked below).
- TSV: doc/TSV.md
- Workflow: doc/Workflow.md
- KnowledgeBase: doc/KnowledgeBase.md
- Association: doc/Association.md
- Entity: doc/Entity.md
- WorkQueue: doc/WorkQueue.md
- Semaphore: doc/Semaphore.md
Additionally, Scout Gear reuses and exposes core facilities from scout-essentials. Summaries of those core modules are included below for convenience.
How command-line interfaces work (scout …)
Scout provides a single “scout” command that discovers and runs nested subcommands from any installed Scout package. Scripts are discovered using the Path subsystem across PATH-like roots, enabling workflows or packages to inject their own commands.
Basics:
- The CLI resolves terms left-to-right until a file is found under a scout_commands tree.
- Example: scout workflow task runs scout_commands/workflow/task
- Example: all TSV-related scripts are under scout_commands/tsv and can be listed with scout tsv
- If the path resolves to a directory instead of a script, a list of available subcommands in that directory is shown.
- Remaining ARGV is parsed by the selected script using SimpleOPT (SOPT) or compatible parsers.
- Because discovery uses Path maps, commands contributed by other packages or installed workflows are automatically found.
See the per-module CLI sections below for TSV, Workflow, and KnowledgeBase.
Scout Essentials: Core building blocks
Scout Gear depends on the following main modules from scout-essentials. You’ll use these directly for filesystem/resource orchestration, external command execution, caching, and options handling.
Path
doc/Path.md
Path is a lightweight, annotation-enabled “smart string” for composing and locating project resources across multiple search maps (current/user/global/lib/tmp, etc.). It integrates with Open and Persist.
Highlights:
- Path.setup("str") turns a String into a Path with join via [], /, or method_missing (path.foo.bar)
- Map logical locations to physical roots with path maps; find the first match across map order with path.find (and path.find_all)
- Filename helpers: get/set/replace/unset extensions; sanitize filenames; relative paths
- Directory helpers: glob and glob_all over maps; dirname/basename; realpath; newer?
- Digest summaries: path.digest_str summarizes files/dirs for logging/debugging
Usage:
p = Path.setup('share/data/myfile')
p.find # resolve across configured maps
p[:subdir, :file] # joins => share/data/subdir/file
Open
doc/Open.md
Open unifies file/stream/remote I/O, atomic writes, pipes/tees/FIFOs, (bg)zip helpers, rsync/sync, and lock handling.
Highlights:
- Open.open/read/write with auto-(de)compression for .gz/.bgz/.zip and remote urls (wget/ssh)
- Streams: open_pipe, tee_stream, consume_stream, with_fifo
- Safe writes: sensible_write (tmp + atomic rename + optional locks)
- Remote: wget with caching, ssh/scp, digest_url, remote cache
- Filesystem: mkdir/mkfiledir, mv/cp/ln/link_dir, rm/rm_rf, same_file?, exists?, writable?
- Locking: Open.lock wraps a robust Lockfile (NFS-safe) with refresh/timeout/steal
Example:
Open.sensible_write("out.txt", Open.open("http://example.com"))
Open.with_fifo { |fifo| ... }
Open.rsync("src/", "user@server:dst/", delete: true)
CMD
doc/CMD.md
CMD wraps Open3.popen3 with robust patterns for streaming, stderr logging, stdin feeding, auto-join of producers, and tool discovery/installation.
Highlights:
- CMD.cmd("tool args", pipe: true, in: io_or_string, stderr: Log::HIGH, autojoin: true)
- ConcurrentStream-enabled stdout with join/error propagation
- Convenience: CMD.bash("bash -l -c '...'"), cmd_pid/cmd_log
- Tool registry: CMD.tool, CMD.get_tool (auto-install via conda or producers), version scanning
Example:
io = CMD.cmd("cut", "-f" => 2, "-d" => " ", in: "a b", pipe: true)
io.read # => "b\n"; io.join
IndiferentHash
doc/IndiferentHash.md
Hash mixin for indifferent access (string/symbol keys equal), deep-merge, options parsing, and string<->hash conversions.
Highlights:
- IndiferentHash.setup(hash) to extend a single hash instance
- Access with h[:a] == h["a"]; delete/include? are indifferent
- Helpers: deep_merge, values_at with indifferent keys, slice, except
- Options utilities: parse_options, process_options, positional2hash, hash2string/string2hash
Example:
opts = IndiferentHash.parse_options('limit=10 title="A title"')
opts[:title] # => "A title"
Persist (core serialization/caching)
doc/Persist.md (essentials)
Typed serialization (json/yaml/marshal/binary/arrays), atomic saves, and the high-level persist pattern with locking and streaming.
Highlights:
- Persist.save/load(obj, file, type)
- Persist.persist(name, type, dir: ...) { compute_or_stream }
- Locking and tmp-to-final atomic writes
- Streaming tee: one copy to file, one to caller
- Memory cache: Persist.memory(name) { ... }
- Helpers to parse YAML/JSON/Marshal via Open
Example:
val = Persist.persist("expensive", :json) { compute_hash }
# subsequent calls load cached JSON unless :update or stale
Resource
doc/Resource.md
Resource system to claim and produce files on demand (string/proc/url/rake/installers), integrated with Path/Open and locking.
Highlights:
- claim path => (:string, :proc, :url, :rake, :install)
- Produce on demand via path.produce and path.open/read
- Rake integration: drive file tasks/rules to generate outputs
- Install software into a per-resource “software” dir and update env
Example:
module MyPkg
extend Resource
claim root.tmp.test.hello, :string, "Hello"
end
MyPkg.tmp.test.hello.read # produces if missing, then reads
Other essentials you’ll encounter:
- Annotation / AnnotatedArray / NamedArray: lightweight typed attributes on objects and arrays; named tuple-style rows
- ConcurrentStream: concurrency-aware streams with join/abort/callbacks
- SimpleOPT (SOPT): tiny CLI option DSL/parser; used by scout commands
- Log: leveled, colored logging; progress bars; fingerprint utilities
- TmpFile: temp files/dirs and stable tmp path generator for caches
Scout Gear modules
Scout Gear builds on essentials to deliver domain abstractions and engines.
TSV
doc/TSV.md
A flexible, typed table abstraction with robust parser, streaming dumper/transformer, parallel traversal, joins/attachments, identifier translation, on-disk persistence (TokyoCabinet/Tkrzw), and range/position indices.
Highlights:
- Shapes: :double, :list, :flat, :single; key_field + fields
- Parse TSV/CSV from files/streams/strings with rich header options (sep, type, cast, merge)
- Dumper/Transformer for streaming pipelines
- TSV.traverse(obj, cpus: N, into: …) for parallel iteration
- Attach, change_key, change_id, translate via identifier indices
- Persistence via TSVAdapter over HDB/BDB/Tkrzw/FWT/PKI/Sharder
- Streaming paste/concat/collapse utilities; filters with persisted sets
Example:
tsv = TSV.open(path, persist: true, type: :double)
tsv.attach(other, complete: true)
index = TSV.index(tsv, target: "FieldA")
CLI (scout tsv):
- Scripts live under scout_commands/tsv; list with scout tsv
- Run a specific subcommand: scout tsv [options] [args...]
- If you hit a directory, available subcommands are listed
- Subcommands parse options with SOPT (see each script’s help)
Workflow
doc/Workflow.md
A lightweight workflow engine. Define tasks with typed inputs and dependencies, create jobs (Steps), and run them with persistence, streaming, provenance, and orchestration under resource rules.
Highlights:
- input/dep/task DSL with helper methods; task_alias and overrides
- Jobs (Step): run/load/stream/join, info files, files_dir, provenance
- Orchestrator: schedule dependent jobs under cpus/IO constraints; retry recoverable errors; archive/erase deps per rules
- EntityWorkflow: entity-centric tasks and properties
- Queue helpers to enqueue and process jobs
Example:
module Baking
extend Workflow
task :say => :string do |name| "Hi #{name}" end
end
Baking.job(:say, "Miguel").run # => "Hi Miguel"
CLI (scout workflow):
- List workflows: scout workflow list
- Run a task: scout workflow task [--jobname NAME] [input options...]
- Options include --fork, --nostream, --update, --printpath, --provenance, --clean, --recursive_clean, --override_deps, --deploy (serial|local|queue|SLURM|server)
- Show job info: scout workflow info <step_path> [--inputs|--recursive_inputs]
- Provenance: scout workflow prov <step_path> [--plot file.png] […]
- Trace execution: scout workflow trace [options]
- Process queue: scout workflow process [filters] [--continuous] [--produce_cpus N] […]
You can also dispatch workflow-specific custom commands via:
- scout workflow cmd … (discovers scripts under /share/scout_commands/workflow)
KnowledgeBase
doc/KnowledgeBase.md
A thin orchestrator around Association, TSV, Entity, and Persist to register multiple association databases, normalize/index them, query/traverse across them, manage entity lists, and generate markdown descriptions.
Highlights:
- Register databases with source/target specs and identifier files
- get_database/get_index (BDB-backed) with undirected options
- Query: all, subset (children/parents/neighbours), identify/translate entities
- Lists: save/load/delete/enumerate typed lists
- Traversal DSL: multi-hop path finding with wildcards/conditions
- Markdown descriptions from registry/README files
Example:
kb = KnowledgeBase.new(Path.setup("var/kb"), "Hsa")
kb.register :brothers, datafile_test(:person).brothers, undirected: true
kb.children(:brothers, "Miki") # => ["Miki~Isa", ...]
CLI (scout kb):
- Configure KB: scout kb config [options]
- Register DB: scout kb register [options]
- Declare entities: scout kb entities <identifier_files>
- Show info: scout kb show []
- Query: scout kb query <entity_spec>
- Lists: scout kb list [<list_name>]
- Traverse: scout kb traverse [options] "<rules,comma,separated>"
Association
doc/Association.md
Utilities to normalize source/target field specifications from TSVs, open normalized association databases with optional identifier translation, and build pairwise “source~target” indices (optionally undirected). Also includes AssociationItem for entity-like behavior over pair strings and utilities to build incidence/adjacency matrices.
Example:
idx = Association.index(file, source: "=>Name", target: "Parent=>Name", undirected: true)
idx.match("Clei") # => ["Clei~Guille"]
idx.to_matrix # boolean incidence matrix
Entity
doc/Entity.md
Annotate plain values or arrays as entities with behavior-rich “properties”, automatic format mapping, identifier translation (Entity::Identified), array-aware property batching/caching, and persistence for property results via Persist.
Example:
module Person
extend Entity
property :greet => :single do "Hi #{self}" end
end
Person.setup("Miki").greet
WorkQueue
doc/WorkQueue.md
A multi-process work pipeline (forked workers + semaphore-guarded sockets) to parallelize processing over a stream of inputs, with robust error propagation.
Example:
q = WorkQueue.new(4){|x| x * 2}
out = []; q.process{|y| out << y}
(1..100).each{|i| q.write i}; q.close; q.join
Semaphore (ScoutSemaphore)
doc/Semaphore.md
Concurrency helpers based on POSIX named semaphores (via RubyInline C bindings), plus higher-level helpers to bound concurrency with forks/threads.
Example:
ScoutSemaphore.with_semaphore(2) do |sem|
ScoutSemaphore.synchronize(sem){ critical_work }
end
Examples and further reading
- This repository’s docs directory provides in-depth guides for each module:
- TSV: doc/TSV.md
- Workflow: doc/Workflow.md
- KnowledgeBase: doc/KnowledgeBase.md
- Association: doc/Association.md
- Entity: doc/Entity.md
- WorkQueue: doc/WorkQueue.md
- Semaphore: doc/Semaphore.md
- For numerous end-to-end examples and real datasets, explore the Rbbt-Workflows organization:
- For foundational utilities (Path, Open, CMD, IndiferentHash, Persist, Resource, etc.), consult the scout-essentials documentation:
- Those modules are summarized above and used pervasively across Scout Gear.
Notes
- Streaming everywhere: many APIs return ConcurrentStream-enabled IOs. Always read to EOF and join (or rely on autojoin) to ensure producers exit and errors are surfaced.
- Atomicity and locking: Open.sensible_write and Persist.persist use tmp+mv and lockfiles to provide robust cross-process behavior.
- Discovery and composition: the Path subsystem and Resource claims make it easy to build portable projects with on-demand production of resources and discoverable commands.