Project

tripleloop

No commit activity in last 3 years
No release in over 3 years
Simple tool for extracting RDF triples from Ruby hashes

Tripleloop

A DSL for extracting data from hash-like objects into RDF statements (i.e. triples or quads).

Usage

Start by creating some extractor classes. Each extractor maps one or more document fragments to RDF statements.

class ArticleCoreExtractor < Tripleloop::Extractor
  bind(:doi) { |doc| RDF::DOI.send(doc[:doi]) }

  map(:title)          { |title|   [doi, RDF::DC11.title, title, RDF::NPGG.articles] }
  map(:published_date) { |date |   [doi, RDF::DC11.date, Date.parse(date), RDF::NPGG.articles] }
  map(:product)        { |product| [doi, RDF::NPG.product, RDF::NPGP.nature, RDF::NPGG.articles] }
end

class SubjectsExtractor < Tripleloop::Extractor
  bind(:doi) { |doc| RDF::DOI.send(doc[:doi]) }

  map(:subjects) { |subjects|
    subjects.map { |s|
      [doi, RDF::NPG.hasSubject, RDF::NPGS.send(s) ]
    }
  }
end

Once defined, extractors can be composed into a DocumentProcessor class.

class NPGProcessor < Tripleloop::DocumentProcessor
  extractors :article_core, :subjects
end

The processor can then be fed a collection of hash-like documents and returns RDF data grouped by extractor name.

data = NPGProcessor.batch_process(documents)
=> { :article_core => [[<RDF::URI:0x00000002651ce0(http://dx.doi.org/10.1038/481241e)>, 
                        <RDF::URI:0x1b0c060(http://purl.org/dc/elements/1.1/title)>, 
                       "Developmental biology: Watching cells die in real time"],...], 
     :subjects => [...] }
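To make the shape of the input and of the grouped output concrete, here is a minimal plain-Ruby sketch. It does not use Tripleloop or RDF.rb: the two lambdas are hypothetical stand-ins for the extractors above, the predicate strings are placeholders for the RDF vocabulary terms, and the published date is a made-up sample value.

```ruby
require "date"

# Hypothetical sample input: the keys mirror the fragments mapped by the
# extractors above. The date is a placeholder, not real article metadata.
documents = [
  { :doi            => "10.1038/481241e",
    :title          => "Developmental biology: Watching cells die in real time",
    :published_date => "2012-01-01",
    :product        => "nature",
    :subjects       => ["developmental_biology"] }
]

# Plain-Ruby stand-ins for the two extractors (assumption: strings are used
# in place of RDF::URI terms, so no RDF library is needed to run this).
extractors = {
  :article_core => lambda { |doc|
    [[doc[:doi], "dc:title", doc[:title]],
     [doc[:doi], "dc:date",  Date.parse(doc[:published_date])]]
  },
  :subjects => lambda { |doc|
    doc[:subjects].map { |s| [doc[:doi], "npg:hasSubject", s] }
  }
}

# Collect the statements produced for every document under the extractor name,
# which is the same overall shape as the batch_process result shown above.
data = extractors.each_with_object({}) do |(name, extractor), grouped|
  grouped[name] = documents.flat_map { |doc| extractor.call(doc) }
end
```

The result is a hash keyed by extractor name whose values are arrays of plain three-element (or, with a graph, four-element) arrays.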

Notice that the output returned by the batch_process method is still a plain Ruby data structure, not an instance of RDF::Statement. Instantiating RDF statements and writing them to disk is the responsibility of the Tripleloop::RDFWriter class, which can be used as follows:

Tripleloop::RDFWriter.new(data, :dataset_path => Pathname.new("my-datasets")).write

This will create the following two files:

  • my-datasets/article_core.nq
  • my-datasets/subjects.nq
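As a rough illustration of how one file per extractor follows from the :dataset_path option, the sketch below builds the output paths with Ruby's standard Pathname class. It is an assumption about the naming scheme (extractor name plus extension inside the dataset directory), not code taken from Tripleloop.

```ruby
require "pathname"

# Same directory as passed via :dataset_path in the call above.
dataset_path = Pathname.new("my-datasets")

# Assumption: each extractor name becomes one file inside the dataset directory.
extractor_names = [:article_core, :subjects]
output_paths = extractor_names.map { |name| dataset_path.join("#{name}.nq") }

output_paths.each { |path| puts path }
```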

When the #write method is executed, RDFWriter internally generates RDF statements, delegating the serialisation itself to RDF.rb's RDF::Writer. The only logic in Tripleloop::RDFWriter#write is choosing the right RDF serialisation format and file extension. When all the RDF statements generated by an extractor also specify a graph (as ArticleCoreExtractor does in the example above), the writer uses RDF::NQuads::Writer; otherwise it falls back to RDF::NTriples::Writer.
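The format choice described above can be sketched in plain Ruby as follows. The helper name serialisation_for is hypothetical and this is not Tripleloop's actual implementation; it assumes statements are plain arrays with three elements for a triple and four when a graph is given, and that an empty list falls back to N-Triples.

```ruby
# Hypothetical helper: choose a serialisation format for one extractor's output.
# Returns N-Quads only when every statement carries a fourth (graph) element.
def serialisation_for(statements)
  all_have_graph = !statements.empty? && statements.all? { |s| s.length == 4 }

  if all_have_graph
    { :format => :nquads, :extension => "nq" }
  else
    # Assumption: mixed or empty output falls back to N-Triples.
    { :format => :ntriples, :extension => "nt" }
  end
end

quads   = [["s", "p", "o", "g"], ["s", "p2", "o2", "g"]]
triples = [["s", "p", "o"]]
mixed   = quads + triples

puts serialisation_for(quads)[:format]
puts serialisation_for(triples)[:format]
puts serialisation_for(mixed)[:format]
```

Checking every statement, rather than only the first, matters because a single statement without a graph cannot be written as a quad with a named graph.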