0.0
No commit activity in last 3 years
No release in over 3 years
SimpleBioC is a simple parser / builder for BioC data format. BioC is a simple XML format to share text documents and annotations. You can find more information about BioC from the official BioC web site (http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/BioC/)
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 1.3
>= 12.3.3
~> 3.2
~> 0.1
>= 0.9.20

Runtime

~> 2.3
~> 1.6
 Project Readme

SimpleBioC

SimpleBioC is a simple parser / builder for BioC data format. BioC is a simple XML format to share text documents and annotations. You can find more information about BioC from the official BioC web site (http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/BioC/)

Feature:

  • Parse & convert a BioC document to an object instance compatible to BioC DTD
  • Use plain ruby objects for simplicity
  • Build a BioC document from an object instance
  • Convert BioC to PubAnnotation JSON

Installation

Add this line to your application's Gemfile:

gem 'simple_bioc'

And then execute:

$ bundle

Or install it yourself as:

$ gem install simple_bioc

Simple Usages

Include library

require 'simple_bioc'

Parse with a file name (path)

collection = SimpleBioC::from_xml(filename)

Traverse & Manipulate Data. Data structure are almost the same as the DTD. Please refer library documents and the BioC DTD.

    puts collection.documents[2].passages[0].text

Build XML text from data

    puts SimpleBioC::to_xml(collection)

Convert PubAnnotation JSON from data

    puts SimpleBioC::to_pubann(collection, {
      sourcedb: 'PubMed', 
      target: 'http://pubannotation.org/docs/sourcedb/PubMed/sourceid/18034444', 
      project: 'Ab3P-abbreviations'
    }))

Options

Specify set of <document>s to parse

You can parse only a set of document elements in a large xml document instead of parsing all the document elements. It may decrease the processing time. For example, the following code will return a collection with two documents ("1234", "4567").

    collection = SimpleBioc::from_xml(filename, {documents: ["1234", "4567"]})

No whitespace in output

By default, outputs of SimpleBioC::to_xml() will be formatted with whitespace. If you do not want this whitespace, you should pass 'save_with' option with 0 to the to_xml() function.

    puts SimpleBioC::to_xml(collection, {save_with:0})

Sample

More samples can be found in Samples directory

    require 'simple_bioc'

    # Sample1: parse, traverse, manipulate, and build BioC data
    require 'simple_bioc'
    
    # parse BioC file
    collection = SimpleBioC.from_xml("../xml/everything.xml")
    
    # the returned object contains all the information in the BioC file
    # traverse & read information
    collection.documents.each do |document|
      puts document
      document.passages.each do |passage|
        puts passage
      end
    end
    
    # manipulate 
    doc = SimpleBioC::Document.new(collection)
    doc.id = "23071747"
    doc.infons["journal"] = "PLoS One"
    collection.documents << doc
    
    p = SimpleBioC::Passage.new(doc)
    p.offset = 0
    p.text = "TRIP database 2.0: a manually curated information hub for accessing TRP channel interaction network."
    p.infons["type"] = "title"
    doc.passages << p
    
    
    # build BioC document from data
    xml = SimpleBioC.to_xml(collection)
    puts xml

Sample2: PubAnnotation Converter (convert_pubann.rb)

# convert document to PubAnnotation JSON
require 'simple_bioc'

if ARGF.argv.size < 1
  puts "usage: ruby convert_pubann.rb {filepath}"
  exit
end

collection = SimpleBioC::from_xml(ARGF.argv[0])
puts SimpleBioC::to_pubann(collection, {
  sourcedb: 'PubMed', 
  target: 'http://pubannotation.org/docs/sourcedb/PubMed/sourceid/18034444', 
  project: 'Ab3P-abbreviations'
})

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

LICENSE

Copyright © 2013, Dongseop Kwon

Released under the MIT License.