bio-twobit

This is a Ruby binding for lib2bit (https://github.com/dpryan79/lib2bit), which provides high-speed access to genomic data in the 2bit file format.

Bio::TwoBit is a Ruby interface to 2bit files.

Ruby bindings to lib2bit / py2bit.

2bit files are used to store and index DNA sequences, usually of entire reference genomes. The 2bit format is a compact binary representation of DNA sequences that is used by the UCSC Genome Browser.
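To make "compact binary representation" concrete, here is a minimal sketch of packing four bases into one byte. It uses the base codes from the UCSC 2bit specification (T=0, C=1, A=2, G=3). The helper names `pack_2bit` and `unpack_2bit` are hypothetical and are not part of bio-twobit or lib2bit; real files also contain a header, a sequence index, and N/mask block tables that this sketch ignores.

```ruby
# Illustrative only: 2 bits per base, four bases per byte.
CODES = { "T" => 0, "C" => 1, "A" => 2, "G" => 3 }.freeze
BASES = CODES.invert.freeze

# Pack an uppercase ACGT string into bytes, with the first base of each
# group of four in the most significant bits.
def pack_2bit(seq)
  seq.each_char.each_slice(4).map do |quad|
    byte = 0
    4.times do |i|
      code = CODES.fetch(quad[i] || "T") # pad an incomplete last byte with T (code 0)
      byte = (byte << 2) | code
    end
    byte
  end.pack("C*")
end

# Unpack the first `length` bases from a packed byte string.
def unpack_2bit(bytes, length)
  bases = bytes.unpack("C*").flat_map do |byte|
    [6, 4, 2, 0].map { |shift| BASES[(byte >> shift) & 0b11] }
  end
  bases.first(length).join
end

packed = pack_2bit("GTAGCTAGCTGA")
puts packed.bytesize         # 3 (12 bases fit in 3 bytes)
puts unpack_2bit(packed, 12) # GTAGCTAGCTGA
```

Because the packed bytes cannot represent N or lowercase, the format records those separately as blocks, which is what the hard-masked and soft-masked block queries later in this README expose.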

Installation

gem install bio-twobit

Linux and macOS are supported. Windows is currently not supported.

Usage

require 'bio/twobit'

# hg38 = Bio::TwoBit.open("hg38.2bit")
hg38 = Bio::TwoBit::Hg38.new

hg38.path
# "hg38.2bit"

hg38.info
# {"file_size"=>818064875,
# "nChroms"=>640,
# "sequence_length"=>3272116950,
# "hard_masked_length"=>161368694,
# "soft_masked_length"=>0}

hg38.chroms.take(5)
# [["chr1", 248956422],
# ["chr2", 242193529],
# ["chr3", 198295559],
# ["chr4", 190214555],
# ["chr5", 181538259]]

Fetch a sequence

hg38.sequence("chr1", 50000, 50050)
# "AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCC" # length 50
  • The first number is the 0-based position on the chromosome/contig where the sequence begins.
  • The second number is the 1-based position on the chromosome/contig where the sequence ends. Equivalently, the range is half-open, [start, end), so the call above returns end - start = 50 bases.
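The coordinate convention in the two points above matches Ruby's exclusive range on a plain string. A small sketch, using a toy in-memory sequence and a hypothetical helper `slice_half_open` (not part of the gem):

```ruby
# A toy in-memory "chromosome"; the gem reads from the 2bit file instead.
chrom = "ACGTACGTAC"

# start is 0-based and inclusive, stop is exclusive, so the result
# always has stop - start characters (when the range is in bounds).
def slice_half_open(seq, start, stop)
  seq[start...stop]
end

puts slice_half_open(chrom, 2, 6)        # GTAC
puts slice_half_open(chrom, 2, 6).length # 4
```
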
hg38.bases("chr1", 10000, 10100)
# {"A"=>0.34, "C"=>0.49, "T"=>0.17, "G"=>0.0}

hg38.bases("chr1", 10000, 10100, fraction: false)
# {"A"=>34, "C"=>49, "T"=>17, "G"=>0}

hg38.bases("chr1") 
# {"A"=>0.26940569141052323,
# "C"=>0.19302592242428676,
# "T"=>0.2701041550155312,
# "G"=>0.19325280952182064}
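As a rough picture of what the base statistics above represent, the following sketch tallies bases in an ordinary Ruby string. `base_composition` is a hypothetical helper, not the gem's implementation. It divides by the full sequence length, which is an assumption consistent with the whole-chromosome fractions above summing to less than 1 (hard-masked N bases occupy length but are not counted).

```ruby
# Count A/C/T/G in a string; other characters (such as N) are ignored
# in the counts but still contribute to the length used for fractions.
def base_composition(seq, fraction: true)
  counts = { "A" => 0, "C" => 0, "T" => 0, "G" => 0 }
  seq.each_char { |b| counts[b] += 1 if counts.key?(b) }
  return counts unless fraction

  total = seq.length.to_f
  counts.transform_values { |n| n / total }
end

# The 50-base sequence from the "Fetch a sequence" example.
seq = "AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCC"
counts = base_composition(seq, fraction: false)
puts counts.map { |base, n| "#{base}=#{n}" }.join(", ") # A=18, C=9, T=13, G=10
```
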

hg38.hard_masked_blocks("chr1", 0, 1000000)
# [[0, 10000], [207666, 257666], [297968, 347968], [535988, 585988]]
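Hard-masked blocks are runs of N, reported as [start, end) pairs. The sketch below finds such runs in a plain string to show the shape of the result. `n_blocks` is a hypothetical helper; the library reads block tables stored in the 2bit file rather than scanning the sequence like this.

```ruby
# Return [start, end) pairs for every run of N in seq.
# `offset` shifts the coordinates, e.g. when seq is a slice of a chromosome.
def n_blocks(seq, offset = 0)
  blocks = []
  seq.scan(/N+/) do
    m = Regexp.last_match
    blocks << [offset + m.begin(0), offset + m.end(0)]
  end
  blocks
end

p n_blocks("NNNNACGTNNAC") # [[0, 4], [8, 10]]
```
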

Close the 2bit file explicitly when you are done with it, or pass a block to Bio::TwoBit.open. A file that is never closed will probably be closed when the object is garbage-collected, but this is not guaranteed.

# Explicitly close the file.
tb = Bio::TwoBit.open("test/fixtures/foo.2bit")
tb.close

# You can also use blocks.
Bio::TwoBit.open("test/fixtures/foo.2bit") do |t|
  p t.info
end
tb.closed? # true / false
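The block form follows a common Ruby idiom: yield the handle, then close it in an ensure clause so that it is released even if the block raises. The sketch below shows that idiom with a stand-in class, `ToyHandle`, which is hypothetical. Whether Bio::TwoBit.open is implemented exactly this way is an assumption.

```ruby
# Stand-in for a file-backed object with open/close semantics.
class ToyHandle
  def initialize(path)
    @path = path
    @closed = false
  end

  # Without a block, return the handle and leave closing to the caller.
  # With a block, close the handle when the block finishes or raises.
  def self.open(path)
    handle = new(path)
    return handle unless block_given?

    begin
      yield handle
    ensure
      handle.close
    end
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

seen = nil
ToyHandle.open("foo.2bit") { |h| seen = h }
puts seen.closed?   # true

manual = ToyHandle.open("foo.2bit")
puts manual.closed? # false
manual.close
puts manual.closed? # true
```
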

To include information about soft-masked bases, specify masked: true when opening the file. Soft-masked bases are then returned in lowercase.

tb = Bio::TwoBit.open("test/fixtures/foo.2bit")
tb.sequence("chr1", 60, 72)
# => "GTAGCTAGCTGA"

tb = Bio::TwoBit.open("test/fixtures/foo.2bit", masked: true)
tb.sequence("chr1", 60, 72)
# => "GTagctagctGA"
tb.soft_masked_blocks("chr1")
# => [[62, 70]]
tb.masked? # true / false
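The relationship between the soft-masked blocks and the lowercase output above can be sketched as follows. `apply_soft_mask` is a hypothetical helper that lowercases the parts of a fetched sequence that overlap [start, end) mask blocks; it reproduces the example above but is not the gem's code.

```ruby
# seq:    uppercase sequence fetched from position `start` on the chromosome
# blocks: [start, end) soft-mask blocks in chromosome coordinates
def apply_soft_mask(seq, start, blocks)
  out = seq.upcase
  blocks.each do |b_start, b_end|
    # Translate to coordinates within seq and clip to its bounds.
    from = [b_start - start, 0].max
    to = [b_end - start, out.length].min
    next if from >= to

    out[from...to] = out[from...to].downcase
  end
  out
end

puts apply_soft_mask("GTAGCTAGCTGA", 60, [[62, 70]]) # GTagctagctGA
```
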

hg19, hg38, hs1...

Some reference genomes are provided as classes in advance. These classes automatically download 2bit files from the UCSC site into a cache directory upon first use.

hg19 = Bio::TwoBit::Hg19.new
hg38 = Bio::TwoBit::Hg38.new
hs1  = Bio::TwoBit::Hs1.new
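The download-on-first-use behaviour described above can be pictured as a download-once cache. This is a sketch under stated assumptions: `cached_file` and the injected `downloader` callable are hypothetical, and the gem's actual cache directory, URLs, and download code (adapted from Red Datasets) are not reproduced here. The downloader is injected so the example runs without network access.

```ruby
require "fileutils"
require "tmpdir"

# Return the path of `name` inside `cache_dir`, calling `downloader`
# only when the file is not already cached.
def cached_file(cache_dir, name, downloader)
  path = File.join(cache_dir, name)
  unless File.exist?(path)
    FileUtils.mkdir_p(cache_dir)
    tmp = "#{path}.part"
    downloader.call(tmp)   # write under a temporary name first
    File.rename(tmp, path) # so a partial download is never treated as complete
  end
  path
end

calls = 0
fake_download = ->(dest) { calls += 1; File.write(dest, "placeholder") }

Dir.mktmpdir do |dir|
  cached_file(dir, "hg38.2bit", fake_download)
  cached_file(dir, "hg38.2bit", fake_download)
  puts calls # 1 (the second call finds the cached file)
end
```
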

Adding a new reference genome is easy: add the id of the genome you want to use to the ids list in lib/bio/twobit/references/template.erb, then regenerate the classes and reinstall.

git clone https://github.com/kojix2/bio-twobit
vi lib/bio/twobit/references/template.erb # Add your id to ids list.
ruby lib/bio/twobit/references/template.erb
rake install
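To illustrate the general idea of generating one class per genome id from a template, here is a small ERB sketch. It does not reproduce the gem's real template.erb: the template text, the `ReferenceTemplate` superclass name, and the generated class layout are all hypothetical.

```ruby
require "erb"

# Hypothetical list of genome ids; adding an entry yields one more class.
ids = %w[hg19 hg38 hs1]

# trim_mode "-" lets <%- ... -%> lines disappear from the output.
template = ERB.new(<<~'TEMPLATE', trim_mode: "-")
  <%- ids.each do |id| -%>
  class <%= id.capitalize %> < ReferenceTemplate; end
  <%- end -%>
TEMPLATE

puts template.result(binding)
```

The class names are derived with String#capitalize, so "hg38" becomes Hg38, matching the naming of the classes shown earlier.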

If you want to use 2bit files from locations other than UCSC, create your own classes.

Pull requests are welcome.

Development

Bug reports and pull requests are welcome on GitHub at https://github.com/kojix2/bio-twobit.

Do you need commit rights to this repository?
Do you want admin rights so that you can take over the project?
If so, please feel free to contact @kojix2.

License

The gem is available as open source under the terms of the MIT License.

Code from Red Datasets (MIT License) is used for automatic file download and caching.