Project: cwords - The Ruby Toolbox

Cwords is a software tool that measure correlations of short oligonucleotide sequence ('words') occurrences and i.e. expression changes in a given experiment. In summary, it produces a statistic that a given word is overrepresented near the extremity of a ranked list of sequences.

1. REQUIREMENTS

All software components require Ruby (>=1.8.6, http://www.ruby-lang.org/) and JRuby (>=1.4.0, http://jruby.org/).
The software has only been tested on a Unix platform.

2. INSTALL

* Install Ruby (www.ruby-lang.org, check if it is already installed: 'ruby -v')
* Install Java (www.java.com, check if it is already installed: 'java -version'),
* Install JRruby (www.jruby.org, check if it is already installed: 'jruby -v'),
make sure that you have the 'jruby' command in your path.
* Install Rubygems (www.rubygems.org, check if it is already installed: 'gem -v')

* And finally, install cwords (http://rubygems.org/gems/cwords):

> gem install cwords
> cwords --help

2. USAGE

Below is a short summary of how to use the software. The full set of options for each script can be be listed from the command line by using the '--help' flag.

The software can compute word correlations with or without the use of a precomputed word database. The advantage of using the precomputed word database is i) sequences are processed only when constructing the word database and ii) lower memory usage.

2.1. Preparing a word database

First, a word database is computed from a sequence file, i.e. 3'UTR sequences. I.e. assuming that the current directory is the base directory of this software library:

cwords_mkdb -w 4,5 -s hsa_3utrs.fa -p 8 --bg 2

This commands computes over representation of all 4 and 5-mers ('-w 4,5') in all sequences relative to a di-nucleotide sequence background model ('--bg 2'). Any higher-order sequence background model can be used ('--bg'), however it is recommended not to use a background order higher than tri-nucleotides for shorter sequences (such as 3'UTRs).
This command is computationally very intensive and it is therefore highly recommended that the analysis is carried in parallel using the '-p' option, specifying the number of processes to be used.
The result of the above command is a word database that is created in the db/ sub-directory. In the above example the database is named 'hsa_3utr_bg2'

2.2. Correlation analysis using a precomputed word database

The subsequent correlation analysis now uses this pre-computed word database:

cwords --db hsa_3utr_bg2 1 -w 4,5 -p 100 -r gene_expression_ranks.rnk

The '.rnk' file is a tab-delimited file with a gene identifier (matching identifiers in the sequence file) and a rank metric (i.e. expression change in a given experiment). Correlations are computed relative to permutations of the ranked gene list ('-p' option).

The final output of this analysis produces a summary of the top correlating words (a list for each end of the ranked list), i.e. words over-represented in beginning of list:

Top 10 words
rank word RS z-score p-value fdr ledge
1 ttatc 13.92 4.00 2.00e-03 7.94e-01 449
2 atccc 12.79 3.78 3.99e-03 6.59e-01 417
3 gtaatc 8.24 3.75 2.00e-03 5.01e-01 280
4 gtgaaa 6.77 3.68 5.99e-03 4.56e-01 137
5 cctat 10.26 3.61 3.99e-03 4.45e-01 384
6 tttatc 11.93 3.59 3.99e-03 3.88e-01 449
...

The 'z-score' is a correlation statistic for the given word after correction for correlations obtained from random gene list orderings. The 'fdr' (false discovery rate) is the estimated proportion of false discoveries for the given z-score threshold. Finally, 'ledge' is the leading-edge which denotes the position in the gene list where the maximum imbalance was measured; genes before this threshold are relatively enriched for the word compared to genes after the threshold.

2.3. On-line correlation analysis

The program can perform correlation analysis on-line without the use of a word database. A sequence file is specified using the (-s) flag.

# binary scoring, presence/absence of word
cwords -u 0 -w 5,6 -x -r seqweight.rnk -s seqs.fa -p 50 -c bin

# scoring by number of word occurrences
cwords -u 0 -w 5,6 -x -r seqweight.rnk -s seqs.fa -p 50 -c obs

Background correction, this is much slower (hours!). If your sequences are long (>500) or you repeatedly analyze the same set of sequences, it is better to first compute a word database for the set of sequences that can be reused (I have another script for this task if it is needed). Observed word occurrences are now measured relative to word occurrences in shuffled sequences.

# Mono-nucleotide dependency model
cwords -u 0 -w 5,6 -x -r seqweight.rnk -s seqs.fa -p 50 -q 50 -c pval -b 1

# Di-nucleotide model
cwords -u 0 -w 5,6 -x -r seqweight.rnk -s seqs.fa -p 50 -q 50 -c pval -b 2

2.4. Parameters

-x : rank all sequences together
-w : word lengths (comma separated), it is not recommended to go >7
-r : rank file, each line consists of sequence id and weight (i.e. expression change) separated by tabs.
-s : sequence file, fasta format, id's must match rank file
-q : number of sequence shuffles when using sequence background correction (-c 'pval', default)
-u : -u 0, specifies that results for all words are dumped in a file
-b : background nucleotide dependency model
-t : specifies the number of threads that should be used in the computation. In a multi-core environment the running time will decrease and memory usage will increase almost linearly with the number of enabled threads.

Caution: The number of permutations (-p), sequence shuffles (-q) and word length (-w) have a significant impact on the running time, but if you can wait try higher values.

2.5. Scripts for additional follow-up analysis.

The 'misc/' directory includes two scripts for follow-up analysis of the word correlations.

cluster_words.rb:
Given the output from 'cwords', this script clusters the top correlating words based on a number of different parameters (see '--help' option).

complementary_words.rb:
Computes the top correlating words that are complementary to a reference sequence (i.e. a mature miRNA sequence). Calculates the number of complementary words for each position of the sequence and compares to random and shuffled sequences.

3. LICENSE

Copyright (c) 2010, Anders Jacobsen.
The software is open source and released under the MIT license (license is included).