Every repository with this icon (
Every repository with this icon (
Home
LocusTree – Ruby library to search genomic loci
“Features should not know where they are” (James Bonfield, WTSI – paraphrased)
Problem setting
Imagine an application that handles features on a chromosome (e.g. genes) and lets you select a region for which to return all of those features. As the mapping information is normally stored in the feature (having a start and an end on a chromosome), you have to go through all features sequentially to determine which ones are actually in your region and should be displayed.
Or imagine a visualization like Google Maps but displaying quantitative information rather than streets and roads. Let’s say the height of the terrain. Also suppose that you have data for every square meter in the UK. When you look at the whole of the UK at once in a 800×600 pixel display you won’t be able to show all raw data but will have to average the data for large regions of the UK.
A solution
What you can do to solve the above two problems, is build a hierarchical tree, where the top node contains an aggregate (e.g. average or count) for the complete dataset (i.c. terrain height), and a small group of non-intersecting leaf nodes. These leaf nodes together should make up the whole top node. Each leaf node again contains the average of all data within it, and is subdivided into ever smaller leaf nodes, and so on.
When looking at a map of the whole of the UK, we know that we need height information for 800×600=480,000 pixels. Instead of loading the whole height dataset at this point, we just take that layer of the tree that contains 480,000 leafs or is the closest to it.
Example data structure
Suppose you have 20 ranges: 10..15, 20..25, 30..35, 40..45, … A LocusTree with a bin size of 3 would look like this (a ‘value’ for each of the ranges (e.g. read depth) is added between parentheses):
LEVEL 0 LEVEL 1 LEVEL 2 LEVEL 3 (=root)
10..15 (1) -+
20..25 (2) |- 10..35 (2) -+
30..35 (3) -+ |
40..45 (4) -+ |
50..55 (5) |- 40..65 (5) |- 10..95 (5) -+
60..65 (6) -+ | |
70..75 (7) -+ | |
80..85 (8) |- 70..95 (8) -+ |
90..95 (9) -+ |
100..105 (10) -+ |
110..115 (11) |- 100..125 (11) -+ |
120..125 (12) -+ | |- 10..205 (10.5)
130..135 (13) -+ | |
140..145 (14) |- 130..155 (14) |- 100..185 (14) |
150..155 (15) -+ | |
160..165 (16) -+ | |
170..175 (17) |- 160..185 (17) -+ |
180..185 (18) -+ |
190..195 (19) -+ |
200..205 (20) -+- 190..205 (19.5) -+- 190..205 (19.5) -+
If you would have to display this whole region, but only have 7 pixels to do it in, you would use the data in LEVEL 1. It is no use trying to cram in all raw data from LEVEL 0 because it can’t all be shown. For each parent node, the value for that node is the average of the child nodes weighted by the number of LEVEL 0 nodes that are covered by each child.
Terms
Because of the nature of the beast, several objects have to play together to provide this functionality:
- LocusTree::Node: This is the bin that contains data.
- LocusTree::Level: All bottom nodes are level 0, and are binned into a bin level 1, which in turn…
- LocusTree::Tree: Every independent scaffold (i.e. chromosome, contig, linkage_group, …) has its own tree, because it does not make sense to search for loci that span different chromosomes…
- LocusTree::Container: The main container with all data. There’s only one container.
Usage
Input data has to be in GFF format.
require 'locus_tree'
container = LocusTree::Container.new(100, 'locus_tree_100.sqlite3', 'genes.txt')
positive_nodes = container.query('2', 143570750, 143570790, 100)
positive_nodes.each do |node|
puts node.to_s + "\t" + node.value.to_s
end
puts container.query_single_bin('24', 49_050, 49_500).to_s
The above creates a database file (here named ‘locus_tree_100.sqlite3’) which can be used afterwards. So we don’t need to recreate the index.
require 'locus_tree'
container = LocusTree::Container.load_structure('locus_tree_100.sqlite3')
positive_nodes = container.query('2', 143570750, 143570790, 100)
positive_nodes.each do |node|
puts node.to_s + "\t" + node.value.to_s
end
puts container.query_single_bin('24', 49_050, 49_500).to_s
Creating this index for 37,000 loci with a node size of 100 elements on my laptop took less than 4 minutes.
Sample
There’s a sample directory with three scripts: one to build a collection of raw data, one to build the index and one to do a search. The example data is 1 million readdepth datapoints.
A very crude benchmark on different binsizes when loading 36,653 genes:
- binsize = 1000
- database size = 2.4Mb
- building index takes 3 minutes 34 seconds
- searching index (chr1 from pos 500 to 7000) takes milliseconds
- binsize = 100
- database size = 3.1Mb
- building index takes 3 minutes 57 seconds
- searching index takes milliseconds
To do
- Number of subnodes per node should not be the same as the number of basepairs in a node at the bottom level. We’d probably want e.g. 1kb per bin at the bottom level, and then group bins per two.
- Add methods to insert datapoints
- Add additional ways to aggregate data other than the counts
- Add inline documentation







