Re: Merging intervals; Chaining intervals

Good luck (*) with IntervalTree.

The idea is that you insert your items in a tree, their interval being the discriminating characteristic.

Traversing the tree will be in the same order as if you lined up all those intervals (on a line).

In traverse()'s order, calculate distance between an item's "before" and "after" neighbour and develop the logic to merge them or not.

Here is a quick demo, notice the line-up of the items at the end:

#!/usr/bin/perl

#
# author: Bliako
# date: 10/09/2018
# for: https://perlmonks.org/?node_id=1222031
#

use strict;
use warnings;

use IntervalTree;
use Data::Dumper;

# create the tree
my $tree = IntervalTree->new();

while(<DATA>){
        chomp;
        next if $_ =~ /^\s*$/;
        my @items = split /\s+/;
        $tree->insert(@items[1,2,3]);
}
close(DATA);
print_tree($tree);

# print all the intervals
sub     print_tree {
        my $atree = shift;
        $atree->traverse(
                # specify a func to be run on every node as returned b
+y traverse()
                sub {
                        my $anode = $_[0];
                        print "traverse() : ".$anode->{interval}." => 
+".$anode->str()."\n";
                }
        );
}

__DATA__
Chr1 30000 32000 A3
Chr1 28000 29000 B2
Chr1 12000 18500 B1
Chr1 1000 4000 A1
Chr1 42000 44000 A4
Chr1 15000 22000 A2
[download]

Result:

traverse() : (A1) => Node(1000, 4000)
traverse() : (B1) => Node(12000, 18500)
traverse() : (A2) => Node(15000, 22000)
traverse() : (B2) => Node(28000, 29000)
traverse() : (A3) => Node(30000, 32000)
traverse() : (A4) => Node(42000, 44000)
[download]

(*) : personally I found this module very frustrating to work with as it mixes together Node objects, Interval objects and Interval names. I had to modify the source of _seek_left and _seek_right to save $self rather than $self->{interval}. But for your modest need, the above will do nicely. There are even worst alternatives, search at cpan.

If you are a Bioinformatician and have more serious bioinformatics needs I am available for hire - in the spirit of appeasing and keeping content the general anonymous public. If you are not, well I can always help for free when I can as I always do.

bw, bliako

Edit - Clarification: my program lines up all your items in a line wrt their intervals (compare Result above and your sample output A1, B1, A2, B2, A3, A4.). The program will not do the merging as I was too lazy to read your requirements and translate to code. Now all you have to do is to check each item with its 2 or 4 neighbours (for when A neighbours A) following your rules. The important thing is that the "line-up" is done using the Interval Tree in O(n log n) time and will persist for classifying unknown intervals too.

Comment on Re: Merging intervals; Chaining intervals Select or Download Code

Replies are listed 'Best First'.
Re^2: Merging intervals; Chaining intervals by onlyIDleft (Scribe) on Sep 19, 2018 at 23:31 UTC
Thank you bliako, and apologies for my delayed response. Was forced to travel due to hurricane Florence! Do you have suggestions for how to parse the traverse output to a. pair start-stop intervals between types of features, not within a type, within user specified distance of one another b. chain intervals so that there must be be at least one each of both feature types, possibly more, also within user specified distance of one another	[reply]
Re^3: Merging intervals; Chaining intervals by bliako (Abbot) on Sep 20, 2018 at 12:22 UTC
When all the intervals are sorted on a line, i.e. as per traverse()'s output: A1, B1, A2, B2, A3, A4, I would create a sub which takes in 4 intervals as params: the current interval, its previous and its next and also a user-specified min-distance. For example: `sub merge { my ($pre,$cur,$nex, $mindist) = @_; # here write your logic for merging either cur+pre or cur+nex or no +merging at all given mindist my $decicion_to_merge = ... if( $decicion_to_merge == 1 ){ ... merge the intervals ... } }` [download] then make a sub to return all sorted Intervals as an array: `sub get_sorted_intervals_as_array { my $atree = shift; my @ret = (); $atree->traverse( # specify a func to be run on every node as returned b +y traverse() sub { my $anode = $_[0]; print "traverse() : ".$anode->{interval}." => +".$anode->str()."\n"; push(@ret, $anode); } ); return @ret }` [download] and here is how the merging can happen: `my @ints = get_sorted_intervals_as_array($tree); for($i=1;$i<scalar(@ints)-1;$i++){ $pre = $ints[$i-1]; $cur = $ints[$i]; $nex = $ints[$i+1]; merge($pre, $cur, $nex, 1234); # 1234 user-specified distance }` [download] Now the above is a very rough sketch and still requires you to encode your logic into the merge() sub. It also still needs you to decide what a 'merge' means: does it create a new interval, with new labels and discards the two parent intervals? For complex interval you can create a data structure to hold the data, maybe a class or a very simple hashtable. Here is a sketch as a hashtable, it's fairly easy to convert it to a Perl class once you agree on the fields: `my %Interval = ( 'chromosome' => 'Chr1', 'from' => 1000, 'to' => 4000, # to represent A1 (?): 'type' => 'A', 'id' => 1 );` [download] Then you can start making subs to operate on these intervals, e.g. : `sub can_merge { my ($i1, $i2) = @_; return $i1->{'type'} ne $2->{'type'} } sub merge { my ($i1, $i2, $distance) = @_; my $newi = {}; if( $distance ... ){ return undef } # no merge happened because dist +ance etc. if( ! can_merge($i1, $i2) ){ return undef } # no merge, types incomp +atible. $newi->{from} = List::Util::min($i1->{from}, $i2->{from}); $newi->{to} = List::Util::max($i1->{to}, $i2->{to}); ... return $newi # return the new interval representing the merge }` [download] So, the bottom line: if you have complex merging rules or may be your rules become complex in the future or you may want to experiment on different rules to see the outcome, then you may take the OO approach, i.e. create a data structure as simple as a hash or more preferably a Perl class to represent an Interval (a Feature?) of the form `Chr1 42000 44000 A4`. It will keep your code tidier and more compartmental, little boxes so to speak. Unfortunately if you have trillions of those items, then you must consider very lean data structures.	[reply] [d/l] [select]