Re^2: Merging intervals; Chaining intervals

Replies are listed 'Best First'.
Re^3: Merging intervals; Chaining intervals by bliako (Abbot) on Sep 20, 2018 at 12:22 UTC


When all the intervals are sorted on a line, i.e. as per traverse()'s output: A1, B1, A2, B2, A3, A4, I would create a sub which takes four parameters: the current interval, the previous one, the next one, and a user-specified min-distance. For example:

sub merge {
  my ($pre, $cur, $nex, $mindist) = @_;
  # here write your logic for merging either cur+pre or cur+nex,
  # or no merging at all, given mindist
  my $decision_to_merge = ...
  if( $decision_to_merge == 1 ){ ... merge the intervals ... }
}

Then make a sub to return all the sorted intervals as an array:

sub get_sorted_intervals_as_array {
        my $atree = shift;
        my @ret = ();
        $atree->traverse(
                # specify a func to be run on every node as returned by traverse()
                sub {
                        my $anode = $_[0];
                        print "traverse() : ".$anode->{interval}." => ".$anode->str()."\n";
                        push(@ret, $anode);
                }
        );
        return @ret;
}

And here is how the merging can happen:

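my @ints = get_sorted_intervals_as_array($tree);
# start at 1 and stop one short of the end so that
# every $cur has both a previous and a next neighbour
for(my $i=1; $i<scalar(@ints)-1; $i++){
   my $pre = $ints[$i-1];
   my $cur = $ints[$i];
   my $nex = $ints[$i+1];
   merge($pre, $cur, $nex, 1234); # 1234 user-specified distance
}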
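To make the neighbour walk concrete, here is a minimal runnable sketch that uses plain hashrefs instead of tree nodes. The helpers gap() and choose_neighbour(), the 'from'/'to' fields and the "nearest neighbour wins, if its gap is at most the given distance" rule are all assumptions for illustration, not part of the tree API or of your actual merging rules:

```perl
use strict;
use warnings;

# Hypothetical helper: gap between two intervals (hashrefs with
# 'from' and 'to'); 0 if they touch or overlap.
sub gap {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x) if $y->{from} < $x->{from};
    my $d = $y->{from} - $x->{to};
    return $d > 0 ? $d : 0;
}

# Hypothetical decision rule: pick the nearer neighbour of $cur,
# but only if its gap is at most $mindist.
# Returns 'pre', 'nex' or undef (no merge).
sub choose_neighbour {
    my ($pre, $cur, $nex, $mindist) = @_;
    my $dpre = gap($pre, $cur);
    my $dnex = gap($cur, $nex);
    my ($best, $dist) = $dpre <= $dnex ? ('pre', $dpre) : ('nex', $dnex);
    return $dist <= $mindist ? $best : undef;
}

# Already sorted, as traverse() would return them.
my @ints = (
    { from => 1000, to => 2000 },
    { from => 2100, to => 3000 },
    { from => 9000, to => 9500 },
    { from => 9600, to => 9900 },
);

my @decisions;
for my $i (1 .. $#ints - 1) {
    my $choice = choose_neighbour(@ints[$i-1, $i, $i+1], 500);
    push @decisions, defined $choice ? $choice : 'none';
}
print "@decisions\n";   # pre nex
```

The loop only records a decision for each interior interval; actually replacing two intervals by their merge while iterating needs extra care, because the array changes under the loop.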

Now the above is a very rough sketch and still requires you to encode your logic into the merge() sub. You also need to decide what a 'merge' means: does it create a new interval with new labels and discard the two parent intervals? For complex intervals you can create a data structure to hold the data, maybe a class or a very simple hashtable.

Here is a sketch as a hashtable; it's fairly easy to convert it to a Perl class once you agree on the fields:

my %Interval  = (
   'chromosome' => 'Chr1',
   'from' => 1000,
   'to' => 4000,
    # to represent A1 (?):
   'type' => 'A',
   'id' => 1
);
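For comparison, the same fields wrapped in a small class could look roughly like this. The package name Interval and the label() and str() methods are my own choices for the sketch, so adjust them to whatever fields you settle on:

```perl
package Interval;
use strict;
use warnings;

# Constructor: expects the same keys as the hash above.
sub new {
    my ($class, %args) = @_;
    for my $field (qw(chromosome from to type id)) {
        die "missing field '$field'" unless defined $args{$field};
    }
    my $self = { %args };
    return bless $self, $class;
}

# Read-only accessors
sub chromosome { $_[0]{chromosome} }
sub from       { $_[0]{from} }
sub to         { $_[0]{to} }
sub type       { $_[0]{type} }
sub id         { $_[0]{id} }

# Label such as "A1" (type followed by id)
sub label {
    my $self = shift;
    return $self->{type} . $self->{id};
}

# String form "Chr1 1000 4000 A1"
sub str {
    my $self = shift;
    return join ' ', $self->chromosome, $self->from, $self->to, $self->label;
}

package main;

my $a1 = Interval->new(
    chromosome => 'Chr1', from => 1000, to => 4000,
    type => 'A', id => 1,
);
print $a1->str, "\n";   # Chr1 1000 4000 A1
```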

Then you can start making subs to operate on these intervals, e.g.:

use List::Util qw(min max);

sub can_merge {
  my ($i1, $i2) = @_;
  # only intervals of different type may be merged
  return $i1->{'type'} ne $i2->{'type'}
}
sub merge {
  my ($i1, $i2, $distance) = @_;
  my $newi = {};
  if( $distance ... ){ return undef } # no merge happened because distance etc.
  if( ! can_merge($i1, $i2) ){ return undef } # no merge, types incompatible.
  $newi->{from} = List::Util::min($i1->{from}, $i2->{from});
  $newi->{to} = List::Util::max($i1->{to}, $i2->{to});
  ...
  return $newi # return the new interval representing the merge
}
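Here is one way the elided parts could be filled in so the subs actually run. The distance rule (gap between the two intervals must not exceed $distance), the same-chromosome check and the concatenated type label of the merged interval are assumptions of mine; substitute your own rules:

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Same rule as above: only intervals of different type may merge.
sub can_merge {
    my ($i1, $i2) = @_;
    return $i1->{type} ne $i2->{type};
}

# Returns a new interval hashref, or undef if no merge happened.
sub merge {
    my ($i1, $i2, $distance) = @_;

    # Assumption: never merge across chromosomes.
    return unless $i1->{chromosome} eq $i2->{chromosome};

    # Assumption: gap is measured between the nearest ends;
    # it is negative when the intervals overlap.
    my $gap = max($i1->{from}, $i2->{from}) - min($i1->{to}, $i2->{to});
    return if $gap > $distance;

    return unless can_merge($i1, $i2);

    my $newi = {
        chromosome => $i1->{chromosome},
        from       => min($i1->{from}, $i2->{from}),
        to         => max($i1->{to},   $i2->{to}),
        type       => $i1->{type} . $i2->{type},   # hypothetical label
    };
    return $newi;
}

my $a1 = { chromosome => 'Chr1', from => 1000,  to => 4000,  type => 'A', id => 1 };
my $b1 = { chromosome => 'Chr1', from => 4500,  to => 6000,  type => 'B', id => 1 };
my $a2 = { chromosome => 'Chr1', from => 20000, to => 21000, type => 'A', id => 2 };

my $merged   = merge($a1, $b1, 1000);     # gap 500, types differ
my $too_far  = merge($b1, $a2, 1000);     # gap 14000
my $same_typ = merge($a1, $a2, 100000);   # both type A

printf "%s %d %d %s\n", @{$merged}{qw(chromosome from to type)};   # Chr1 1000 6000 AB
print defined $too_far  ? "merged\n" : "no merge (distance)\n";
print defined $same_typ ? "merged\n" : "no merge (type)\n";
```

Returning a fresh hashref and leaving the parents untouched keeps the question of whether to discard them up to the caller.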

So, the bottom line: if you have complex merging rules, or your rules may become complex in the future, or you want to experiment with different rules to see the outcome, then you may take the OO approach, i.e. create a data structure as simple as a hash, or preferably a Perl class, to represent an Interval (a Feature?) of the form Chr1 42000 44000 A4. It will keep your code tidier and more compartmentalised, little boxes so to speak. Unfortunately, if you have trillions of those items, then you must consider very lean data structures.
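As one illustration of what "lean" could mean, the sketch below keeps only the from/to coordinates, packed as two 32-bit unsigned integers per interval in a single string, instead of one hash per interval. The sub names and the 32-bit limit (coordinates below 2**32) are assumptions for the sketch; chromosome, type and id would need their own compact encoding:

```perl
use strict;
use warnings;

use constant RECORD => 'N N';   # from, to as 32-bit big-endian unsigned ints
use constant RECLEN => 8;       # bytes per record

my $store = '';                 # all intervals live in this one string

# Append one interval to the store.
sub add_interval {
    my ($from, $to) = @_;
    $store .= pack(RECORD, $from, $to);
    return;
}

# Fetch interval number $idx (0-based) as a (from, to) list.
sub get_interval {
    my ($idx) = @_;
    return unpack(RECORD, substr($store, $idx * RECLEN, RECLEN));
}

sub count_intervals {
    return length($store) / RECLEN;
}

add_interval(1000, 4000);
add_interval(42000, 44000);

my ($from, $to) = get_interval(1);
print count_intervals(), " intervals; second is $from-$to\n";
# 2 intervals; second is 42000-44000
```

This trades convenience for memory: each interval costs 8 bytes, but every access goes through unpack.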
