Merging intervals; Chaining intervals

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks!

I have two bioinformatics problems involving GFF3 and BED file formats, with no obvious off the shelf solution from BedTools or BedOps. The problems are:

Problem 1. Merge coordinates between two BED files, but not within them, at less than or equal to user-specified distance of separation

Problem 2. Chain coordinates between two BED files, at less than or equal to user-specified distance of separation

I have not written code yet, because I wonder if there are Perl or even BioPerl modules to process generic numerical intervals, or more specifically genomic intervals. Are there? I could not find any. I provide an example below for expected solution to problems 1 and 2. Thank you!

File 1 - Genomic or numerical coords of feature type A - A1, A2, ... An

File 2 - Genomic or numerical coords of feature type B - B1, B2, ...Bm

Let's say the concatenated Files 1 and 2, followed by sort looks like this:

Chr1 1000 4000 A1
Chr1 12000 18500 B1
Chr1 15000 22000 A2
Chr1 28000 29000 B2
Chr1 30000 32000 A3
Chr1 42000 44000 A4
[download]

Problem 1 - Report merged coordinates for pairs of AB separated by not more than 10KB

note that the merged intervals cannot be for consecutive features of ONLY As or Bs, it MUST be pairs of A+B, i.e. only one of each of A and B, OR they can be singletons of A or B that are UNpaired

Chr1 1000 4000 A1
Chr1 12000 22000 B1,A2
Chr1 28000 32000 B2,A3
Chr1 42000 44000 A4
[download]

# In the last line of the expected solution shown above, though separated by 10KB, which is user-specified limit, A4 cannot be paired with A3, since the pairs have to be strictly ABs

# A1 and B1 are not merged here because B1 and A2 overlap, and have no distance of separation between them, and therefore, the pairing of B1 with A2 is prioritized over pairing of B1 with A1

Problem 2 - Report coordinates for collapsed or chained intervals which contain As and Bs, component pairs of which are not separated by more than 7KB, in the example result below.

Here the "chained" coords may be singletons that cannot be chained due to distance, or when chained, can be runs of the same type, i.e., A1,A2 or B1,B2,B3

Chr1 1000 4000 A1
Chr1 12000 32000 B1,A2,B2,A3
Chr1 42000 44000 A4
[download]

# In the last line of expected solution shown above, separation is 10KB from A3, else it would've been a longer "chain" of B1,A2,B2,A3,A4

Comment on Merging intervals; Chaining intervals Select or Download Code

Replies are listed 'Best First'.
Re: Merging intervals; Chaining intervals by choroba (Cardinal) on Sep 10, 2018 at 11:37 UTC
I don't understand bioinformatics in such a detail (despite participating in Rosalind.info), so let me ask: in Problem 1, why are B2 and A3 merged, but not A1 and B1? B2 and A3 are separated by 1000, A1 and B1 are separated by 8000, which is still less than 10K. ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l]
Re^2: Merging intervals; Chaining intervals by onlyIDleft (Scribe) on Sep 10, 2018 at 13:54 UTC
Thank you for your insightful question. I added a comment line in my original post, also copy / pasted below. You understand bioinfromatics plenty :) Your thoughts about how to go about this parsing , after my update? Thanks! # A1 and B1 are not merged here because B1 and A2 overlap, and have no distance of separation between them, and therefore, the pairing of B1 with A2 is prioritized over pairing of B1 with A1	[reply]
Re^2: Merging intervals; Chaining intervals by Anonymous Monk on Sep 10, 2018 at 12:27 UTC
I don't understand bioinformatics in such a detail Neither do so called bioinformaticians	[reply]
Re: Merging intervals; Chaining intervals by bliako (Abbot) on Sep 10, 2018 at 15:23 UTC
Good luck () with IntervalTree. The idea is that you insert your items in a tree, their interval being the discriminating characteristic. Traversing the tree will be in the same order as if you lined up all those intervals (on a line). In traverse()'s order, calculate distance between an item's "before" and "after" neighbour and develop the logic to merge them or not. Here is a quick demo, notice the line-up of the items at the end: #!/usr/bin/perl # # author: Bliako # date: 10/09/2018 # for: https://perlmonks.org/?node_id=1222031 # use strict; use warnings; use IntervalTree; use Data::Dumper; # create the tree my $tree = IntervalTree->new(); while(<DATA>){ chomp; next if $_ =~ /^\s$/; my @items = split /\s+/; $tree->insert(@items[1,2,3]); } close(DATA); print_tree($tree); # print all the intervals sub print_tree { my $atree = shift; $atree->traverse( # specify a func to be run on every node as returned b +y traverse() sub { my $anode = $_[0]; print "traverse() : ".$anode->{interval}." => +".$anode->str()."\n"; } ); } __DATA__ Chr1 30000 32000 A3 Chr1 28000 29000 B2 Chr1 12000 18500 B1 Chr1 1000 4000 A1 Chr1 42000 44000 A4 Chr1 15000 22000 A2 [download] Result: `traverse() : (A1) => Node(1000, 4000) traverse() : (B1) => Node(12000, 18500) traverse() : (A2) => Node(15000, 22000) traverse() : (B2) => Node(28000, 29000) traverse() : (A3) => Node(30000, 32000) traverse() : (A4) => Node(42000, 44000)` [download] () : personally I found this module very frustrating to work with as it mixes together Node objects, Interval objects and Interval names. I had to modify the source of `_seek_left` and `_seek_right` to save `$self` rather than `$self->{interval}`. But for your modest need, the above will do nicely. There are even worst alternatives, search at cpan. If you are a Bioinformatician and have more serious bioinformatics needs I am available for hire - in the spirit of appeasing and keeping content the general anonymous public. If you are not, well I can always help for free when I can as I always do. bw, bliako Edit - Clarification*: my program lines up all your items in a line wrt their intervals (compare Result above and your sample output A1, B1, A2, B2, A3, A4.). The program will not do the merging as I was too lazy to read your requirements and translate to code. Now all you have to do is to check each item with its 2 or 4 neighbours (for when A neighbours A) following your rules. The important thing is that the "line-up" is done using the Interval Tree in O(n log n) time and will persist for classifying unknown intervals too.	[reply] [d/l] [select]
Re^2: Merging intervals; Chaining intervals by onlyIDleft (Scribe) on Sep 19, 2018 at 23:31 UTC
Thank you bliako, and apologies for my delayed response. Was forced to travel due to hurricane Florence! Do you have suggestions for how to parse the traverse output to a. pair start-stop intervals between types of features, not within a type, within user specified distance of one another b. chain intervals so that there must be be at least one each of both feature types, possibly more, also within user specified distance of one another	[reply]
Re^3: Merging intervals; Chaining intervals by bliako (Abbot) on Sep 20, 2018 at 12:22 UTC
When all the intervals are sorted on a line, i.e. as per traverse()'s output: A1, B1, A2, B2, A3, A4, I would create a sub which takes in 4 intervals as params: the current interval, its previous and its next and also a user-specified min-distance. For example: `sub merge { my ($pre,$cur,$nex, $mindist) = @_; # here write your logic for merging either cur+pre or cur+nex or no +merging at all given mindist my $decicion_to_merge = ... if( $decicion_to_merge == 1 ){ ... merge the intervals ... } }` [download] then make a sub to return all sorted Intervals as an array: `sub get_sorted_intervals_as_array { my $atree = shift; my @ret = (); $atree->traverse( # specify a func to be run on every node as returned b +y traverse() sub { my $anode = $_[0]; print "traverse() : ".$anode->{interval}." => +".$anode->str()."\n"; push(@ret, $anode); } ); return @ret }` [download] and here is how the merging can happen: `my @ints = get_sorted_intervals_as_array($tree); for($i=1;$i<scalar(@ints)-1;$i++){ $pre = $ints[$i-1]; $cur = $ints[$i]; $nex = $ints[$i+1]; merge($pre, $cur, $nex, 1234); # 1234 user-specified distance }` [download] Now the above is a very rough sketch and still requires you to encode your logic into the merge() sub. It also still needs you to decide what a 'merge' means: does it create a new interval, with new labels and discards the two parent intervals? For complex interval you can create a data structure to hold the data, maybe a class or a very simple hashtable. Here is a sketch as a hashtable, it's fairly easy to convert it to a Perl class once you agree on the fields: `my %Interval = ( 'chromosome' => 'Chr1', 'from' => 1000, 'to' => 4000, # to represent A1 (?): 'type' => 'A', 'id' => 1 );` [download] Then you can start making subs to operate on these intervals, e.g. : `sub can_merge { my ($i1, $i2) = @_; return $i1->{'type'} ne $2->{'type'} } sub merge { my ($i1, $i2, $distance) = @_; my $newi = {}; if( $distance ... ){ return undef } # no merge happened because dist +ance etc. if( ! can_merge($i1, $i2) ){ return undef } # no merge, types incomp +atible. $newi->{from} = List::Util::min($i1->{from}, $i2->{from}); $newi->{to} = List::Util::max($i1->{to}, $i2->{to}); ... return $newi # return the new interval representing the merge }` [download] So, the bottom line: if you have complex merging rules or may be your rules become complex in the future or you may want to experiment on different rules to see the outcome, then you may take the OO approach, i.e. create a data structure as simple as a hash or more preferably a Perl class to represent an Interval (a Feature?) of the form `Chr1 42000 44000 A4`. It will keep your code tidier and more compartmental, little boxes so to speak. Unfortunately if you have trillions of those items, then you must consider very lean data structures.	[reply] [d/l] [select]
Re: Merging intervals; Chaining intervals by ravipatel4 (Novice) on Sep 22, 2018 at 17:20 UTC
Why not use a combination of bedtools and parsing using Perl? Bedtools is an incredibly powerful tool for handling intervals. In my experience, there is almost always a way to accomplish complex interval arithmetic using bedtools. Problem 2 ("chaining") first, because it is easiest using Bedtools: Steps: 1. Combine the file 1 and file 2 and sort them as you already mentioned in your question. (Make sure to separate the columns by tabs.) 2. Run the following `bedtools merge` command on the combined-sorted file: `bedtools merge -d 7000 -c 4 -o distinct -i combinedFile` The output: `Chr1 1000 4000 A1 Chr1 12000 32000 A2,A3,B1,B2 Chr1 42000 44000 A4` [download] I guess one way to exclude the cases where there are runs of As or runs of Bs is to write a Perl parser that makes sure that that is not the case! Problem 1: this needs some Perl parsing following bedtools processing. But it's much easier than writing a Perl code from scratch. Steps: 1. Keep File 1 and File 2 separate. (Make sure the columns are tab-separated, and the file is sorted by "Chr" column and first coordinate column.) 2. Run the following `bedtools closest` command on the input files. `# bedtools closest -a File1 -b File2 -d Chr1 1000 4000 A1 Chr1 12000 18500 B1 8001 Chr1 15000 22000 A2 Chr1 12000 18500 B1 0 Chr1 30000 32000 A3 Chr1 28000 29000 B2 1001 Chr1 42000 44000 A4 Chr1 28000 29000 B2 13001` [download] The 'closest' command finds, for each feature in A, a closest feature in B. The option -d gives you distance between each pair of A-B intervals. 3. You can sort this output by the distance column (#9), which will put the closest intervals first. You can parse this output to first print out the closest interval, while keeping (in a hash) a track of which "interval ids" (column #4) are already printed. For all subsequent lines, you can look up in the hash if either of the interval ids (in a pair of A-B intervals) is already printed. If so, print the interval for the remaining interval id. Below I provide an example code that does exactly this. While, as you will see below, my code does exactly what you want for the toy input files, you might need to edit my bedtools command or my example Perl code to accommodate the complexities that might be absent in your toy example inputs. I wrote a Perl one-liner, so that I could pipe the output of bedtools into Perl. Alternatively, you can store the output of bedtools into a file and write a Perl parser in a Perl file that reads from the bedtools output file. Code for sorting bedtools output by the distance column. `# bedtools closest -a File1 -b File2 -d \| sort -k9,9n Chr1 15000 22000 A2 Chr1 12000 18500 B1 0 Chr1 30000 32000 A3 Chr1 28000 29000 B2 1001 Chr1 1000 4000 A1 Chr1 12000 18500 B1 8001 Chr1 42000 44000 A4 Chr1 28000 29000 B2 13001` [download] Perl one-liner combined with bedtools command: # bedtools closest -a ta -b tb -d \| sort -k9,9n \| perl -ne 'BEGIN{$lim +itedDistance=10000; use List::Util qw(min max); my %hash=();} @a=spli +t; $idA=$a[3]; $idB=$a[7]; $dis=$a[8]; if($hash{$idA} and !$hash{$idB +}){print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; $hash{$idB}=1;} if($hash{$i +dB} and ! $hash{$idA}){print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; $hash{$i +dA}=1; } if(!$hash{$idA} and ! $hash{$idB}){ if($dis > $limitedDistan +ce){ print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; print "$a[4]\t$a[5]\t$a[6] +\t$a[7]\n";} else{ $start=min($a[1], $a[5]); $end=max($a[2],$a[6]); p +rint "$a[0]\t$start\t$end\t$a[3],$a[7]\n"; $hash{$idA}=1; $hash{$idB} +=1; } } ' \| sort -k1,1 -k2,2n [download] Output: `Chr1 1000 4000 A1 Chr1 12000 22000 A2,B1 Chr1 28000 32000 A3,B2 Chr1 42000 44000 A4` [download] Extended Perl code for better readability: use warnings; use strict; use List::Util qw(min max); my %hash=(); my $limitedDistance = 10000; open(INP, "bedtoolsOutput"); while(<INP>) { my @a=split; my $idA=$a[3]; my $idB=$a[7]; my $dis=$a[8]; if($hash{$idA} and !$hash{$idB}){ print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; $hash{$idB}=1; } if($hash{$idB} and ! $hash{$idA}){ print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; $hash{$idA}=1; } if(!$hash{$idA} and ! $hash{$idB}){ if($dis > $limitedDistance) { print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; } else { my $start=min($a[1], $a[5]); my $end=max($a[2],$a[6]); print "$a[0]\t$start\t$end\t$a[3],$a[7]\n"; $hash{$idA}=1; $hash{$idB}=1; } } } [download] EDIT: To give you an example of cases where my example code may not work optimally: in case where there are intervals in B that are not closest to any of the interval in A; such intervals of B won't be printed! HTH!	[reply] [d/l] [select]