Re: Merging intervals; Chaining intervals

Why not use a combination of bedtools and parsing using Perl?

Bedtools is an incredibly powerful tool for handling intervals. In my experience, there is almost always a way to accomplish complex interval arithmetic using bedtools.

Problem 2 ("chaining") first, because it is easiest using Bedtools:

Steps:

1. Combine the file 1 and file 2 and sort them as you already mentioned in your question. (Make sure to separate the columns by tabs.)

2. Run the following bedtools merge command on the combined-sorted file:

bedtools merge -d 7000 -c 4 -o distinct -i combinedFile

The output:

Chr1    1000    4000    A1
Chr1    12000   32000   A2,A3,B1,B2
Chr1    42000   44000   A4
[download]

I guess one way to exclude the cases where there are runs of As or runs of Bs is to write a Perl parser that makes sure that that is not the case!

Problem 1: this needs some Perl parsing following bedtools processing. But it's much easier than writing a Perl code from scratch.

Steps:

1. Keep File 1 and File 2 separate. (Make sure the columns are tab-separated, and the file is sorted by "Chr" column and first coordinate column.)

2. Run the following bedtools closest command on the input files.

# bedtools closest -a File1 -b File2 -d
Chr1    1000    4000    A1      Chr1    12000   18500   B1      8001
Chr1    15000   22000   A2      Chr1    12000   18500   B1      0
Chr1    30000   32000   A3      Chr1    28000   29000   B2      1001
Chr1    42000   44000   A4      Chr1    28000   29000   B2      13001
[download]

The 'closest' command finds, for each feature in A, a closest feature in B. The option -d gives you distance between each pair of A-B intervals.

3. You can sort this output by the distance column (#9), which will put the closest intervals first. You can parse this output to first print out the closest interval, while keeping (in a hash) a track of which "interval ids" (column #4) are already printed. For all subsequent lines, you can look up in the hash if either of the interval ids (in a pair of A-B intervals) is already printed. If so, print the interval for the remaining interval id. Below I provide an example code that does exactly this.

While, as you will see below, my code does exactly what you want for the toy input files, you might need to edit my bedtools command or my example Perl code to accommodate the complexities that might be absent in your toy example inputs.

I wrote a Perl one-liner, so that I could pipe the output of bedtools into Perl. Alternatively, you can store the output of bedtools into a file and write a Perl parser in a Perl file that reads from the bedtools output file.

Code for sorting bedtools output by the distance column.

# bedtools closest -a File1 -b File2 -d | sort -k9,9n
Chr1    15000   22000   A2      Chr1    12000   18500   B1      0
Chr1    30000   32000   A3      Chr1    28000   29000   B2      1001
Chr1    1000    4000    A1      Chr1    12000   18500   B1      8001
Chr1    42000   44000   A4      Chr1    28000   29000   B2      13001
[download]

Perl one-liner combined with bedtools command:

# bedtools closest -a ta -b tb -d | sort -k9,9n | perl -ne 'BEGIN{$lim
+itedDistance=10000; use List::Util qw(min max); my %hash=();} @a=spli
+t; $idA=$a[3]; $idB=$a[7]; $dis=$a[8]; if($hash{$idA} and !$hash{$idB
+}){print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; $hash{$idB}=1;}  if($hash{$i
+dB} and ! $hash{$idA}){print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; $hash{$i
+dA}=1; } if(!$hash{$idA} and ! $hash{$idB}){ if($dis > $limitedDistan
+ce){ print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; print "$a[4]\t$a[5]\t$a[6]
+\t$a[7]\n";} else{ $start=min($a[1], $a[5]); $end=max($a[2],$a[6]); p
+rint "$a[0]\t$start\t$end\t$a[3],$a[7]\n"; $hash{$idA}=1; $hash{$idB}
+=1; } } ' | sort -k1,1 -k2,2n
[download]

Output:

Chr1    1000    4000    A1
Chr1    12000   22000   A2,B1
Chr1    28000   32000   A3,B2
Chr1    42000   44000   A4
[download]

Extended Perl code for better readability:

use warnings;
use strict;
use List::Util qw(min max);
my %hash=();
my $limitedDistance = 10000;
open(INP, "bedtoolsOutput");
while(<INP>) {
        my @a=split;
        my $idA=$a[3];
        my $idB=$a[7];
        my $dis=$a[8];
        if($hash{$idA} and !$hash{$idB}){
                print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n";
                $hash{$idB}=1;
        }
        if($hash{$idB} and ! $hash{$idA}){
                print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n";
                $hash{$idA}=1;
        }
        if(!$hash{$idA} and ! $hash{$idB}){
                if($dis > $limitedDistance) {
                        print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n";
                        print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n";
                }
                else {
                        my $start=min($a[1], $a[5]);
                        my $end=max($a[2],$a[6]);
                        print "$a[0]\t$start\t$end\t$a[3],$a[7]\n";
                        $hash{$idA}=1;
                        $hash{$idB}=1;
                }
        }
}
[download]

EDIT: To give you an example of cases where my example code may not work optimally: in case where there are intervals in B that are not closest to any of the interval in A; such intervals of B won't be printed!

HTH!

Comment on Re: Merging intervals; Chaining intervals Select or Download Code