in reply to Merging intervals; Chaining intervals

Why not use a combination of bedtools and parsing using Perl?

Bedtools is an incredibly powerful tool for handling intervals. In my experience, there is almost always a way to accomplish complex interval arithmetic using bedtools.

Problem 2 ("chaining") first, because it is easiest using Bedtools:

Steps:

1. Combine the file 1 and file 2 and sort them as you already mentioned in your question. (Make sure to separate the columns by tabs.)

2. Run the following bedtools merge command on the combined-sorted file:

bedtools merge -d 7000 -c 4 -o distinct -i combinedFile

The output:

Chr1 1000 4000 A1 Chr1 12000 32000 A2,A3,B1,B2 Chr1 42000 44000 A4

I guess one way to exclude the cases where there are runs of As or runs of Bs is to write a Perl parser that makes sure that that is not the case!

Problem 1: this needs some Perl parsing following bedtools processing. But it's much easier than writing a Perl code from scratch.

Steps:

1. Keep File 1 and File 2 separate. (Make sure the columns are tab-separated, and the file is sorted by "Chr" column and first coordinate column.)

2. Run the following bedtools closest command on the input files.

# bedtools closest -a File1 -b File2 -d Chr1 1000 4000 A1 Chr1 12000 18500 B1 8001 Chr1 15000 22000 A2 Chr1 12000 18500 B1 0 Chr1 30000 32000 A3 Chr1 28000 29000 B2 1001 Chr1 42000 44000 A4 Chr1 28000 29000 B2 13001

The 'closest' command finds, for each feature in A, a closest feature in B. The option -d gives you distance between each pair of A-B intervals.

3. You can sort this output by the distance column (#9), which will put the closest intervals first. You can parse this output to first print out the closest interval, while keeping (in a hash) a track of which "interval ids" (column #4) are already printed. For all subsequent lines, you can look up in the hash if either of the interval ids (in a pair of A-B intervals) is already printed. If so, print the interval for the remaining interval id. Below I provide an example code that does exactly this.

While, as you will see below, my code does exactly what you want for the toy input files, you might need to edit my bedtools command or my example Perl code to accommodate the complexities that might be absent in your toy example inputs.

I wrote a Perl one-liner, so that I could pipe the output of bedtools into Perl. Alternatively, you can store the output of bedtools into a file and write a Perl parser in a Perl file that reads from the bedtools output file.

Code for sorting bedtools output by the distance column.
# bedtools closest -a File1 -b File2 -d | sort -k9,9n Chr1 15000 22000 A2 Chr1 12000 18500 B1 0 Chr1 30000 32000 A3 Chr1 28000 29000 B2 1001 Chr1 1000 4000 A1 Chr1 12000 18500 B1 8001 Chr1 42000 44000 A4 Chr1 28000 29000 B2 13001
Perl one-liner combined with bedtools command:
# bedtools closest -a ta -b tb -d | sort -k9,9n | perl -ne 'BEGIN{$lim +itedDistance=10000; use List::Util qw(min max); my %hash=();} @a=spli +t; $idA=$a[3]; $idB=$a[7]; $dis=$a[8]; if($hash{$idA} and !$hash{$idB +}){print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; $hash{$idB}=1;} if($hash{$i +dB} and ! $hash{$idA}){print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; $hash{$i +dA}=1; } if(!$hash{$idA} and ! $hash{$idB}){ if($dis > $limitedDistan +ce){ print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; print "$a[4]\t$a[5]\t$a[6] +\t$a[7]\n";} else{ $start=min($a[1], $a[5]); $end=max($a[2],$a[6]); p +rint "$a[0]\t$start\t$end\t$a[3],$a[7]\n"; $hash{$idA}=1; $hash{$idB} +=1; } } ' | sort -k1,1 -k2,2n
Output:
Chr1 1000 4000 A1 Chr1 12000 22000 A2,B1 Chr1 28000 32000 A3,B2 Chr1 42000 44000 A4
Extended Perl code for better readability:
use warnings; use strict; use List::Util qw(min max); my %hash=(); my $limitedDistance = 10000; open(INP, "bedtoolsOutput"); while(<INP>) { my @a=split; my $idA=$a[3]; my $idB=$a[7]; my $dis=$a[8]; if($hash{$idA} and !$hash{$idB}){ print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; $hash{$idB}=1; } if($hash{$idB} and ! $hash{$idA}){ print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; $hash{$idA}=1; } if(!$hash{$idA} and ! $hash{$idB}){ if($dis > $limitedDistance) { print "$a[0]\t$a[1]\t$a[2]\t$a[3]\n"; print "$a[4]\t$a[5]\t$a[6]\t$a[7]\n"; } else { my $start=min($a[1], $a[5]); my $end=max($a[2],$a[6]); print "$a[0]\t$start\t$end\t$a[3],$a[7]\n"; $hash{$idA}=1; $hash{$idB}=1; } } }
EDIT: To give you an example of cases where my example code may not work optimally: in case where there are intervals in B that are not closest to any of the interval in A; such intervals of B won't be printed!

HTH!