Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have 2 files.

file1:
contig1 10037203 10038203 blah
contig1 10037203 10038203 blah
contig1 10037203 10038203 blah

file2:
contig1 997329 938329 blab11
contig1 10037329 10038329 blah11
contig1 10037329 10038329 blah11

I want to get my output as below:
contig1 10037203 10038203 blah
contig1 10037203 10038203 blah
contig1 10037203 10038203 blah
contig1 10037329 10038329 blah11
contig1 10037329 10038329 blah11

i.e., if an overlap exists, first output all the overlapping lines of file1 and then those of file2. I was able to read the files line by line and compare each pair of lines, but that is not what I want. Here is what I was doing:
open FILE1,"file1" or die "can't open file 1";
open FILE2,"file2" or die "can't open file 2";
open(w1,">output");
while ((my $line1 = <FILE1>) && (my $line2 = <FILE2>)) {
    chomp $line1;
    chomp $line2;
    my @cond1 = split("\t" , $line1);
    my @cond2 = split("\t" , $line2);
    if(($cond1[1] >= $cond2[1]) && ($cond1[2] <= $cond2[2])) {
        print w1 "$line1\t$line2\n";
    }
}
close FILE1;
close FILE2;
close w1;

Re: output the overlapping regions to new file
by starX (Chaplain) on Jan 22, 2014 at 17:13 UTC
    I don't know how big your files are, so I don't know how practical of a solution this will be, but you might try saving each sequence from each file as the key of a hash (set the value to 1), and then checking to see if the keys of one hash are present in the other, and if so, save those as the keys of a third hash, which then contains the sequences that are duplicated between the two files.
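    A minimal sketch of that hash idea, assuming a "sequence" is identified by its first three whitespace-separated columns (contig, start, end). The helper names record_key and shared_keys are hypothetical. Note that this only finds records whose key is identical in both files; it does not detect ranges that merely overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: build a key from the first three columns.
# Which columns identify a "sequence" is an assumption.
sub record_key {
    my ($line) = @_;
    my @fields = split ' ', $line;    # splits on tabs or spaces
    return join "\t", @fields[0 .. 2];
}

# Return a hash ref whose keys occur in both lists of lines.
sub shared_keys {
    my ($lines1, $lines2) = @_;
    my (%seen1, %seen2, %both);
    $seen1{ record_key($_) } = 1 for @$lines1;
    $seen2{ record_key($_) } = 1 for @$lines2;
    for my $key (keys %seen1) {
        $both{$key} = 1 if exists $seen2{$key};
    }
    return \%both;
}

# Made-up demo data, not taken from the question.
my @demo1 = ("contig1 10037203 10038203 blah");
my @demo2 = ("contig1 10037203 10038203 other",
             "contig1 10037329 10038329 blah11");
my $both = shared_keys(\@demo1, \@demo2);
print "$_\n" for sort keys %$both;
```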
Re: output the overlapping regions to new file
by Laurent_R (Canon) on Jan 22, 2014 at 22:59 UTC
    Sorry, I don't completely understand what you want to do and what your data really looks like. Looking at file 1:
    contig1 10037203 10038203 blah
    contig1 10037203 10038203 blah
    contig1 10037203 10038203 blah
    When you have a range starting with 10037203 in file 1, is it the case that all lines of file 1 where the range starts with 10037203 will also finish with the same end-of-range value (10038203 in the example)? If so, you start by reading file 1 and storing in a hash the start value (as the key) and the end value (as the value). At the same time, also store all the lines with the same range start value (in another hash, for example, or in the same hash if you know how to use a hash of hashes).

    Then you read file 2 and, for each line, get the current start and end values and look for all keys of the first hash created above:

    - which are between the start and end values of the current line (cases 3 and 4 in the scheme below);

    - where the start value (key) is below the current start and the end (value) above the current start value (cases 1 and 2 below);

    - etc.

    It is probably easier to get it right with a graphic representation. You basically have four possible cases of overlap:

    Case 1:
    F1: |-----|
    F2:    |-----|

    Case 2:
    F1: |-----|
    F2:   |--|

    Case 3:
    F1:    |-----|
    F2: |-----|

    Case 4:
    F1:   |-----|
    F2: |---------|
    If you find an overlap, print the stored lines from file 1 and then all lines of file 2 which have the same start point.
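    The guidelines above could be sketched roughly as follows. This is one possible reading, not a worked-out solution: the names ranges_overlap and overlapping_lines are hypothetical, both endpoints are treated as inclusive, start <= end is assumed, and the hash is keyed on contig plus start (the contig part is an added assumption so that different contigs are never compared). All four overlap cases reduce to a single test: each range must start no later than the other one ends.

```perl
use strict;
use warnings;

# True when two inclusive ranges share at least one position.
sub ranges_overlap {
    my ($s1, $e1, $s2, $e2) = @_;
    return $s1 <= $e2 && $s2 <= $e1;
}

# Returns the overlapping lines of file 1 first, then those of file 2.
sub overlapping_lines {
    my ($lines1, $lines2) = @_;
    my (%end_for, %lines_for, @order);    # keyed by "contig\tstart"
    for my $line (@$lines1) {
        my ($contig, $start, $end) = split ' ', $line;
        my $key = "$contig\t$start";
        push @order, $key unless exists $end_for{$key};
        $end_for{$key} = $end;            # assumes one end per start
        push @{ $lines_for{$key} }, $line;
    }
    my (%hit1, @out2);
    for my $line (@$lines2) {
        my ($contig, $start, $end) = split ' ', $line;
        my $found = 0;
        for my $key (@order) {            # linear scan over file 1 keys
            my ($c1, $s1) = split /\t/, $key;
            next unless $c1 eq $contig;
            next unless ranges_overlap($s1, $end_for{$key}, $start, $end);
            $hit1{$key} = 1;
            $found = 1;
        }
        push @out2, $line if $found;
    }
    my @out1 = map { @{ $lines_for{$_} } } grep { $hit1{$_} } @order;
    return (@out1, @out2);
}

# Demo with the sample records from the question.
my @file1 = ("contig1 10037203 10038203 blah") x 3;
my @file2 = ("contig1 997329 938329 blab11",
             ("contig1 10037329 10038329 blah11") x 2);
print "$_\n" for overlapping_lines(\@file1, \@file2);
```

    The inner loop visits every stored key for every line of file 2, which is the time cost mentioned for large files.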

    If your files are large, this might become time consuming, but it might be optimized greatly if you know something more about the maximum possible ranges.

    Well, these are quick guidelines on a possible solution, I haven't worked out all the details, but I think it should more or less work.

    Another possible and completely different approach would be to sort both files according to their start values and then read them in parallel. But I should warn you that this is not quite as easy as it might look at first glance; you'll find a number of special cases to take into account.
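    One way the sorted approach might look is a sweep over both inputs merged by start value, keeping a list of ranges that are still "open" for each file. This is only a sketch under simplifying assumptions: everything is held in memory rather than streamed, all records are on a single contig, start <= end, and endpoints are inclusive. The names parse_records and sweep_overlaps are hypothetical.

```perl
use strict;
use warnings;

# Turn "contig start end label" lines into records tagged with their source.
sub parse_records {
    my ($source, $lines) = @_;
    my @records;
    for my $line (@$lines) {
        my (undef, $start, $end) = split ' ', $line;
        push @records, { src => $source, start => $start, end => $end,
                         line => $line, hit => 0 };
    }
    return @records;
}

# Returns overlapping lines of file 1 first, then those of file 2.
sub sweep_overlaps {
    my ($lines1, $lines2) = @_;
    my @file1 = parse_records(1, $lines1);
    my @file2 = parse_records(2, $lines2);
    # Visit all records from both files in order of start value.
    my @sorted = sort { $a->{start} <=> $b->{start} } @file1, @file2;
    my %active = (1 => [], 2 => []);    # ranges still open, per source
    for my $rec (@sorted) {
        my $other = $rec->{src} == 1 ? 2 : 1;
        # Drop ranges of the other file that ended before this one starts.
        $active{$other} =
            [ grep { $_->{end} >= $rec->{start} } @{ $active{$other} } ];
        # What remains started no later than $rec and is still open,
        # so each of those ranges overlaps $rec.
        for my $open (@{ $active{$other} }) {
            $open->{hit} = 1;
            $rec->{hit}  = 1;
        }
        push @{ $active{ $rec->{src} } }, $rec;
    }
    # Report in original input order within each file.
    return map { $_->{line} } grep { $_->{hit} } @file1, @file2;
}

# Demo with the sample records from the question.
my @sample1 = ("contig1 10037203 10038203 blah") x 3;
my @sample2 = ("contig1 997329 938329 blab11",
               ("contig1 10037329 10038329 blah11") x 2);
print "$_\n" for sweep_overlaps(\@sample1, \@sample2);
```

    Records spread over several contigs would need to be grouped by contig first, which is one of the special cases this sketch leaves out.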

Re: output the overlapping regions to new file
by bioinformatics (Friar) on Jan 24, 2014 at 00:21 UTC

    There are plenty of tools to do this already, such as bedtools; you can use pybedtools (a Python wrapper for bedtools) and perform the comparisons in IPython. While it does not use Perl, pybedtools will allow you to compare multiple BED files at the same time (i.e., more than 2), and is both fast and simple to use. While what you are trying to do isn't bad, there are better tools for the job.

    Bioinformatics