comment on

Hello all,
I've written a bit of code to try and locate possible matches between two large lists of values. Specifically, these values are genomic locations/regions, and I'm looking for all that are less than 50 (base-pairs) away. With more than 100,000 values per list, this can be a bit time consuming in CPU cycles, and I'm experimenting to find a better method (read faster) to do this, as I have many such files to make this comparison for. My current code is below, and I welcome your thoughts on optimization, as I might be better off implementing a heap rather than a sort function (which takes a really long time, although a different sort function might be the answer as well). Thanks :).

#!/usr/bin/perl -slw
use strict;
use integer;
use vars qw( %tf1 %tf2 );

## Establish an array of all chromosome files.
my @files = ("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "
+chr8", "chr9",
             "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "ch
+r16", "chr17",
             "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chr
+Y");

## Loop through each file, calculate the overlaps, and output to file.
foreach (@files) {
    print STDOUT scalar localtime;
    open TF1, "<", "TF1.$_.gff" or die "Cannot open the TF1 hit file!"
+;  ## tf1 hit file
    open TF2, "<", "TF2.$_.gff" or die "Cannot open the TF2 hit file!"
+; ## tf2 hit file
    open OUT, ">", "both.$_.gff" or die "Cannot open the output file!"
+;    ## output file
    
    while (<TF1>) {
        chomp;
        my ($location, undef, $type, $start, $end, $score, $strand, un
+def) = split "\t"; ## Parse
        my ($chr, $location_numbers) = split "_", $location;          
+                          ## Get the chromosome
        my ($target_start, $target_end) = split "-", $location_numbers
+; ## Get the start site for the 10kb sequence "blasted" against
        my $new_start = $target_start + $start;  ## Use this for true 
+genomic location
        my $new_end = $target_start + $end;      ## Use thie for true 
+genomic location
        
        ## Build a hash and put the data inside;
        $tf1{ $new_start }[0] = $chr;
        $tf1{ $new_start }[1] = $type; ## Transcription factor
        $tf1{ $new_start }[2] = $new_end;
        $tf1{ $new_start }[3] = $score; ## Alignment score from PWM bl
+ast
        $tf1{ $new_start }[4] = $strand; ## Strand where the match is 
+found
    }
    
    while (<TF2>) {
        chomp;
        my ($location, undef, $type, $start, $end, $score, $strand, un
+def) = split "\t"; ## Parse
        my ($chr, $location_numbers) = split "_", $location;          
+                          ## Get the chromosome
        my ($target_start, $target_end) = split "-", $location_numbers
+; ## Get the start site for the 10kb sequence "blasted" against
        my $new_start = $target_start + $start;  ## Use this for true 
+genomic location
        my $new_end = $target_start + $end;      ## Use thie for true 
+genomic location
        
        ## Build a hash and put the data inside;
        $tf2{ $new_start }[0] = $chr;
        $tf2{ $new_start }[1] = $type; ## Transcription factor
        $tf2{ $new_start }[2] = $new_end;
        $tf2{ $new_start }[3] = $score; ## Alignment score from PWM bl
+ast
        $tf2{ $new_start }[4] = $strand; ## Strand where the match is 
+found
    }
    
    ## Close the unused files
    close TF1;
    close TF2;
    
    OUTER:foreach my $key_c (sort keys %tf2) {
        my @sort_array = ();
    my @pointer_array = ();
        my $start_c = $key_c;
        my $end_c = $tf2{ $key_c }[2];
        INNER:foreach my $key_r (sort keys %tf1) {
            my $start_r = $key_r;
            my $end_r = $tf1{ $key_r }[2];
            my $trial_1 = $start_r - $end_c;
            my $trial_2 = $start_c - $end_r;
            if ($trial_1 >= 0) {
                push @sort_array, $trial_1;
        push @pointer_array, $key_r;
                next INNER;
            }
            elsif ($trial_2 >=0) {
                push @sort_array, $trial_2;
        push @pointer_array, $key_r;
                next INNER;
            }
            else { next INNER; }
        }
        my $array = \@sort_array; ## create array ref
    my $array_which = \@pointer_array; ## know which tf1 binding site
        my @closest = bubblesmart($array, $array_which); ## pipe to su
+b and get the closest value
        
    my $closest = shift @closest;
    my $pointer = shift @closest;

        if ($closest < 50) {
            print OUT "$tf2{ $key_c }[0]\t$tf2{ $key_c }[1]\ttf2\t$sta
+rt_c\t$end_c\t$tf2{ $key_c }[3]\t$tf2{ $key_c }[4]",
        "tf1\t$pointer\t$tf1{ $pointer }[2]\t$tf1{ $pointer }[3]\t$tf1
+{ $pointer }[4]\t";
        }
        else { next OUTER; }
    }
    ## Close the output file
    close OUT;
    print STDOUT scalar localtime;
} ## cycle through again til done with all files

## Code taken from Mastering Algorithms with Perl, with a small modifi
+cation
sub bubblesmart {
    my $array=shift;
    my $array_which = shift;
    my $start = 0; ## The start index of the bubbling scan.
    my $ncomp = 0; ## The number of comparisons.
    my $nswap = 0; ## The number of swaps.
    
    my $i = $#$array;
    
    while (1) {
        my $new_start;   ## The new start index of the bubbling scan.
        my $new_end = 0; ## The new end index of the bubbling scan.
        
        for (my $j = $start || 1; $j <= $i; $j++) {
            $ncomp++;
            if ($array->[ $j -1 ] gt $array->[ $j ]) {
                @$array[ $j, $j -1 ] = @$array[ $j - 1, $j ];
        @$array_which[ $j, $j -1 ] = @$array_which[ $j - 1, $j ];
                $nswap++;
                $new_end = $j - 1;
                $new_start = $j - 1 unless defined $new_start;
            
            }
        }
        last unless defined $new_start; ## No swaps: we're done.
        $i = $new_end;
        $start = $new_start;
    }
    my $return_value = pop @$array;
    my $return_point = pop @$array_which;
    #print STDOUT "bubblesmart: ", scalar @$array, " elements, $ncomp 
+comparisons, $nswap swaps\n";
    return $return_value, $return_point;
}
[download]

Bioinformatics

In reply to A more efficient sort or heap algorithm... by bioinformatics

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.