in reply to A more efficient sort or heap algorithm...

If I'm right in thinking that your are only using a hand-coded bubblesort so that you can sort two parallel arrays together, then you are right in thinking that there is a better way. Actually several.

  1. One way would be to avoid using parallel arrays.

    Instead of:

    push @sort_array, $trial_1; push @pointer_array, $key_r;

    You build an Array of Arrays:

    push @sort_array, [ $trial_1, $key_r ];

    You can then sort both datasets using the (much faster) built-in sort routine like so:

    my @sorted = sort{ $a->[ 0 ] <=> $b->[ 0 ] } @sort_array;

    And get your $closest and $pointer values like this:

    my( $closest, $pointer ) = @{ $sorted[ 0 ] };
  2. The second way is to sort the indices by the values in the first array. These will then be applicable to the second array also. See the code below:

But you don't need to sort at all!

As you are only after the key associated with the closest pairing, you don't need to build any arrays, nor sort. You can just keep track of the lowest value (and associated key) as you traverse the datasets.

Also, as you are comparing all the keys from both datasets with each other, there is no need to sort the keys either. The following should run substantially more quickly:

#!/usr/bin/perl -slw use strict; use integer; use vars qw( %tf1 %tf2 ); ## Establish an array of all chromosome files. my @files = ("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", " +chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "ch +r16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chr +Y"); ## Loop through each file, calculate the overlaps, and output to file. foreach (@files) { print STDOUT scalar localtime; open TF1, "<", "TF1.$_.gff" or die "Cannot open the TF1 hit file!" +; ## tf1 hit file open TF2, "<", "TF2.$_.gff" or die "Cannot open the TF2 hit file!" +; ## tf2 hit file open OUT, ">", "both.$_.gff" or die "Cannot open the output file!" +; ## output file while (<TF1>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf1{ $new_start }[0] = $chr; $tf1{ $new_start }[1] = $type; ## Transcription factor $tf1{ $new_start }[2] = $new_end; $tf1{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf1{ $new_start }[4] = $strand; ## Strand where the match is +found } while (<TF2>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf2{ $new_start }[0] = $chr; $tf2{ $new_start }[1] = $type; ## Transcription factor $tf2{ $new_start }[2] = $new_end; $tf2{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf2{ $new_start }[4] = $strand; ## Strand where the match is +found } ## Close the unused files close TF1; close TF2; foreach my $key_c ( keys %tf2 ) { my( $best, $best_key ) = ( 1e99, '' ); my $start_c = $key_c; my $end_c = $tf2{ $key_c }[2]; foreach my $key_r ( keys %tf1) { my $start_r = $key_r; my $end_r = $tf1{ $key_r }[2]; my $trial_1 = $start_r - $end_c; my $trial_2 = $start_c - $end_r; if ($trial_1 >= 0) { if( $trial_1 < $best ) { $best = $trial_1; $best_key = $key_r; } } elsif ($trial_2 >=0) { if( $trial_2 < $best ) { $best = $trial_2; $best_key = $key_r; } } } if( $best < 50 ) { print OUT join "\t", $tf2{ $key_c }[0], $tf2{ $key_c }[1], 'tf2', $start_c, + $end_c, $tf2{ $key_c }[3], $tf2{ $key_c }[4], 'tf1', $best_key +, $tf1{ $best_key}[2], $tf1{ $best_key}[3], $tf1{ $best_ +key}[4]; } } ## Close the output file close OUT; print STDOUT scalar localtime; } ## cycle through again til done with all files

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^2: A more efficient sort or heap algorithm...
by bioinformatics (Friar) on Apr 28, 2009 at 22:55 UTC
    Your advice is fantastic as per usual. The array of arrays does make a lot more sense, suggestion #3 makes even more than the current stuff. I'll try both to get a better feel how each performs, and try and be a little more clever next time. Thanks :).
    Bioinformatics
      You could also take a look at how GFF files are stored in Bio::DB::GFF or Bio::DB::SeqFeature::Store which use RDMS to handle these types of range queries (mysql,postgres, sqlite) and an in-memory berkeleydb implementation. The get_features_by_location is basically what you'd want to use for your problem.