bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I've written a bit of code to try and locate possible matches between two large lists of values. Specifically, these values are genomic locations/regions, and I'm looking for all that are less than 50 (base-pairs) away. With more than 100,000 values per list, this can be a bit time consuming in CPU cycles, and I'm experimenting to find a better method (read faster) to do this, as I have many such files to make this comparison for. My current code is below, and I welcome your thoughts on optimization, as I might be better off implementing a heap rather than a sort function (which takes a really long time, although a different sort function might be the answer as well). Thanks :).
#!/usr/bin/perl -slw use strict; use integer; use vars qw( %tf1 %tf2 ); ## Establish an array of all chromosome files. my @files = ("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", " +chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "ch +r16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chr +Y"); ## Loop through each file, calculate the overlaps, and output to file. foreach (@files) { print STDOUT scalar localtime; open TF1, "<", "TF1.$_.gff" or die "Cannot open the TF1 hit file!" +; ## tf1 hit file open TF2, "<", "TF2.$_.gff" or die "Cannot open the TF2 hit file!" +; ## tf2 hit file open OUT, ">", "both.$_.gff" or die "Cannot open the output file!" +; ## output file while (<TF1>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf1{ $new_start }[0] = $chr; $tf1{ $new_start }[1] = $type; ## Transcription factor $tf1{ $new_start }[2] = $new_end; $tf1{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf1{ $new_start }[4] = $strand; ## Strand where the match is +found } while (<TF2>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf2{ $new_start }[0] = $chr; $tf2{ $new_start }[1] = $type; ## Transcription factor $tf2{ $new_start }[2] = $new_end; $tf2{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf2{ $new_start }[4] = $strand; ## Strand where the match is +found } ## Close the unused files close TF1; close TF2; OUTER:foreach my $key_c (sort keys %tf2) { my @sort_array = (); my @pointer_array = (); my $start_c = $key_c; my $end_c = $tf2{ $key_c }[2]; INNER:foreach my $key_r (sort keys %tf1) { my $start_r = $key_r; my $end_r = $tf1{ $key_r }[2]; my $trial_1 = $start_r - $end_c; my $trial_2 = $start_c - $end_r; if ($trial_1 >= 0) { push @sort_array, $trial_1; push @pointer_array, $key_r; next INNER; } elsif ($trial_2 >=0) { push @sort_array, $trial_2; push @pointer_array, $key_r; next INNER; } else { next INNER; } } my $array = \@sort_array; ## create array ref my $array_which = \@pointer_array; ## know which tf1 binding site my @closest = bubblesmart($array, $array_which); ## pipe to su +b and get the closest value my $closest = shift @closest; my $pointer = shift @closest; if ($closest < 50) { print OUT "$tf2{ $key_c }[0]\t$tf2{ $key_c }[1]\ttf2\t$sta +rt_c\t$end_c\t$tf2{ $key_c }[3]\t$tf2{ $key_c }[4]", "tf1\t$pointer\t$tf1{ $pointer }[2]\t$tf1{ $pointer }[3]\t$tf1 +{ $pointer }[4]\t"; } else { next OUTER; } } ## Close the output file close OUT; print STDOUT scalar localtime; } ## cycle through again til done with all files ## Code taken from Mastering Algorithms with Perl, with a small modifi +cation sub bubblesmart { my $array=shift; my $array_which = shift; my $start = 0; ## The start index of the bubbling scan. my $ncomp = 0; ## The number of comparisons. my $nswap = 0; ## The number of swaps. my $i = $#$array; while (1) { my $new_start; ## The new start index of the bubbling scan. my $new_end = 0; ## The new end index of the bubbling scan. for (my $j = $start || 1; $j <= $i; $j++) { $ncomp++; if ($array->[ $j -1 ] gt $array->[ $j ]) { @$array[ $j, $j -1 ] = @$array[ $j - 1, $j ]; @$array_which[ $j, $j -1 ] = @$array_which[ $j - 1, $j ]; $nswap++; $new_end = $j - 1; $new_start = $j - 1 unless defined $new_start; } } last unless defined $new_start; ## No swaps: we're done. $i = $new_end; $start = $new_start; } my $return_value = pop @$array; my $return_point = pop @$array_which; #print STDOUT "bubblesmart: ", scalar @$array, " elements, $ncomp +comparisons, $nswap swaps\n"; return $return_value, $return_point; }
Bioinformatics

Replies are listed 'Best First'.
Re: A more efficient sort or heap algorithm...
by BrowserUk (Patriarch) on Apr 28, 2009 at 22:42 UTC

    If I'm right in thinking that your are only using a hand-coded bubblesort so that you can sort two parallel arrays together, then you are right in thinking that there is a better way. Actually several.

    1. One way would be to avoid using parallel arrays.

      Instead of:

      push @sort_array, $trial_1; push @pointer_array, $key_r;

      You build an Array of Arrays:

      push @sort_array, [ $trial_1, $key_r ];

      You can then sort both datasets using the (much faster) built-in sort routine like so:

      my @sorted = sort{ $a->[ 0 ] <=> $b->[ 0 ] } @sort_array;

      And get your $closest and $pointer values like this:

      my( $closest, $pointer ) = @{ $sorted[ 0 ] };
    2. The second way is to sort the indices by the values in the first array. These will then be applicable to the second array also. See the code below:

    But you don't need to sort at all!

    As you are only after the key associated with the closest pairing, you don't need to build any arrays, nor sort. You can just keep track of the lowest value (and associated key) as you traverse the datasets.

    Also, as you are comparing all the keys from both datasets with each other, there is no need to sort the keys either. The following should run substantially more quickly:

    #!/usr/bin/perl -slw use strict; use integer; use vars qw( %tf1 %tf2 ); ## Establish an array of all chromosome files. my @files = ("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", " +chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "ch +r16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chr +Y"); ## Loop through each file, calculate the overlaps, and output to file. foreach (@files) { print STDOUT scalar localtime; open TF1, "<", "TF1.$_.gff" or die "Cannot open the TF1 hit file!" +; ## tf1 hit file open TF2, "<", "TF2.$_.gff" or die "Cannot open the TF2 hit file!" +; ## tf2 hit file open OUT, ">", "both.$_.gff" or die "Cannot open the output file!" +; ## output file while (<TF1>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf1{ $new_start }[0] = $chr; $tf1{ $new_start }[1] = $type; ## Transcription factor $tf1{ $new_start }[2] = $new_end; $tf1{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf1{ $new_start }[4] = $strand; ## Strand where the match is +found } while (<TF2>) { chomp; my ($location, undef, $type, $start, $end, $score, $strand, un +def) = split "\t"; ## Parse my ($chr, $location_numbers) = split "_", $location; + ## Get the chromosome my ($target_start, $target_end) = split "-", $location_numbers +; ## Get the start site for the 10kb sequence "blasted" against my $new_start = $target_start + $start; ## Use this for true +genomic location my $new_end = $target_start + $end; ## Use thie for true +genomic location ## Build a hash and put the data inside; $tf2{ $new_start }[0] = $chr; $tf2{ $new_start }[1] = $type; ## Transcription factor $tf2{ $new_start }[2] = $new_end; $tf2{ $new_start }[3] = $score; ## Alignment score from PWM bl +ast $tf2{ $new_start }[4] = $strand; ## Strand where the match is +found } ## Close the unused files close TF1; close TF2; foreach my $key_c ( keys %tf2 ) { my( $best, $best_key ) = ( 1e99, '' ); my $start_c = $key_c; my $end_c = $tf2{ $key_c }[2]; foreach my $key_r ( keys %tf1) { my $start_r = $key_r; my $end_r = $tf1{ $key_r }[2]; my $trial_1 = $start_r - $end_c; my $trial_2 = $start_c - $end_r; if ($trial_1 >= 0) { if( $trial_1 < $best ) { $best = $trial_1; $best_key = $key_r; } } elsif ($trial_2 >=0) { if( $trial_2 < $best ) { $best = $trial_2; $best_key = $key_r; } } } if( $best < 50 ) { print OUT join "\t", $tf2{ $key_c }[0], $tf2{ $key_c }[1], 'tf2', $start_c, + $end_c, $tf2{ $key_c }[3], $tf2{ $key_c }[4], 'tf1', $best_key +, $tf1{ $best_key}[2], $tf1{ $best_key}[3], $tf1{ $best_ +key}[4]; } } ## Close the output file close OUT; print STDOUT scalar localtime; } ## cycle through again til done with all files

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Your advice is fantastic as per usual. The array of arrays does make a lot more sense, suggestion #3 makes even more than the current stuff. I'll try both to get a better feel how each performs, and try and be a little more clever next time. Thanks :).
      Bioinformatics
        You could also take a look at how GFF files are stored in Bio::DB::GFF or Bio::DB::SeqFeature::Store which use RDMS to handle these types of range queries (mysql,postgres, sqlite) and an in-memory berkeleydb implementation. The get_features_by_location is basically what you'd want to use for your problem.
Re: A more efficient sort or heap algorithm...
by BrowserUk (Patriarch) on Apr 28, 2009 at 21:56 UTC
      I'm still learning about efficiency (obviously), but the benefit of using the hand coded sort was to allow me to have two arrays sorted the same way (but based on the first, if that makes sense), with the second array containing a hash key leading to the data for the corresponding region. I wasn't sure how to implement that using the built in sort function. It allows me to keep track of and print out the information for both regions.
      Bioinformatics
        I wasn't sure how to implement that using the built in sort function. It allows me to keep track of and print out the information for both regions.

        Just try to sort the indices and then map both arrays by them:

        my @ind_srtd = sort { $array->[$a] cmp $array->[$b] } (0..$#$array); my @array_srtd = map { $array->[$_] } @ind_srtd; my @array_which_srtd = map { $array_which->[$_] } @ind_srtd;

        I didn't tested it thoroughly, but should look something like this.

Re: A more efficient sort or heap algorithm...
by jethro (Monsignor) on Apr 28, 2009 at 21:45 UTC
    If you want input from everyone instead of only the biologists you probably have to describe your data and what '50 base-pairs away' means in the context of that data.
      Point well taken. The gist of the problem is that I have two large integers that I need to parse from each line in a file, and then compare the integers from each file. The base-pair comment denoted (in my head anyway) that these integers reference a position in the genome.
      Bioinformatics