sirna has asked for the wisdom of the Perl Monks concerning the following question:

hi all , I have written a script to count the number of RNAs that have exact matches to the genome. the code works just fine. but the problem is I would like to improve the code make that work faster. the genome has 5 chromosomes. and they are all >= 3045655 (in length)
#!/usr/bin/perl -w use DBI; use strict; use Bio::SeqIO; use Bio::Perl; use GD::Graph::bars; $|=1; my $window_size = 500; my $step_size = 500; my $connect; my $database = 'Cloned_RNA_2_dev'; my $user ='root'; my $pass = ''; my $host = 'localhost'; my %sequences; my $total_length; my $length; open (RESULT, ">result.csv"); my $dsn = qq(DBI:mysql:database=$database; host=$host); $connect = DBI-> connect($dsn, $user, $pass,{printError =>1}) or die $ +DBI::errstr; #querying the database my $query = qq/select count(c.Rna_sequence_sequence) from Cloned_rna c, Hsp h where c.Rna_sequence_sequence=h.Rna_sequence_sequence and c.Project_idproject =72 and h.Sequence_db_idSequence_db =1133 and h.accession =? and h.hit_start >= ? and h.hit_end<=? group by h.accession/; my $sql = $connect->prepare($query); my %window_counts ; print RESULT "\"Chromosome\"\t\"start\"\t\"end\"\t\"counts\"\n" ; my $genome = Bio::SeqIO->new(-file=>'/home/shared-projects/sequence-db +s/TAIR7/TAIR7_nuclear_genome.fna', -format=>'fasta'); while (my $seq = $genome->next_seq()){ $length = $seq->length(); my $id =$seq->id(); warn "Processing on Chromosome $id\n"; #walking along the genome for(my $i=1;$i<=($length-$window_size); $i+=$step_size){ my $window_start = $i; my $window_end = $i+$window_size-1; my $sucess=($sql->execute($id, $window_start, $window_end)); die "query failed!\n $query \n" unless $sucess; my $window_id = "$id,$window_start" ; if (my @result = $sql->fetchrow_array() ) { ( $window_counts{$window_id})= @result; # has the counts as the + value of the hash warn "\$window_counts{$window_id}='$window_counts{$window_id}' +\n"; print RESULT "$id\t$window_start\t$window_end\t$window_counts{ +$window_id}\n" ; } # exit if $i > 50000; } }
is there any other way to do this??? thanks in advance.

Replies are listed 'Best First'.
Re: how to improve this?
by lima1 (Curate) on Nov 13, 2007 at 12:33 UTC
    Maybe you find Bio::Grep useful. The fastest solution for your task is probably an enhanced suffix array (for a very short introduction to simple suffix arrays see http://en.wikipedia.org/wiki/Suffix_array). It needs a few minutes (and probably a lot of RAM) to construct them, but then you can search in O(m) (m query length) for exact matches.

    Bio::Grep also supports the small tool GUUGle, which isn't that fast but does not require a precalculation and needs less RAM. In addition, it supports GU wobble pairs.

    Update: If you don't want to use this module, then you should at least fetch your queries once and then use index or maybe better a regex.

Re: how to improve this?
by moritz (Cardinal) on Nov 13, 2007 at 12:10 UTC
    First you have to know wich part of the program is slow. Is it the DB query? Or the output? or the perl code itself?

    To find the bottleneck I'd recommend reading the thread RFC: Profiling your code..

    When you know which part is the slowest one you can start to optimize it. But don't guess, you'll be wrong (usually).