in reply to Up for Critique
First things first - the script is very readable. I'm impressed! :-) Also, there is almost enough commenting. *winks*
To answer your question - yes, there are a number of places you could optimize for speed (and maintainability!).
- The biggest one (for both aspects) is to declare your variables when you use them. This improves maintainability because you're scoping variables to only where you need them. This improves speed because you're not having to clear the arrays each time.
### FILL ACHROMS TABLE ###########################################
# Take in whole records separated by five dashes instead of a
# newline character.
$/ = "-----";
while($record = <INFILE>)
{
# Print out a counter to show how far the filling has progressed.
# print "$n\n";
# $n++;
@newrecord = ();
# Send the record to have any blank lines removed.
@newrecord = CleanData( $record );
-----
### FILL ACHROMS TABLE ###########################################
# Take in whole records separated by five dashes instead of a
# newline character.
$/ = "-----";
while(my $record = <INFILE>)
{
# Print out a counter to show how far the filling has progressed.
# print "$n\n";
# $n++;
# Send the record to have any blank lines removed.
my @newrecord = CleanData( $record );
- The second is to pass around references, not the whole array itself. For example:
@newrecord = CleanData( $record );
----
my $newrecord = CleanData($record);
There, $newrecord is a reference to an array. You'd get your stuff by using $newrecord->[5] instead of $newrecord[5]. If your arrays are large, this is a signficant savings.
- CleanData() should be using grep. Something like:
sub CleanData
{
my ($record, @cleanrecord, $line, @record);
$record = $_[0];
### Separate the record by newlines.
@record = split /\n/, $record;
### Remove empty lines if they exist.
foreach $line ( @record )
{
chomp( $line );
next if ( $line =~ m/^\s*$/ );
push @cleanrecord, $line;
}
return ( @cleanrecord );
}
----
sub CleanData
{
my $record = shift;
my @cleanrecord = grep { /\S/ } chomp(split($/, $record));
return \@cleanrecord;
}
That is much faster and makes much clearer what you're doing. The $/ is the newline separator for your system. It makes your script portable to Windows or Mac (for example).
- Use array slices. For example:
$sth_update_exon -> execute( @exondata[1,2,3,4],
$geneid, $exondata[0] );
They're faster as well as more readable.
- Don't use fetchrow_array(). Use fetch() with bind_columns() instead. This is how the DBI documentation recommends to do things the fastest. In addition, it seems like you're just using fetchrow_array() to find out if anything matched at all. Using fetch() is definitely quicker in that instance. (There's probably an even faster way just to check for existence, but I don't know it.)
- In ParseAChroms(), you're doing a regex against the same variable over and over. In addition, most of those regexes look to be identical. This makes me suspicious. Try something like:
sub ParseAChroms {
my $achromref = shift;
my $matches = join '|', (
'BAC:',
'LENGTH:',
'OSOME:',
'CHR_START:',
'CHR_END:',
'BAC_START:',
'BAC_END:',
'ORIENTATION:',
);
my @achrom;
my $line = join '', @$achromref;
push @achrom, $line =~ /\b([BY]AC\b/;
push @achrom, $line =~ /(?:$matches)\s*(.*?)\s*/g;
return \@achrom;
}
Now, this assumes that all your stuff is going to be in the right order. If you can't, don't use an array - use a hash.
sub ParseAChroms {
my $achromref = shift;
my $matches = join '|', (
'BAC:',
'LENGTH:',
'OSOME:',
'CHR_START:',
'CHR_END:',
'BAC_START:',
'BAC_END:',
'ORIENTATION:',
);
my $line = join '', @$achromref;
my %achrom = $line =~ /($matches)\s*(.*?)\s*/og;
$achrom{TYPE} = $line =~ /\b([BY]AC\b/o;
return \%achrom;
Now, what's even cooler is that you can access stuff by name, not array index. This is important if you ever plan on changing your input file stuff, maybe adding a new thing you want to track?
You could even make that regex ahead of time by using qr//, but I'll leave that as an exercise to the reader. And, yes, ParseGenes(), ParseExon(), and ParseSubExon() should be treated the same way.
- A few style notes:
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.
Re: Re: Up for Critique
by biograd (Friar) on Mar 23, 2002 at 06:52 UTC
|
Heh...yeah, Dragonchild, they want MORE documentation than
this. (!) That's OK. I'm basically a language oriented sort of
person anyway. Documenting doesn't bother me much.
You'll notice the date on the script is from last year. There
are some things I do now that I didn't do then, such as initializing
the locally scoped variables ("my") where they are used
instead of way up at the top. Most of the suggestions I see
in your note and others I just wouldn't have thought of yet
though.
So, thanks to all for the ton of good suggestions. I'm not sure how many
I'll have time to implement before my deadline, but this
is a great help for improving my skill level.
-Jenn | [reply] [Watch: Dir/Any] |
|
|