Re: Up for Critique

First things first - the script is very readable. I'm impressed! :-) Also, there is almost enough commenting. *winks*

To answer your question - yes, there are a number of places you could optimize for speed (and maintainability!).

The biggest one (for both aspects) is to declare your variables when you use them. This improves maintainability because you're scoping variables to only where you need them. This improves speed because you're not having to clear the arrays each time.

### FILL ACHROMS TABLE ###########################################

  # Take in whole records separated by five dashes instead of a
  # newline character.
  $/ = "-----";
  while($record = <INFILE>)
  {
    # Print out a counter to show how far the filling has progressed.
#    print "$n\n";
#    $n++;

    @newrecord = ();
    # Send the record to have any blank lines removed.
    @newrecord = CleanData( $record );
-----
### FILL ACHROMS TABLE ###########################################

  # Take in whole records separated by five dashes instead of a
  # newline character.
  $/ = "-----";
  while(my $record = <INFILE>)
  {
    # Print out a counter to show how far the filling has progressed.
#    print "$n\n";
#    $n++;

    # Send the record to have any blank lines removed.
    my @newrecord = CleanData( $record );
[download]

The second is to pass around references, not the whole array itself. For example:
```
    @newrecord = CleanData( $record );
----
    my $newrecord = CleanData($record);
[download]
```
There, $newrecord is a reference to an array. You'd get your stuff by using $newrecord->[5] instead of $newrecord[5]. If your arrays are large, this is a signficant savings.

CleanData() should be using grep. Something like:

sub CleanData
{
    my ($record, @cleanrecord, $line, @record);
    $record = $_[0];

    ### Separate the record by newlines.
    @record = split /\n/, $record;

    ### Remove empty lines if they exist.
    foreach $line ( @record )
    {
      chomp( $line );
      next if ( $line =~ m/^\s*$/ );
      push @cleanrecord, $line;
    }
    return ( @cleanrecord );
}
----
sub CleanData
{
    my $record = shift;

    my @cleanrecord = grep { /\S/ } chomp(split($/, $record));

    return \@cleanrecord;
}
[download]

That is much faster and makes much clearer what you're doing. The $/ is the newline separator for your system. It makes your script portable to Windows or Mac (for example).

Use array slices. For example:

    $sth_update_exon -> execute( @exondata[1,2,3,4],
                                 $geneid, $exondata[0] );
[download]

They're faster as well as more readable.

Don't use fetchrow_array(). Use fetch() with bind_columns() instead. This is how the DBI documentation recommends to do things the fastest. In addition, it seems like you're just using fetchrow_array() to find out if anything matched at all. Using fetch() is definitely quicker in that instance. (There's probably an even faster way just to check for existence, but I don't know it.)

In ParseAChroms(), you're doing a regex against the same variable over and over. In addition, most of those regexes look to be identical. This makes me suspicious. Try something like:

sub ParseAChroms {
    my $achromref = shift;

    my $matches = join '|', (
        'BAC:',
        'LENGTH:',
        'OSOME:',
        'CHR_START:',
        'CHR_END:',
        'BAC_START:',
        'BAC_END:',
        'ORIENTATION:',
    );

    my @achrom;
    my $line = join '', @$achromref;
    push @achrom, $line =~ /\b([BY]AC\b/;
    push @achrom, $line =~ /(?:$matches)\s*(.*?)\s*/g;

    return \@achrom;
}
[download]

Now, this assumes that all your stuff is going to be in the right order. If you can't, don't use an array - use a hash.

sub ParseAChroms {
    my $achromref = shift;

    my $matches = join '|', (
        'BAC:',
        'LENGTH:',
        'OSOME:',
        'CHR_START:',
        'CHR_END:',
        'BAC_START:',
        'BAC_END:',
        'ORIENTATION:',
    );

    my $line = join '', @$achromref;
    my %achrom = $line =~ /($matches)\s*(.*?)\s*/og;
    $achrom{TYPE} = $line =~ /\b([BY]AC\b/o;

    return \%achrom;
[download]

Now, what's even cooler is that you can access stuff by name, not array index. This is important if you ever plan on changing your input file stuff, maybe adding a new thing you want to track?

You could even make that regex ahead of time by using qr//, but I'll leave that as an exercise to the reader. And, yes, ParseGenes(), ParseExon(), and ParseSubExon() should be treated the same way.

A few style notes:
- Your variable names could be improved with either capitalization (like your subroutine names) or underscores. Doesn't matter which one, just be consistent.
- When you override $/ (or any similar variable), use local and do it within a block. That makes for good maintainable code.
- Instead of making a variable for each sth, make a hash of sths and access the one you need. Even better would be to have a subroutine somewhere that does this for you.
- Don't use a variable named temp. Ever. You can find a better name. Also, when you do, it looks like your just using it as a boolean test. If so, then just do the thing in the boolean test. (if and while take boolean tests, in case you're wondering.)
- Don't use the same variable name for both a scalar and an array (such as the aforementioned temp). It's confusing and leads to (sometimes subtle) bugs.
- Use heredocs with your SQL statements for prepare(). It's more readable. And, put it into a subroutine (preferably in some other module!) Something like:
```
my $match_achrom_statement = << STH_MATCH;
SELECT ac_id
  FROM achroms
 WHERE ac_id = ?
STH_MATCH

my $sth_match_achrom = $dbh->prepare($match_achrom_statement);
[download]
```

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Comment on Re: Up for Critique Select or Download Code