Genome UV Mutation Script -- Critique

c4onastick has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks, I've just finished my second (larger) perl project. Since I'm still new to perl I thought I'd post it here to get some feedback to help in my learning (this is my second post).

The problem: I'm working on UV mutations in bacteria in my lab. I was bored with my laptop (dangerous combination) one day so I asked myself, "I wonder how many possible dimerization (mutation) sites are in the bacteria's genome?" A few quick regexes later, I had an answer (quicker than I thought too, since the genome I was analyzing had about 4.7 million base pairs). I decided to take it a step further and see which genes had the highest probability (number of sites) for mutation. This took a little longer (and looks uglier, IMHO).

I'd like you, my fellow monks, to critique my work. How did I do? How can I make it better, more efficient, and cleaner? This is the first thing I've written (and I do realize that there's gotta be some other program out there to do this, but where's the fun in that!) that could be of use to other people, so I'd like to make it as robust as possible. Thanks in advance!

The genomic data that goes in to the script looks like this (I wont post the whole thing, its huge), from PubMed:

     gene            337..2799
                     /gene="thrA"
                     /locus_tag="t0002"
                     /db_xref="GeneID:1066974"
     CDS             337..2799
                     /gene="thrA"
                     /locus_tag="t0002"
                     /note="multifunctional homotetrameric enzyme that
                     catalyzes the phosphorylation of aspartate to for
+m
                     aspartyl-4-phosphate as well as conversion of asp
+artate
                     semialdehyde to homoserine; functions in a number
+ of amino
                     acid biosynthetic pathways"
                     /codon_start=1
                     /transl_table=11
                     /product="bifunctional aspartokinase I/homeserine
                     dehydrogenase I"
                     /protein_id="NP_803887.1"
                     /db_xref="GI:29140545"
                     /db_xref="GeneID:1066974"
                     /translation="MRVLKFGGTSVANAERFLRVADILESNSRQGQVAT
+VLSAPAKIT
                     NHLVAMIEKTIGGQDALPNISDAERIFSDLLAGLASAQPGFPLARLKMV
+VEQEFAQIK
                     HVLHGISLLGQCPDSINAALICRGEKMSIAIMAGLLEARGHRVTVIDPV
+EKLLAVGHY
                     LESTVDIAESTRRIAASQIPADHMILMAGFTAGNEKGELVVLGRNGSDY
+SAAVLAACL
                     RADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHP
+RTITPIAQF
                     QIPCLIKNTGNPQAPGTLIGASSDDDNLPVKGISNLNNMAMFSVSGPGM
+KGMIGMAAR
                     VFAAMSRAGISVVLITQSSSEYSISFCVPQSDCARARRAMQDEFYLELK
+EGLLEPLAV
                     TERLAIISVVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISV
+VVNNDDATT
                     GVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQTWLKNKHIDLR
+VCGVANSKA
                     LLTNVHGLNLDNWQAELAQANAPFNLGRLIRLVKEYHLLNPVIVDCTSS
+QAVADQYAD
                     FLREGFHVVTPNKKANTSSMDYYHQLRFAAAQSRRKFLYDTNVGAGLPV
+IENLQNLLN
                     AGDELQKFSGILSGSLSFIFGKLEEGMSLSQATALAREMGYTEPDPRDD
+LSGMDVARK
                     LLILARETGRELELSDIVIEPVLPDEFDASGDVTAFMAHLPQLDDAFAA
+RVAKARDEG
                     KVLRYVGNIEEDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLP
+LVLRGYGAG
                     NDVTAAGVFADLLRTLSWKLGV"
     gene            2801..3730
                     /gene="thrB"
                     /locus_tag="t0003"
                     /db_xref="GeneID:1066981"
     CDS             2801..3730
                     /gene="thrB"
                     /locus_tag="t0003"
                     /note="catalyzes the formation of O-phospho-L-hom
+oserine
                     from L-homoserine in threonine biosynthesis from 
+asparate"
                     /codon_start=1
                     /transl_table=11
                     /product="homoserine kinase"
                     /protein_id="NP_803888.1"
                     /db_xref="GI:29140546"
                     /db_xref="GeneID:1066981"
                     /translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGTLLGDV
+VSVEAADHF
                     RLHNLGRFADKLPPEPRENIVYQCWERFCQALGKTIPVAMTLEKNMPIG
+SGLGSSACS
                     VVAALVAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGM
+QLMIEENGI
                     ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAG
+FIHACYSRQ
                     PQLAAALMKDVIAEPYRARLLPGFSQARQAVSEIGALASGISGSGPTLF
+ALCDKPETA
                     QRVADWLSKHYLQNQEGFVHICRLDTAGARVVG"
[download]

And the actual genome (found at the bottom of the file) looks like this:

ORIGIN      
        1 agagattacg tctggttgca agagatcata acaggggaaa ttgattgaaa ataaa
+tatat
       61 cgccagcagc acatgaacaa gtttcggaat gtgatcaatt taaaaattta ttgac
+ttagg
      121 cgggcagata ctttaaccaa tataggaata caagacagac aaataaaaat gacag
+agtac
      181 acaacatcca tgaaccgcat cagcaccacc accattacca ccatcaccat tacca
+caggt
      ...
  4791781 acgcgcgcgc cttttacgcc tgctaaccac tctggaggcg gccgatgacc acaaa
+ttaac
  4791841 cgactggcta caacagcgaa tcggcctgct gggacagcga gatacggcaa tgttg
+caccg
  4791901 tttggtccat gatattgaaa aaaaactaac aaaataacgt gttgtaattt ttaaa
+ataat
  4791961 a
//
[download]

The script, finds all possible locations for dimerization (mutation) in the genome, finds all possible locations for dimerization (mutation) in each gene and calculates the percentatge chance that a mutation in the genome is in that gene (I built my first hash of hashes for this) then lists the genes (in either straight text for the console or simple html) in descending probability of mutation.

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Std;
local $/;

our %opts;
getopts('hf:', \%opts);

die("Usage: uv_mutant.pl -f <file>.gbk\nAdd -h for html output\n") unl
+ess $opts{f};
my $file = $opts{f};
my $genome;
my $total_mutations;

open(FH, $file) or die "File couldn't be opened";
my $contents = <FH>;
close(FH);

#Extract the entire genome
$contents =~ m#ORIGIN(.+?)//#s or die "No genome data found.";
$genome = $1;

# Remove extraneous characters, make it one big long string to use sub
+str position on it
$genome =~ s/[\d\s]+//g;

# Calculate total possible mutations
while( $genome =~ /[ct](?=[ct])/g ) {
    $total_mutations++;
}

#print "\nTotal possible mutations (pyramidine dimerizations): $total_
+mutations\n\n";

# Extract all the gene definitions, end at protein translation.
my @genes;
@genes = $contents =~ m#(?<!/)gene.+?/translation#sg;

my %mutant_genes;

# Set up output
#printf "%-20s%-10s%-25s%-15s%s\n", "Gene", "GeneID", "Possible Mutati
+ons", "Probability", "Protein Product";
#print "=" x 85, "\n";

foreach my $gene (@genes) {
    # filter out complement genes
    if( $gene =~ m/gene\s+(\d+)\.\.(\d+)/ ) {
        my $gene_mutations = 0;
        my $gene_name;
        my $geneid = "Unknown"; #GeneID for online db
        my $gene_product = "Unknown"; #Encoded protein

        my $start = $1; $start--; #increment down for substr
        my $end = $2; $end--; #increment down for substr
        my $length = $end - $start;
        ($gene_name) = $gene =~ m#/gene="([^"]+)"#;
        unless( defined($gene_name) ) {
            $gene_name = "Unknown$length";
        }
        my $sub_genome = substr($genome, $start, $length);
        $gene_mutations = () = $sub_genome =~/[ct](?=[ct])/g;
        my $probability = $gene_mutations/$total_mutations*100;
        
        #Pull out GeneID (if exists)
        if( $gene =~ m#/db_xref="GeneID:(\d+)"# ) {
            $geneid = $1;
        }
        #Pull out Protein Product, if exists
        if( $gene =~ m#/product="([^"]+)"# ) {
            $gene_product = $1;
            $gene_product =~ s/\n\s*/ /g; #Clear out newlines and inde
+ntation
        }

        $mutant_genes{$gene_name} = {    gene_id => $geneid,
                        prod_pro => $gene_product,
                        gene_mutants => $gene_mutations,
                        mutant_prob => $probability
                        };

        #printf "%-20s%-10d%-25d%.5f%% %s\n", $gene_name, $geneid, $ge
+ne_mutations, $probability, $gene_product;
    }
}


if($opts{h}) {
    html_out($total_mutations, %mutant_genes)
}else{
    print "UV Mutation (pyramidine dimerization) Analysis\n";
    print "Total possible mutations in genome: $total_mutations\n\n";
    print "\nGenes sorted by UV mutation probability:\n", "=" x 65, "\
+n";
    
    foreach (sort by_descending_probability keys %mutant_genes) {
        printf "%-20s%.5f%%   %s\n", $_, $mutant_genes{$_}{mutant_prob
+}, $mutant_genes{$_}{prod_pro};
    }
}

sub by_descending_probability {
    $mutant_genes{$b}{mutant_prob} <=> $mutant_genes{$a}{mutant_prob};
}

sub html_out {
    my $total_muts = shift;
    print "<html>\n<head>\n<body>\n";
    print "<h1>UV Mutant Analysis</h1>\n";
    print "<b>Total Possible mutations in Genome: $total_muts </b><br 
+/>\n";
    print "<b>Gene mutations sorted by decending probability of mutati
+on</b><br />\n";
    print "<table>\n<tr><td>Gene</td><td>Possible Gene Mutations</td><
+td>Mutation Probability (%)</td><td>Gene Product</td></tr>\n";
    #my %mutant_genes = shift; #Gives an odd numbered hash assignment 
+error when prototyped
    foreach (sort by_descending_probability keys %mutant_genes) {
        print "<tr><td><a href=\"http://www.ncbi.nlm.nih.gov/entrez/qu
+ery.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=$mutant_genes{$
+_}{gene_id}\">$_</a></td><td>$mutant_genes{$_}{gene_mutants}</td><td>
+$mutant_genes{$_}{mutant_prob}</td><td>$mutant_genes{$_}{prod_pro}</t
+d></tr>\n";
    }
    print "</table>\n";
    print "</body>\n</html>";
}
[download]

Other than a prototype error when I prototyped the html_out sub it works great. Thanks again!

UPDATE: Thank you again for all your comments. I've adopted most of your suggestions (separated out the regex analysis into a sub, used a reference for html_out, other little fixes) as well as expand it to analyze the type of dimerization (TT, CT, or CC) since CC and CT dimerizations occur much less frequently than TT dimerizations.

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Std;
use Number::Format qw(:subs);
undef $/;

my $num_precision = 5;

our %opts;
getopts('hf:', \%opts);

die("Usage: uv_mutant.pl -f <file>.gbk\nAdd -h for html output\n") unl
+ess $opts{f};
my $file = $opts{f};
my $genome;

open(FH, '<', $file) or die "File couldn't be opened: $!";
my $contents = <FH>;
close(FH);

#Extract the entire genome
$contents =~ m#ORIGIN(.+?)//#s or die "No genome data found.";
$genome = $1;

# Remove extraneous characters, make it one big long string to use sub
+str position on it
$genome =~ s/[\d\r\n\s]+//g;

# Calculate total possible mutations
my %mutations = find_possible_mutations($genome);

# Extract all the gene definitions, end at protein translation.
my @genes;
@genes = $contents =~ m#(?<!/)gene.+?/translation#sg;

my %mutant_genes;

# Set up output
#printf "%-20s%-10s%-25s%-15s%s\n", "Gene", "GeneID", "Possible Mutati
+ons", "Probability", "Protein Product";
#print "=" x 85, "\n";

foreach my $gene (@genes) {
    # filter out complement genes
    if( $gene =~ m/gene\s+(\d+)\.\.(\d+)/ ) {
        my $gene_name;
        my $geneid = "Unknown"; #GeneID for online db
        my $gene_product = "Unknown"; #Encoded protein

        my $start = $1 - 1; #$start--; #increment down for substr
        my $end = $2 - 1; #$end--; #increment down for substr
        my $length = $end - $start;
        ($gene_name) = $gene =~ m#/gene="([^"]+)"#;
        unless( defined($gene_name) ) {
            $gene_name = "Unknown$length";
        }

        my $sub_genome = substr($genome, $start, $length);
        my %gene_mutations = find_possible_mutations($sub_genome);

        my %probability = (     ptt => format_number($gene_mutations{t
+t}/$mutations{tt}*100, $num_precision),
                    pct => format_number($gene_mutations{ct}/$mutation
+s{ct}*100, $num_precision),
                    pcc => format_number($gene_mutations{cc}/$mutation
+s{cc}*100, $num_precision),
                    ptotal => format_number($gene_mutations{total}/$mu
+tations{total}*100, $num_precision)
                );
        
        #Pull out GeneID (if exists)
        if( $gene =~ m#/db_xref="GeneID:(\d+)"# ) {
            $geneid = $1;
        }
        #Pull out Protein Product, if exists
        if( $gene =~ m#/product="([^"]+)"# ) {
            $gene_product = $1;
            $gene_product =~ s/\n\s*/ /g; #Clear out newlines and inde
+ntation
        }

        $mutant_genes{$gene_name} = {    gene_id => $geneid,
                        prod_pro => $gene_product,
                        %gene_mutations,
                        %probability
                        };
    }
}


if($opts{h}) {
    html_out($mutations{total}, \%mutant_genes)
}else{
    print "UV Mutation (pyramidine dimerization) Analysis\n";
    print "Total possible mutations in genome: $mutations{total}\n\n";
    print "\nGenes sorted by UV mutation probability:\n", "=" x 65, "\
+n";
    
    foreach (sort by_descending_probability keys %mutant_genes) {
        printf "%-20s%.5f%%   %s\n", $_, $mutant_genes{$_}{ptt}, $muta
+nt_genes{$_}{prod_pro};
    }
}

sub by_descending_probability {
    $mutant_genes{$b}{ptt} <=> $mutant_genes{$a}{ptt};
}

sub html_out {
    my ($total_muts, $mutant_ref) = @_;
    my %mutant_genes = %{$mutant_ref};
    print "<html>\n<head>\n<body>\n";
    print "<h1>UV Mutant Analysis</h1>\n";
    print "<b>Total Possible mutations in Genome: $total_muts</b><br /
+>\n";
    print "<b>Gene mutations sorted by decending probability of mutati
+on</b><br />\n";
    print "<table>\n<tr><td>Gene</td><td>Possible Gene Mutations TT</t
+d><td>Possible Gene Mutations CT</td><td>Possible Gene Mutations CC</
+td><td>Mutation Probability TT(%)</td><td>Mutation Probability CT(%)<
+/td><td>Mutation Probability CC(%)</td><td>Gene Product</td></tr>\n";
    #my %mutant_genes = shift; #Gives an odd numbered hash assignment 
+error when prototyped
    foreach (sort by_descending_probability keys %mutant_genes) {
        print "<tr><td><a href=\"http://www.ncbi.nlm.nih.gov/entrez/qu
+ery.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=$mutant_genes{$
+_}{gene_id}\">$_</a></td><td>$mutant_genes{$_}{tt}</td><td>$mutant_ge
+nes{$_}{ct}</td><td>$mutant_genes{$_}{cc}</td><td>$mutant_genes{$_}{p
+tt}</td><td>$mutant_genes{$_}{pct}</td><td>$mutant_genes{$_}{pcc}</td
+><td>$mutant_genes{$_}{prod_pro}</td></tr>\n";
    }
    print "</table>\n";
    print "</body>\n</html>";
}

sub find_possible_mutations {
    my $genome = shift;
    my %mutations = (     tt => 0,
                ct => 0,
                cc => 0,
                total => 0 ); # Set all values to zero to start incase
+ no possible sites found.
    # Find all possible Thymidine dimerizations (most common dimerizat
+ion)
    while( $genome =~ /t(?=t)/g ) {
        $mutations{tt}++;
    }
    # Find all possible heterogeneous dimerization sites (less common)
    while( $genome =~ /c(?=t)/g ) {
        $mutations{ct}++;
    }
    while( $genome =~ /t(?=c)/g ) {
        $mutations{ct}++;
    }
    # Find all possible Cystine dimerization sites (least common)
    while( $genome =~ /c(?=c)/g ) {
        $mutations{cc}++;
    }
    # Store the total mutations for later calculations
    $mutations{total} = $mutations{tt} + $mutations{ct} + $mutations{c
+c};
    return %mutations;
}
[download]

Comment on Genome UV Mutation Script -- Critique Select or Download Code

Replies are listed 'Best First'.
Re: Genome UV Mutation Script -- Critique by kyle (Abbot) on Mar 14, 2007 at 21:39 UTC
`local $/;` You have this at the top level. I think it's more clear to say `undef $/;`. The form using local is good if you're doing this inside a block and don't want to meddle with the value outside your block. `open(FH, $file) or die "File couldn't be opened";` I'd say: `open my $fh, '<', $file or die "Can't read $file: $!";` [download] The three argument open saves you from some odd cases, and the error message including the filename and `$!` tells you a lot more about what failed and why. `my $start = $1; $start--; #increment down for substr my $end = $2; $end--; #increment down for substr` [download] I'd move those lines up to immediately after the pattern match and keep them there. If you add some intervening code later that uses a pattern match, these won't work. Also, these expressions might be clearer as, for example, `my $start = $1 - 1` instead of the separate decrement. Your `html_out` would be better off accepting a hash reference instead of trying to stuff the whole hash in as arguments. You'd call it as `html_out($total_mutations, \%mutant_genes)` (note the backslash), and it would take its arguments like so: `sub html_out { my ( $total_muts, $mutant_genes_ref ) = @_; my %mutant_genes = %{$mutant_genes_ref}; # ...etc... }` [download] The rest of the code wouldn't change. I don't think it's a big deal in this case since `%mutant_genes` is global anyway (`by_descending_probability` is already using it). Have a look at reftut for more about references. On the plus side, I found it pretty readable (though I ignored the big hunks of HTML), and the comments are good.	[reply] [d/l] [select]
Re: Genome UV Mutation Script -- Critique by lima1 (Curate) on Mar 14, 2007 at 22:22 UTC
You should really take a look at Bioperl. It ships with a Genbank parser. Would be something like `my $seq_io = Bio::SeqIO->new( -file => $file, format => 'genbank'); my $whole_genome = $seq_io->next_seq;` [download] See perldoc Bio::SeqIO. This parser is quite robust...Which is important as you will notice when you screen many different species ;)	[reply] [d/l]
Re^2: Genome UV Mutation Script -- Critique by BrowserUk (Patriarch) on Mar 15, 2007 at 04:54 UTC
Be aware. Bio::SeqIO may be robust, but unless thing have changed in the last 6 months, it is very, very (ludicrously) slow. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^3: Genome UV Mutation Script -- Critique by lima1 (Curate) on Mar 15, 2007 at 09:20 UTC
thats not really an issue with this small prokaryotic genomes. Update: i was curious and tried this with Bioperl. It is indeed MUCH slower. Thanks for the hint.	[reply]
Re: Genome UV Mutation Script -- Critique by jdporter (Paladin) on Mar 14, 2007 at 21:45 UTC
Looks fine. One thing that jumps out at me is that your regex for counting locations, `/[ct](?=[ct])/g`, is duplicated, and at the same time is being used two different ways. I'd probably stick that in a subroutine and call it in both places. Perhaps something like `sub count_possible_mutation_locations(\$) { my $sr = shift; scalar( () = $$sr =~ /[ct](?=[ct])/g ) }` [download] A word spoken in Mind will reach its own level, in the objective world, by its own weight	[reply] [d/l] [select]
Re^2: Genome UV Mutation Script -- Critique by kyle (Abbot) on Mar 14, 2007 at 21:54 UTC
Why use a prototype here? And why take a reference to the string? `sub count_possible_mutation_locations { return scalar ( () = $_[0] =~ /[ct](?=[ct])/g ); }` [download]	[reply] [d/l]
Re^3: Genome UV Mutation Script -- Critique by jdporter (Paladin) on Mar 15, 2007 at 01:42 UTC
Oh, no particular reason to in this case. Sometimes I like to do this to get a little bit of typesafety.	[reply]
Re: Genome UV Mutation Script -- Critique by jwkrahn (Abbot) on Mar 14, 2007 at 22:06 UTC
`# Remove extraneous characters, make it one big long string to use sub +str position on it $genome =~ s/[\d\r\n\s]+//g;` [download] The `\s` character class includes the characters `\r` and `\n` so including them is superfluous. If you want efficiency then use `tr///` instead of `s///`: `# Remove extraneous characters, make it one big long string to use sub +str position on it $genome =~ tr/0-9 \t\f\n\r//d;` [download]	[reply] [d/l] [select]
Re: Genome UV Mutation Script -- Critique by Anno (Deacon) on Mar 14, 2007 at 22:28 UTC
It would have been nice if you had added `[readmore]` tags around your rather massive code. The code looks okay to me. There's nothing, stylistically or logically, that sticks out as bad. (Okay, the `die` message after `open` should mention the file name and the error message. File slurping should be a local action.) I also found the code over-commented in places, but that's a matter of taste. However, it could probably make better use of existing modules. There appears to be a substantial library of genetics-oriented modules on CPAN. I'd look there for useful parts. A module that parses genome/gene information in the PubMed format doesn't seem to be out of the question. HTML generation is another thing. You could probably use a HTML template, but even plain `CGI.pm` simplifies HTML generation. The table you're printing as text output could be nicely done by `Text::Table`. That would save you the `sprintf` formats you are using for the table header and content, and their maintenance. `File::Slurp` would make file slurping a local action without effect on the rest of the script. Anno	[reply] [d/l] [select]
Re: Genome UV Mutation Script -- Critique by c4onastick (Friar) on Mar 15, 2007 at 03:24 UTC
Thank you all for your input. Although I didn't check CPAN before I started writing this, I'm not surprised that someone else has solved this problem before. I don't get enough opportunities at work to flex my perl muscles and thought this would be a great learning experience (as it has been). I will look into references, I definitely didn't understand them the first time through Larry's Programming Perl, and it seems from your comments, that they warrant a second look. Thanks again! I really appreciate everyone's input!	[reply]