InfoSeeker has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am quite new to perl...I have written code to read in protein entries and save each in a seperate Fasta (.fa) file..I manage to save the header correctly in each .fa but I am unable to save the string! What am I doing wrong? Is the $pdbString variable out of scope? Or is there an error with the I/O?

This is the input: >d1dlwa_ a.1.1.1 slfeqlggqaavqavtaqfyangidmpnqtnktaaflcaalggpnawtgrnlkevhanmgvsnaqfttvi +ghlrsaltgagvaaalveqtvavaetvrgdvvtvniqadatvatff >d1uvyc1 b.1.1.1 niqadatvatffngidmpnqtnktaaflcaalggpnawtgrnlkevhanmgvsnaqfttvighlrsaltg +agvaaalveqtvavaetvrgdvvtvslfeqlggqaavqavtaqfya
My expected output (each in a .fa file): >1dl_a_ slfeqlggqaavqavtaqfyangidmpnqtnktaaflcaalggpnawtgrnlkevhanmgvsnaqfttvi +ghlrsaltgagvaaalveqtvavaetvrgdvvtvniqadatvatff >1uvy_c1 niqadatvatffngidmpnqtnktaaflcaalggpnawtgrnlkevhanmgvsnaqfttvighlrsaltg +agvaaalveqtvavaetvrgdvvtvslfeqlggqaavqavtaqfya

THIS IS MY CODE (if there is anyway to make this more compact, I appreciate it!)

#!/usr/bin/perl -w @PDBchain; @PDBcode; @pdbString; @SCOPclass; $count =0; #SET A COUNTER FOR NUMBER OF ENTRIES open (IN, "$ARGV[0]"); my @array; while (<IN>) { my $pdbId = $_; chomp($pdbId); push @array, $pdbId; print $pdbId; print "\n\n"; if ($pdbId =~/>d(\d\w\w\w)(\w[_\d])\s(\w)/) { $pdbString[$count] = ""; $PDBcode[$count]=$1; #SAVE PDB CODE print $PDBcode[$count]; print "\n"; $PDBchain[$count]=$2; #SAVE PDB CHAIN $SCOPclass[$count]=$3; #SAVE SCOP CLASS (A,B,C,D,E,F,G +) $count++; next; } else { $pdbString[$count].=$pdbId; next; } print $pdbString[$count]; #this prints the complete fasta sequ +ence per entry! print "\n\n"; } close IN; print "Total number of entries: ".$count. "\n"; #COUNT TOTAL NUMBER OF + PDBS REPRESENTED for(my $i=0;$i<$count;$i++) { open (OUT, ">>$PDBcode[$i]_$PDBchain[$i].fa"); print OUT ">",$PDBcode[$i], "_", $PDBchain[$i], "\n"; print OUT $pdbString[$i]; #WHY AM I UNABLE TO PRINT THE SEQUENCE HERE? close OUT; }

Replies are listed 'Best First'.
Re: Problems with Variable Scope in Perl
by graff (Chancellor) on Nov 02, 2008 at 18:45 UTC
    The first problem I see is that you increment "$count" as soon as you put the first line of each entry into each of your arrays. Then when you hit the second line of each entry, $count is putting the long string of letters into the next element of your "pdbString" array. So when you go to print to your output file(s), you are getting the wrong pdbString data for each entry (and this string is empty/undefined for the first entry).

    (Update: Forgot to mention: because of the way you used your $count variable, all the "pdbString" values end up as empty strings: you increment $count before you store the string in that array, then when you hit the start of the next entry, you assign "" (empty string) to that same array element.)

    Here's a cleaned up version of your code, with "use strict" included (because that's what all good monks do), and a fair bit of seemingly unnecessary stuff left out.

    #!/usr/bin/perl -w use strict; my ( @PDBchain, @PDBcode, @pdbString, @SCOPclass ); open (IN, $ARGV[0] ); while (<IN>) { chomp; if ( /^>d(\d\w{3})(\w[_\d])\s(\w)/ ) { push @PDBcode, $1; #SAVE PDB CODE push @PDBchain, $2; #SAVE PDB CHAIN push @SCOPclass, $3; #SAVE SCOP CLASS (A,B,C,D,E,F,G) push @pdbString, ""; } else { $pdbString[$#pdbString] .= $_; } } close IN; print "Total number of entries: ".scalar( @PDBcode ). "\n"; #COUNT TOT +AL NUMBER OF PDBS REPRESENTED for ( 0 .. $#PDBcode ) { open (OUT, ">>$PDBcode[$_]_$PDBchain[$_].fa"); print OUT ">",$PDBcode[$_], "_", $PDBchain[$_], "\n"; print OUT $pdbString[$_], "\n"; close OUT; }
    Given the input you showed, that produces two output files, with one entry in each file. BTW, the OP code saved stuff into a "@SCOPclass" array, but then never used it. Was that intentional?

    Oh, another thing about the OP code: the last two print statements of the while loop are never reached. That's okay, you don't need them anyway, but proper indentation of your code might help to avoid misunderstandings.

      Thank you very much Graff! Yes, saving the @SCOPclass array is intentional. One more question though: could you kindly explain:

      $pdbString[$#pdbString] .= $_;

      What do the $ followed by # represent? As I am still new at Perl I get confused between these symbols! Thanks for all your help :)

        The string contained in $_ is being joined to the string in the last element of @pdbString This code could also be written as:
        $pdbString[@pdbString - 1] .= $_;
        or
        $pdbString[-1] .= $_;
        Which I find simpler to read (negative index references the array starting from the end).

        $#pdbString is the index of the last element in array @pdbString. There are more examples and explanations in perldata.

        Yes, saving the @SCOPclass array is intentional.

        ... and so the fact that you never use the saved values would seem to be unintentional?

Re: Problems with Variable Scope in Perl
by MadraghRua (Vicar) on Nov 03, 2008 at 19:00 UTC
    You can actually do a lot of this more efficiently in BioPerl. So go to CPAN and search for Bundle::BioPerl. There are several types of BioPerl bundels available, but you just want this one for now - the core modules.

    You can then code something like this:

    use Bio::SeqIO; $in = Bio::SeqIO->new(-file => '$inputfile', -format => 'fasta'); while (my $seq = $in->next_seq() ) { print "fasta tag: ", $seq->id(), "sequence: ", $seq->sequence(), "\n"; }

    I think you can figure out how to do the rest of your stuff once you have the fasta tag and the sequence.

    In case you don't know about it, check out BioPerl for tutorials, How Do I's and script examples.

    Hope this helps...

    MadraghRua
    yet another biologist hacking perl....

      Thank you Graff, Ruzam, and MadraghRua for your comments. I understand things a bit better now :) InfoSeeker