lairel has asked for the wisdom of the Perl Monks concerning the following question:

I keep getting errors about uninitialized variables and needing explicit declaration. I have tried so many things that I am lost and confused and can't remember what I have tried :( The goal of this program is to read in a file, search each header for the species name, then return only the species name from the header followed by the non-header lines. I have left some of my attempts commented out, but this is what I currently have:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

open( INFILE, "<", 'myosin.fasta') or die $!; #open original myosin.fasta for reading
open( OUTFILE, ">", 'modifiedHeaders.fasta') or die $!; #open/create new fasta for writing

my %headerHash; #create hash for the header
my %seq;
my %header;

#loop through old file
while (<INFILE>){
    chomp;
    my $line = $_;
    # use if/then to seperate headers from seq, while regex to select only species name in header
    if ($line =~ /^>/){
        my $header = $line;
        if ($header =~ /(>gi.*)\[(.+)\](>gi.*)\[(.+)\]/){
            my $headerHash;
            $headerHash{$1} = $2;
            # return $headerHash;
        }
        else{
            my $seq = $line;
            # return $seq ;
            print OUTFILE $headerHash, "\n", $seq, "\n";
        }
        # if ($header =~ /(>gi.*)\[(.+)\](>gi.*)\[(.+)\]/){
        # my $headerHash = $2;}
        #print $headerHash
        # print OUTFILE $headerHash{$2}, "\n", $seq, "\n";
    }
}
#print OUTFILE $headerHash, "\n", $seq, "\n";

Here is an example of the input:

>gi|115527082|ref|NP_005954.3| myosin-1 [Homo sapiens] >gi|226694176|sp|P12882.3|MYH1_HUMAN RecName: Full=Myosin-1; AltName: Full=Myosin heavy chain 1; AltName: Full=Myosin heavy chain 2x; Short=MyHC-2x; AltName: Full=Myosin heavy chain IIx/d; Short=MyHC-IIx/d; AltName: Full=Myosin heavy chain, skeletal muscle, adult 1 [Homo sapiens] >gi|119610411|gb|EAW90005.1| hCG1986604, isoform CRA_b
MSSDSEMAIFGEAAPFLRKSERERIEAQNKPFDAKTSVFVVDPKESFVKATVQSREGGKVTAKTEAGATVTVKDDQVFPM
NPPKYDKIEDMAMMTHLHEPAVLYNLKERYAAWMIYTYSGLFCVTVNPYKWLPVYNAEVVTAYRGKKRQEAPPHIFSISD
NAYQFMLTDRENQSILITGESGAGKTVNTKRVIQYFATIAVTGEKKKEEVTSGKMQGTLEDQIISANPLLEAFGNAKTVR
NDNSSRFGKFIRIHFGTTGKLASADIETYLLEKSRVTFQLKAERSYHIFYQIMSNKKPDLIEMLLITTNPYDYAFVSQGE
ITVPSIDDQEELMATDSAIEILGFTSDERVSIYKLTGAVMHYGNMKFKQKQREEQAEPDGTEVADKAAYLQNLNSADLLK
ALCYPRVKVGNEYVTKGQTVQQVYNAVGALAKAVYDKMFLWMVTRINQQLDTKQPRQYFIGVLDIAGFEIFDFNSLEQLC
INFTNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLAACIELIEKPMGIFSILEEECMFPKATDTSFKNKLYEQHL
GKSNNFQKPKPAKGKPEAHFSLIHYAGTVDYNIAGWLDKNKDPLNETVVGLYQKSAMKTLALLFVGATGAEAEAGGGKKG
GKKKGSSFQTVSALFRENLNKLMTNLRSTHPHFVRCIIPNETKTPGAMEHELVLHQLRCNGVLEGIRICRKGFPSRILYA

The desired output for that should be:

Homo sapiens
MSSDSEMAIFGEAAPFLRKSERERIEAQNKPFDAKTSVFVVDPKESFVKATVQSREGGKVTAKTEAGATVTVKDDQVFPM
NPPKYDKIEDMAMMTHLHEPAVLYNLKERYAAWMIYTYSGLFCVTVNPYKWLPVYNAEVVTAYRGKKRQEAPPHIFSISD
NAYQFMLTDRENQSILITGESGAGKTVNTKRVIQYFATIAVTGEKKKEEVTSGKMQGTLEDQIISANPLLEAFGNAKTVR
NDNSSRFGKFIRIHFGTTGKLASADIETYLLEKSRVTFQLKAERSYHIFYQIMSNKKPDLIEMLLITTNPYDYAFVSQGE
ITVPSIDDQEELMATDSAIEILGFTSDERVSIYKLTGAVMHYGNMKFKQKQREEQAEPDGTEVADKAAYLQNLNSADLLK
ALCYPRVKVGNEYVTKGQTVQQVYNAVGALAKAVYDKMFLWMVTRINQQLDTKQPRQYFIGVLDIAGFEIFDFNSLEQLC
INFTNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLAACIELIEKPMGIFSILEEECMFPKATDTSFKNKLYEQHL
GKSNNFQKPKPAKGKPEAHFSLIHYAGTVDYNIAGWLDKNKDPLNETVVGLYQKSAMKTLALLFVGATGAEAEAGGGKKG
GKKKGSSFQTVSALFRENLNKLMTNLRSTHPHFVRCIIPNETKTPGAMEHELVLHQLRCNGVLEGIRICRKGFPSRILYA

Re: While loop with nested if statements (updated)
by Athanasius (Archbishop) on Apr 03, 2016 at 06:36 UTC

    Hello lairel,

    When I run your code with perl -c, I get the following output:

    What this is telling you is that the variable $headerHash is being used where it hasn’t been declared. That’s because a variable declaration with my (which produces what is called a lexical variable) has a scope which is limited to the block where it occurs (see e.g. the tutorials in Variables and Scoping). So in this code:

    while (<INFILE>){
        chomp;
        my $line = $_;
        ...
        if ($line =~ /^>/){
            my $header = $line;
            if ($header =~ /(>gi.*)\[(.+)\](>gi.*)\[(.+)\]/){
                my $headerHash;                                 # A
                $headerHash{$1} = $2;
            }
            else{
                my $seq = $line;
                print OUTFILE $headerHash, "\n", $seq, "\n";    # B
            }
        }
    }

    the declaration at point A has gone out of scope by the time the variable is referenced at point B. To fix this, you need to give the variable a wider scope, by declaring it before you enter the if/else construct:

    while (<INFILE>){
        chomp;
        my $line = $_;
        my $headerHash;
        ...
        if ($line =~ /^>/){
            my $header = $line;
            if ($header =~ /(>gi.*)\[(.+)\](>gi.*)\[(.+)\]/){
                $headerHash{$1} = $2;
            }
            else{
                my $seq = $line;
                print OUTFILE $headerHash, "\n", $seq, "\n";
            }
        }
    }

    But while this will remove the compile error, it doesn’t make a lot of sense: you print out the value of $headerHash only if its value has not been initialised! Perhaps you need to declare $headerHash before the while loop? In any case, a line such as:

    print OUTFILE $headerHash ...

    will only result in output like this:

    16:17 >perl -wE "my $h = { a => 1, e => 2 }; print STDOUT $h;"
    HASH(0x3ac320)
    16:30 >

    which is not what you want. Please supply a sample input file, together with the output you want to produce from that file (see How do I post a question effectively?). This will help the monks to understand what you’re really trying to achieve.
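
    To make the two points above concrete (where a my variable is visible, and what a hash reference prints as), here is a minimal sketch. The helper label_lines and the sample strings are made up for illustration and are not part of your script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): label each line
# as a header or a sequence line.
sub label_lines {
    my @lines = @_;
    my @labelled;
    for my $line (@lines) {
        my $kind;                         # declared before the if/else, so both branches see it
        if ($line =~ /^>/) {
            $kind = 'header';
        }
        else {
            $kind = 'sequence';
        }
        push @labelled, "$kind: $line";   # $kind is still in scope here
    }
    return @labelled;
}

print "$_\n" for label_lines('>head1', 'ACGT');

# A hash reference prints as HASH(0x...); dereference it to see the contents.
my $h = { a => 1, e => 2 };
print join(', ', map { "$_=$h->{$_}" } sort keys %$h), "\n";
```

    Had $kind been declared with my inside the if branch only, the push line would fail to compile under strict, which is the same situation as points A and B above.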

    Update:

    Until I read Marshall’s reply, I hadn’t noticed that %headerHash is already declared before the loop; I was reading the line $headerHash{$1} = $2; as though it were this: $headerHash->{$1} = $2; — i.e., treating $headerHash as a hash reference — which of course it isn’t. :-(

    lairel, just to emphasise the point Marshall is making: in Perl, $headerHash, @headerHash, and %headerHash are separate, completely unrelated variables. $headerHash{$key} references one element of the hash %headerHash, and is unrelated to the scalar variable $headerHash.
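
    A small sketch of that point, using a deliberately reused name (the variable name and values are invented for the example; reusing one name like this is legal but, as noted, confusing in real code):

```perl
use strict;
use warnings;

my $name = 'scalar value';          # the scalar $name
my @name = ('array', 'values');     # the array @name, a different variable
my %name = (key => 'hash value');   # the hash %name, different again

# The sigil on an element access describes the element (a scalar);
# the brackets that follow say which container is meant.
sub describe {
    return (
        "scalar: $name",
        "array element: $name[0]",      # element 0 of @name
        "hash element: $name{key}",     # element 'key' of %name
    );
}

print "$_\n" for describe();
```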

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: While loop with nested if statements
by kcott (Archbishop) on Apr 03, 2016 at 10:09 UTC

    G'day lairel,

    "I keep getting errors about uninitialized variables and needing explicit declaration."

    Those error messages should tell you the variable involved and other information: use this to track down your problem. If you need help with any particular error message(s), you'll need to show us the text also.

    "I have tried so many things that I am lost and confused ..."

    You need to understand fundamental constructs, such as while loops. I suggest you read "perlintro -- a brief introduction and overview of Perl".

    Consider this technique for accessing fasta files for this type of work:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    {
        local $/ = "\n>";

        while (<DATA>) {
            $_ = substr $_, 1 if $. == 1;
            my ($head, $data) = split /\n/;

            if ($head eq $ARGV[0]) {
                print "Found! Head: '$head'; Data: '$data'\n";
                last;
            }
        }
    }

    __DATA__
    >head1
    data1
    >head2
    data2
    >head3
    data3

    Some test runs:

    $ pm_1159403_fasta.pl head1
    Found! Head: 'head1'; Data: 'data1'
    $ pm_1159403_fasta.pl head2
    Found! Head: 'head2'; Data: 'data2'
    $ pm_1159403_fasta.pl head3
    Found! Head: 'head3'; Data: 'data3'

    See also: $/ - input record separator; $. - input line number.
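
    As a rough sketch of how the same record-at-a-time technique might be applied to the species-name task in the original question: the helper species_records, the 'unknown' fallback, and the sample data are all assumptions made for this example, and the regex simply takes the first [...] pair in each header.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle, one record at a
# time, and return [species, sequence] pairs.
sub species_records {
    my ($fh) = @_;
    my @records;
    local $/ = "\n>";                   # a record ends just before the next header
    while (my $record = <$fh>) {
        chomp $record;                  # removes a trailing "\n>" if present
        $record =~ s/^>//;              # the first record keeps its leading ">"
        my ($head, @seq_lines) = split /\n/, $record;
        next unless defined $head;
        my ($species) = $head =~ /\[(.+?)\]/;   # first [...] in the header
        # 'unknown' is an assumed placeholder for headers without [...]
        push @records, [ $species // 'unknown', join("\n", @seq_lines) ];
    }
    return @records;
}

# In-memory filehandle with made-up data, standing in for myosin.fasta.
my $fasta = ">gi|1| myosin-1 [Homo sapiens] >gi|2| other [Homo sapiens]\nMSSD\nNPPK\n"
          . ">gi|3| myosin-2 [Mus musculus]\nAAAA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
print "$_->[0]\n$_->[1]\n" for species_records($fh);
close $fh;
```

    Because $/ is set with local inside the sub, the change to the input record separator does not leak into the rest of the program.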

    — Ken

Re: While loop with nested if statements
by Marshall (Canon) on Apr 03, 2016 at 13:36 UTC
    If you had some example input, I could perhaps run the code. Right now all I can see are the compiler errors.

    However, some things look odd to me:

    my $headerHash;
    $headerHash{$1} = $2;
    You have already declared my %headerHash; outside of the loop. Here $headerHash is declared as a scalar, not a hash! Perl does allow different name spaces for hash vs scalar. You can have both a scalar and a hash named "headerHash", but in this case, I think this is a bad idea. I suggest you change this scalar version of $headerHash to something else. Your code is very confusing.

    Update: Even if you move my $headerHash; above the while loop, line 28, print OUTFILE $headerHash, "\n", $seq, "\n";, will still produce a runtime warning ("Use of uninitialized value") because nowhere that I can see is this scalar $headerHash assigned a value; it will be undef.

    Maybe you are confused about hash syntax? A hash like %headerHash is accessed with two things, a key and a value, like $hash{$key}=$value.

    Be careful: something like keys(%hash)=55; doesn't do what you might think! It doesn't assign a value to the hash. It increases the size of the hash, i.e. allocates more buckets. Consider:

    #!/usr/bin/perl
    use strict;

    my %hash;            #defaults to 8 buckets
    $hash{a}=33;         #use one of the buckets
    my $buckets = %hash;
    print "$buckets\n";  #prints 1/8, 1 of 8 buckets used

    keys(%hash)=32;      #increase size of hash to 32 buckets
    $buckets = %hash;
    print "$buckets\n";  #prints 1/32 1 of 32 buckets used

    Perl starts a new hash with a default of 8 buckets. When it needs to grow the hash, it doubles the number of buckets: 8, 16, 32, 64, etc. I have benchmarked presetting big hashes to big bucket sizes to prevent this automatic resizing, but found that it makes almost no difference in performance. Perl is surprisingly efficient at this transparent operation. It is best to just let Perl "do its thing" without trying to overly "help it". I only mention this here to show an error that could produce very unexpected results if you botch the hash assignment syntax.
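
    For contrast, here is a short sketch of ordinary hash assignment and access. The hash name %species_for and the second accession are invented for the example. As far as I know, on perl 5.26 and later a hash in scalar context returns the number of keys rather than the used/allocated bucket ratio, so counting with keys is the portable way to get the size:

```perl
use strict;
use warnings;

my %species_for;                                  # hypothetical hash: accession => species

$species_for{'NP_005954.3'} = 'Homo sapiens';     # store one key/value pair
$species_for{'EXAMPLE_1'}   = 'Mus musculus';     # made-up accession for illustration

# Number of keys, independent of how many buckets Perl has allocated.
my $count = keys %species_for;
print "$count entries\n";

# Walk the hash; sort the keys because hash order is not predictable.
for my $accession (sort keys %species_for) {
    print "$accession => $species_for{$accession}\n";
}
```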

      I've added an example of the input and output to my original post, and my code currently looks like this. I am still not sure I am using the hash correctly, and the hash key is really throwing me. I'm not getting any errors with this code, but my output file is empty:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use diagnostics;

      open( INFILE, "<", 'myosin.fasta') or die $!; #open original myosin.fasta for reading
      open( OUTFILE, ">", 'modifiedHeaders.fasta') or die $!; #open/create new fasta for writing

      my %headerHash; #create hash for the header
      my $seq;
      my $header;

      #loop through old file
      while (<INFILE>){
          chomp;
          my $line = $_;
          # use if/then to seperate headers from seq, while regex to select only species name in header
          if ($line =~ /^>/){
              $header = $line;
              if ($header =~ /(>gi.*)\[(.+)\](>g.*)\[(.+)\]/){
                  my $headerHash = $2;
                  $headerHash{$2} = $2;
              }
          }
          else {
              $seq = $line;
          }
      }
      close INFILE;

      for my $key(keys %headerHash){
          print OUTFILE $headerHash{$key}, "\n", $seq, "\n";
      }
        I still see multiple problems. Can you show what you expect the output lines to be? I guess there is some input now on the OP. I get now:
        Global symbol "$headerHash" requires explicit package name at C:\Projects_Perl\anotherfastathing.pl line 30.
        Execution of C:\Projects_Perl\anotherfastathing.pl aborted due to compilation errors (#1)
            (F) You've said "use strict" or "use strict vars", which indicates
            that all variables must either be lexically scoped (using "my" or "state"),
            declared beforehand using "our", or explicitly qualified to say
            which package the global variable is in (using "::").

        Uncaught exception from user code:
            Global symbol "$headerHash" requires explicit package name at C:\Projects_Perl\anotherfastathing.pl line 30.
        Execution of C:\Projects_Perl\anotherfastathing.pl aborted due to compilation errors.
        Process completed with exit code 255
        Update: Ok, this looks like what you want, why not?
        #!/usr/bin/perl
        use warnings;
        use strict;
        use diagnostics;

        my $firstline = <DATA>;
        my ($species) = $firstline =~ /\[(.+?)\]/;
        print "$species\n";
        while (<DATA>) {print;}

        =Prints:
        Homo sapiens
        MSSDSEMAIFGEAAPFLRKSERERIEAQNKPFDAKTSVFVVDPKESFVKATVQSREGGKVTAKTEAGATVTVKDDQVFPM
        NPPKYDKIEDMAMMTHLHEPAVLYNLKERYAAWMIYTYSGLFCVTVNPYKWLPVYNAEVVTAYRGKKRQEAPPHIFSISD
        NAYQFMLTDRENQSILITGESGAGKTVNTKRVIQYFATIAVTGEKKKEEVTSGKMQGTLEDQIISANPLLEAFGNAKTVR
        NDNSSRFGKFIRIHFGTTGKLASADIETYLLEKSRVTFQLKAERSYHIFYQIMSNKKPDLIEMLLITTNPYDYAFVSQGE
        ITVPSIDDQEELMATDSAIEILGFTSDERVSIYKLTGAVMHYGNMKFKQKQREEQAEPDGTEVADKAAYLQNLNSADLLK
        ALCYPRVKVGNEYVTKGQTVQQVYNAVGALAKAVYDKMFLWMVTRINQQLDTKQPRQYFIGVLDIAGFEIFDFNSLEQLC
        INFTNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLAACIELIEKPMGIFSILEEECMFPKATDTSFKNKLYEQHL
        GKSNNFQKPKPAKGKPEAHFSLIHYAGTVDYNIAGWLDKNKDPLNETVVGLYQKSAMKTLALLFVGATGAEAEAGGGKKG
        GKKKGSSFQTVSALFRENLNKLMTNLRSTHPHFVRCIIPNETKTPGAMEHELVLHQLRCNGVLEGIRICRKGFPSRILYA
        =cut

        __DATA__
        >gi|115527082|ref|NP_005954.3| myosin-1 [Homo sapiens] >gi|226694176|sp|P12882.3|MYH1_HUMAN RecName: Full=Myosin-1; AltName: Full=Myosin heavy chain 1; AltName: Full=Myosin heavy chain 2x; Short=MyHC-2x; AltName: Full=Myosin heavy chain IIx/d; Short=MyHC-IIx/d; AltName: Full=Myosin heavy chain, skeletal muscle, adult 1 [Homo sapiens] >gi|119610411|gb|EAW90005.1| hCG1986604, isoform CRA_b
        MSSDSEMAIFGEAAPFLRKSERERIEAQNKPFDAKTSVFVVDPKESFVKATVQSREGGKVTAKTEAGATVTVKDDQVFPM
        NPPKYDKIEDMAMMTHLHEPAVLYNLKERYAAWMIYTYSGLFCVTVNPYKWLPVYNAEVVTAYRGKKRQEAPPHIFSISD
        NAYQFMLTDRENQSILITGESGAGKTVNTKRVIQYFATIAVTGEKKKEEVTSGKMQGTLEDQIISANPLLEAFGNAKTVR
        NDNSSRFGKFIRIHFGTTGKLASADIETYLLEKSRVTFQLKAERSYHIFYQIMSNKKPDLIEMLLITTNPYDYAFVSQGE
        ITVPSIDDQEELMATDSAIEILGFTSDERVSIYKLTGAVMHYGNMKFKQKQREEQAEPDGTEVADKAAYLQNLNSADLLK
        ALCYPRVKVGNEYVTKGQTVQQVYNAVGALAKAVYDKMFLWMVTRINQQLDTKQPRQYFIGVLDIAGFEIFDFNSLEQLC
        INFTNEKLQQFFNHHMFVLEQEEYKKEGIEWTFIDFGMDLAACIELIEKPMGIFSILEEECMFPKATDTSFKNKLYEQHL
        GKSNNFQKPKPAKGKPEAHFSLIHYAGTVDYNIAGWLDKNKDPLNETVVGLYQKSAMKTLALLFVGATGAEAEAGGGKKG
        GKKKGSSFQTVSALFRENLNKLMTNLRSTHPHFVRCIIPNETKTPGAMEHELVLHQLRCNGVLEGIRICRKGFPSRILYA
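
        If the real file holds more than one record, the same idea could be written as a while loop with a nested if/else, which is closer to the structure of the original attempt. This is only a sketch under assumptions: rewrite_headers is a made-up name, the sample data is invented, and headers without a [...] pair are passed through unchanged, which may or may not be what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: copy FASTA from $in to $out, replacing each header
# line with just the species name found in its first [...] pair.
sub rewrite_headers {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>/) {
            # Header line: keep only the first bracketed species name.
            if ($line =~ /\[(.+?)\]/) {
                print {$out} "$1\n";
            }
            else {
                print {$out} "$line\n";   # assumption: no species found, keep header as-is
            }
        }
        else {
            print {$out} "$line\n";       # sequence line: pass through unchanged
        }
    }
    return;
}

# Made-up data via in-memory filehandles, standing in for the real files.
my $input  = ">gi|1| myosin-1 [Homo sapiens] >gi|2| x [Homo sapiens]\nMSSD\nNPPK\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";
rewrite_headers($in, $out);
close $out;
print $output;
```

        No hash is needed here, because each header is written out as soon as it is seen. Lexical filehandles (my $in, my $out) are used instead of the bareword INFILE/OUTFILE so they can be passed to the sub.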