hotel has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am new to Perl, as I am new to this website. I can actually be considered new to programming altogether. I've been working on a Perl script for the last few days, and although it works fine (as far as I can tell from my output), it displays a warning whose cause I cannot figure out. Below you can find part of the text file that my script parses (the file consists of this motif only) and the piece of my code that seems to be the troublemaker.

The warning message I get from the command prompt is this:

.....(the same warning with different line numbers) ...........

Argument "" isn't numeric in numeric le (<=) at C:\Users\kullaniciadi\Desktop\plasmodium\plasm-backup.pl line 92, <geneREAD> line 2454.

Argument "" isn't numeric in numeric le (<=) at C:\Users\kullaniciadi\Desktop\plasmodium\plasm-backup.pl line 92, <geneREAD> line 2463.

Argument "" isn't numeric in numeric le (<=) at C:\Users\kullaniciadi\Desktop\plasmodium\plasm-backup.pl line 92, <geneREAD> line 2463.

......(the same warning with different line numbers)

Here is the input file:

Gene: PF14_0747

TABLE: Epitopes from IEDB

Epitope Sequence Location on Protein Strain Confidence

26850 IKND 1914-1917 Plasmodium falciparum 3D7 Medium

------------------------------------------------------------

Gene: PF14_0711

TABLE: Epitopes from IEDB

Epitope Sequence Location on Protein Strain Confidence

26850 IKND 9-12 Plasmodium falciparum 3D7 Medium

------------------------------------------------------------

and here is the essential piece of my code

my $ctr;
while($line2=<geneREAD>) {
    $ctr=0;
    chomp($line2);
    if($line2=~ m/^Gene:\s*(.*)/) {
        $geneName= $1;
        #print "$geneName\n";
        if(grep(/^$geneName$/, @geneArr)) {
            while(substr($line2, 0 , 3) ne "---") {
                #print "$line2\n";
                $line2=<geneREAD>;
                chomp($line2);
                if($line2=~ m/(\d*)\s*([A-Z]*)\s*(\d*)-(\d*)/) {
                    $epitope= $1;
                    $epiSeq= $2;
                    $locBeg=$3;
                    $locEnd=$4;
                    foreach $snpID(@ {$snpHash{$geneName}}) {
                        #print "$snpID $HoProPos{$snpID}\n";
                        if($locBeg <= $HoProPos{$snpID} && $HoProPos{$snpID} <= $locEnd) {
                            if($ctr== 0) {print OUT "Gene: $geneName\n";}
                            print OUT "SNP $snpID found in epitope $epitope at protein position $HoProPos{$snpID}\n";
                            $ctr=1;
                        }
                    }
                    #print "$1 $2 $3 $4\n"
                }
            }
        }
    }
}
I do not understand how the line consisting of dashes enters the if in which I should only have two numeric values. Thank you for your time.

PS: Please let me know if I did not formulate the question well enough, did not provide enough information/code, or provided more than required, as this is my first post.

Replies are listed 'Best First'.
Re: Argument "" isn't numeric in numeric le (<=)
by gone2015 (Deacon) on Aug 24, 2009 at 15:41 UTC
    I do not understand how the line consisting of dashes enters the if in which I should only have two numeric values.

    The problem is that the regular expression in:

    if($line2=~ m/(\d*)\s*([A-Z]*)\s*(\d*)-(\d*)/)
    accepts anything that has at least one '-' in it. Remember that \d* accepts zero or more digits, and \s* accepts zero or more white-space characters, etc. So this regex will match zero or more digits, followed by zero or more white-space characters, followed by zero or more [A-Z] characters, etc. The only thing that has to be present is a '-' to get a match.

    I suspect that you would be perfectly happy with:

    if($line2=~ m/(\d+)\s*([A-Z]+)\s*(\d+)-(\d+)/)
    where you are using the regex to do two things at once: (a) recognise the lines that contain the information you wish to process further, and (b) parse those lines to extract that information.
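
    The difference between the two patterns can be checked on the sample lines from the question. The following is only a small sketch: the helper name fields and the names $star and $plus are made up for this illustration and are not part of the original script.

```perl
use strict;
use warnings;

# The two patterns discussed above: '*' allows empty matches, '+' does not.
my $star = qr/(\d*)\s*([A-Z]*)\s*(\d*)-(\d*)/;
my $plus = qr/(\d+)\s*([A-Z]+)\s*(\d+)-(\d+)/;

# Hypothetical helper: return the captured fields as an array ref,
# or undef if the pattern does not match the line at all.
sub fields {
    my ($re, $line) = @_;
    my @f = $line =~ $re;
    return @f ? \@f : undef;
}

my @lines = (
    '26850 IKND 1914-1917 Plasmodium falciparum 3D7 Medium',
    '------------------------------------------------------------',
);

for my $line (@lines) {
    for my $pair ([star => $star], [plus => $plus]) {
        my ($name, $re) = @$pair;
        my $f = fields($re, $line);
        # Show the first 12 characters of the line and what was captured.
        printf "%-4s %-12.12s %s\n", $name, $line,
            $f ? join('|', map { "'$_'" } @$f) : 'no match';
    }
}
```

    With the '*' pattern the dashed line matches and all four captures come back as empty strings; with the '+' pattern the dashed line is rejected, while the epitope line still parses.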

    If you are supremely confident that (a) the file you are processing always contains correctly formed lines, and (b) that your code recognises those correctly formed lines, then all will be well. In general I think it is wise to check the lines that are being rejected by the regex and warn about any whose format is not recognised. It's extra work to start with, but can save your bacon if some huge file at some future date contains broken data or stuff in a form you haven't catered for.
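
    One possible way to do that checking is sketched below. It assumes that the only legitimate line types are the ones visible in the sample file; the sub name classify, the category names and the deliberately malformed last line are all invented for this illustration.

```perl
use strict;
use warnings;

# Classify one input line. The categories and patterns are assumptions
# drawn from the sample file in the question, not a real format spec.
sub classify {
    my ($line) = @_;
    return 'gene'    if $line =~ /^Gene:\s*\S/;
    return 'header'  if $line =~ /^(?:TABLE:|Epitope\s+Sequence\b)/;
    return 'end'     if $line =~ /^-{3,}\s*$/;
    return 'blank'   if $line =~ /^\s*$/;
    return 'epitope' if $line =~ /^\d+\s+[A-Z]+\s+\d+-\d+\s/;
    return 'unknown';
}

# Sample data; the last line is deliberately malformed (9_12 instead of 9-12).
my $sample = <<'END';
Gene: PF14_0711
TABLE: Epitopes from IEDB
Epitope Sequence Location on Protein Strain Confidence
26850 IKND 9-12 Plasmodium falciparum 3D7 Medium
------------------------------------------------------------
26850 IKND 9_12 Plasmodium falciparum 3D7 Medium
END

# Read the sample through an in-memory filehandle and complain about
# anything that is not one of the recognised line types.
open my $in, '<', \$sample or die "Cannot read sample: $!\n";
while (my $line = <$in>) {
    chomp $line;
    warn "Line $.: unrecognised format: $line\n"
        if classify($line) eq 'unknown';
}
close $in;
```

    Only the malformed line produces a warning here; everything the script knows how to handle stays silent.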

      Thank you for your comments and replies. I fixed the code after the first message by changing the * to +, and it works fine.

      PS: oshall, thank you for your advice. I try to follow most of it when I'm dealing with large files.

      But I still do not understand why Perl throws this warning for the dashed lines which do not even go into the loop in which the comparison takes place, instead of pointing to the lines that cause the problem?

        That's a different question...

        ...the while(substr($line2, 0 , 3) ne "---") loop will certainly stop when $line2 is dashed. However, you enter the loop with $line2 set to the "Gene:" line you just processed, and at the top of the loop you read the next line. So, a dashed line is processed in the loop, and then brings the loop to a halt.

        The inner loop could be recast:

        while ($line2=<geneREAD>) {
          chomp $line2 ;
          if (substr($line2, 0 , 3) eq "---") { last ; } ;
          ....
        } ;
        ...mind you, you might want to check that the "Gene:" line is followed by something ? But that is part of the general problem of verifying the input.
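
        A runnable sketch of that recast loop follows, using Perl's last to leave the loop. The sub name read_record and the in-memory sample are assumptions made for the illustration. It deliberately keeps the original '*' regex, to show that the dashed line no longer reaches it once the end-of-record test comes first.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the epitope lines of one record. The dashed
# line ends the record before it is ever handed to the parsing regex.
sub read_record {
    my ($fh) = @_;
    my @epitopes;
    while (my $line = <$fh>) {
        chomp $line;
        last if substr($line, 0, 3) eq '---';    # end of record
        push @epitopes, [$1, $2, $3, $4]
            if $line =~ /(\d*)\s*([A-Z]*)\s*(\d*)-(\d*)/;
    }
    return \@epitopes;
}

my $sample = <<'END';
Gene: PF14_0747
TABLE: Epitopes from IEDB
Epitope Sequence Location on Protein Strain Confidence
26850 IKND 1914-1917 Plasmodium falciparum 3D7 Medium
------------------------------------------------------------
END

open my $fh, '<', \$sample or die "Cannot read sample: $!\n";
my $geneLine = <$fh>;    # the "Gene:" line, consumed by the outer loop
my $record   = read_record($fh);
print scalar(@$record), " epitope line(s) parsed\n";
close $fh;
```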

Re: Argument "" isn't numeric in numeric le (<=)
by Anonymous Monk on Aug 24, 2009 at 14:20 UTC
    * means zero or more times, and almost anything can match zero times, which is why one or more of $1 $2 $3 $4 is undefined

    You probably want to investigate the BioPerl modules

      No, not undefined. If the \d in /(\d*)/ matched zero times, $1 would be the empty string.

      You found the problem (since the OP is having problems with an empty string), it's just the description that's flawed.

      >perl -we"my $x = '' < 2"
      Argument "" isn't numeric in numeric lt (<) at -e line 1.

      >perl -we"my $x = undef() < 2"
      Use of uninitialized value in numeric lt (<) at -e line 1.
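
      The same distinction can be seen on the captures themselves. This is a small sketch; the sub name describe and the second pattern, with an optional group, are added here only for contrast and do not come from the OP's script.

```perl
use strict;
use warnings;

# Hypothetical helper: describe a capture as undef or as a defined
# string of a given length.
sub describe {
    my ($value) = @_;
    return defined $value ? 'defined, length ' . length($value) : 'undef';
}

# A group that matched zero characters yields the empty string ...
my @star = ('-' x 60) =~ /(\d*)\s*([A-Z]*)\s*(\d*)-(\d*)/;
print 'capture ', $_ + 1, ' from the * pattern: ', describe($star[$_]), "\n"
    for 0 .. $#star;

# ... whereas a group that did not take part in the match yields undef.
my @optional = '-' =~ /(\d+)?-/;
print 'capture 1 from the optional group: ', describe($optional[0]), "\n";
```
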
        My apologies ikegami, how ignorant of me to use undefined when clearly it wasn't numeric as per the OP.

      Thank you very much, my little script works pretty fine right now. I should've checked it more carefully before posting such a thing. Thank you.

Re: Argument "" isn't numeric in numeric le (<=)
by GrandFather (Saint) on Aug 24, 2009 at 21:57 UTC

    Deeply nested code should raise a red flag - things are getting out of hand and are likely to be difficult to understand. There are several ways to address the issue including removing some of the code into a subroutine and 'early exit'.

    I find early exit particularly good for reducing nested if blocks. Consider this refactored version of your code using early exits:

    use warnings;
    use strict;

    my %HoProPos = (1 => 1916);
    my %snpHash = (PF14_0747 => [1]);
    my @geneArr = qw(PF14_0747);

    while (my $line = <DATA>) {
        my $header = 1;

        chomp $line;
        next if $line !~ m/^Gene:\s*(.*)/;

        my $geneName = $1;

        next if ! grep (/^$geneName$/, @geneArr);

        while (substr ($line, 0, 3) ne "---") {
            $line = <DATA>;
            chomp $line;
            next if $line !~ m/(\d+)\s*([A-Z]+)\s*(\d+)-(\d+)/;

            my ($epitope, $epiSeq, $locBeg, $locEnd) = ($1, $2, $3, $4);

            foreach my $snpID (@{$snpHash{$geneName}}) {
                next if $locBeg > $HoProPos{$snpID} || $HoProPos{$snpID} > $locEnd;

                print "Gene: $geneName\n" if $header;
                print "SNP $snpID found in epitope $epitope at protein position $HoProPos{$snpID}\n";
                $header = undef;
            }
        }
    }

    __DATA__
    Gene: PF14_0747
    TABLE: Epitopes from IEDB
    Epitope Sequence Location on Protein Strain Confidence
    26850 IKND 1914-1917 Plasmodium falciparum 3D7 Medium
    ------------------------------------------------------------
    Gene: PF14_0711
    TABLE: Epitopes from IEDB
    Epitope Sequence Location on Protein Strain Confidence
    26850 IKND 9-12 Plasmodium falciparum 3D7 Medium
    ------------------------------------------------------------

    Note the use of if as a statement modifier for short statements. chomp doesn't need () - by convention () are not used for Perl's built in functions.

    Note too that $ctr has been renamed to $header to reflect its actual semantics, and that variables are declared in the smallest sensible scope.

    Although you haven't shown your file handling code, I bet you are using the two parameter version of open and possibly aren't checking the result. You certainly aren't using lexical file handles. So, you should use the three parameter version of open, check the result and use lexical file handles. Your open should then look something like:

    open my $inFile, '<', $dataFileName or die "Unable to open $dataFileName: $!\n";
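
    For completeness, here is a self-contained sketch of that open in use. The temporary file merely stands in for the real data file, whose name is not shown in the thread; File::Temp is a core module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real data file, so the sketch runs anywhere.
my ($tmp, $dataFileName) = tempfile(UNLINK => 1);
print {$tmp} "Gene: PF14_0747\n";
close $tmp or die "Unable to close $dataFileName: $!\n";

# Three-argument open, lexical file handle, result checked.
open my $inFile, '<', $dataFileName
    or die "Unable to open $dataFileName: $!\n";
my @lines = <$inFile>;
close $inFile;

print "Read ", scalar(@lines), " line(s)\n";
```

    A lexical handle such as $inFile is closed automatically when it goes out of scope, and the separate mode argument means that odd characters in the file name cannot be mistaken for an open mode.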


    True laziness is hard work
      my $resultFile= "C://Users/kullaniciadi/Desktop/plasmodium/results.txt";
      open (OUT, ">>$resultFile") or die ("Cannot open $resultFile");
      open(geneREAD, $plasData) or die ("Cannot open $plasData");
      my ($geneName, $line2, $epitope, $epiSeq, $locBeg, $locEnd);
      my $ctr;
      while($line2=<geneREAD>) {
          $ctr=0;
          chomp($line2);
          if($line2=~ m/^Gene:\s*(.*)/) {
              $geneName= $1;
              if(grep(/^$geneName$/, @geneArr)) {
                  while(substr($line2, 0 , 3) ne "---") {
                      $line2=<geneREAD>;
                      chomp($line2);
                      if($line2=~ m/(\d+)\s+([A-Z]+)\s+(\d+)-(\d+)/) {
                          $epitope= $1;
                          $epiSeq= $2;
                          $locBeg=$3;
                          $locEnd=$4;
                          #call some values parsed earlier in the code from an array of hashes
                          #and compare them with locBeg & locEnd
                          foreach $snpID(@ {$snpHash{$geneName}}) {
                              if($locBeg <= $HoProPos{$snpID} && $HoProPos{$snpID} <= $locEnd) {
                                  #this $ctr thing is used to print out gene's name only once in the output file
                                  if($ctr== 0) {print OUT "Gene: $geneName\n";}
                                  print OUT "SNP $snpID found in epitope $epitope at protein position $HoProPos{$snpID}\n";
                                  $ctr=1;
                              }
                          }
                      }
                  }
              }
          }
      }

      Well, this is the part with the file handles you mentioned. As I told you, I only changed the *s to +s in the inner if. I somehow do not believe that the warnings pointing to the dashed lines of the file (in the * version of the code) have anything to do with these file handles, since they look pretty simple.

      Thank you for your help. The code looks much better with the new declarations and changes. Hopefully I'll get used to it over time.

Re: Argument "" isn't numeric in numeric le (<=)
by Marshall (Canon) on Aug 24, 2009 at 23:45 UTC
    I didn't completely understand what you are trying to do. For example, what is $HoProPos{$snpID}?

    Anyway 7 levels of "}" is very confusing to say the least and should be a super Red Flag.

    Below I just settled for parsing the input and making a data structure. It might give you some hints on how to simplify your code. I didn't understand the basic terminology so the parsing of the line is probably more complex than need be.

    A main point is that the depth level below is only 2 vs your 7 levels. The code below looks a bit strange because I don't understand the terminology or what you are trying to accomplish. I was not able to run your code.

    Have fun with this...I hope that something is useful for you.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my %Gene_DB;   # a hash of hash

    while (<DATA>)
    {
       my $gene = (/^Gene:\s+(\S+)\s*/)[0] || next;
       add_gene($gene);
    }

    sub add_gene
    {
       my $gene = shift;
       my @tokens;
       my %gene_hash;
       while (<DATA>)
       {
          last if (/^-/);        #End of Record
          next unless (/^\d/);   #the single line we care about!
          @tokens = split;
       }
       #I probably don't understand OP right!
       #but first 3 on line, last 2 on line, then what's left over
       my ($Epitope, $Sequence, $Location) = splice(@tokens, 0, 3);
       my ($Strain, $Confidence) = splice(@tokens, -2);
       my ($Protein) = join (" ",@tokens);
       @gene_hash{'Epitope', 'Sequence', 'Location',
                  'Strain', 'Confidence', 'Protein'}
                 = ($Epitope, $Sequence, $Location,
                    $Strain, $Confidence, $Protein);
       $Gene_DB{$gene}=\%gene_hash;
    }

    print Dumper (\%Gene_DB);

    #prints:
    #$VAR1 = {
    #          'PF14_0747' => {
    #                           'Protein' => 'Plasmodium falciparum',
    #                           'Epitope' => '26850',
    #                           'Confidence' => 'Medium',
    #                           'Strain' => '3D7',
    #                           'Location' => '1914-1917',
    #                           'Sequence' => 'IKND'
    #                         },
    #          'PF14_0711' => {
    #                           'Protein' => 'Plasmodium falciparum',
    #                           'Epitope' => '26850',
    #                           'Confidence' => 'Medium',
    #                           'Strain' => '3D7',
    #                           'Location' => '9-12',
    #                           'Sequence' => 'IKND'
    #                         }
    #        };
    #

    __DATA__
    Gene: PF14_0747
    TABLE: Epitopes from IEDB
    Epitope Sequence Location on Protein Strain Confidence
    26850 IKND 1914-1917 Plasmodium falciparum 3D7 Medium
    ------------------------------------------------------------
    Gene: PF14_0711
    TABLE: Epitopes from IEDB
    Epitope Sequence Location on Protein Strain Confidence
    26850 IKND 9-12 Plasmodium falciparum 3D7 Medium
    ------------------------------------------------------------

      Dear Marshall, I do not know if you're interested in this code at all :) but let me clarify a little bit. Let me start with explaining what $HoProPos{$snpID} is.

      %HoProPos is a hash, as you might guess, which contains the positions of SNPs (single nucleotide polymorphisms) in a protein structure. The SNP IDs are the keys of this hash, and since every SNP ID is unique I thought hashing them was a good way to get fast and easy access to their locations in the proteins by name. Now, where do I get this data, since it's not in the code above? I parse it from another input file earlier in the code, which I thought would be unnecessary to post.

      So, the thing I was trying to accomplish is basically this: parse the first input file and store the SNP IDs, the genes they are located in (@geneArr), and the position in the protein (the translated form of that gene) at which they occur ($HoProPos{$snpID}).

      Then I open another file (the file above, in the original post), which has information about genes and the epitopes they contain. I check whether the gene name exists in my @geneArr, which contains the genes with SNPs in them; if it does, I check the location of the corresponding SNP to see whether it lies between the start and stop points of the epitope ($locBeg <= $HoProPos{$snpID} && $HoProPos{$snpID} <= $locEnd).
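
      That range check could be pulled out into a small sub. This is a sketch only: the sub name snp_in_epitope, the SNP IDs and their positions are invented for illustration and are not taken from the real input files.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a SNP position lies inside the epitope
# range, with both end points included.
sub snp_in_epitope {
    my ($pos, $locBeg, $locEnd) = @_;
    return $locBeg <= $pos && $pos <= $locEnd;
}

# Made-up data in the shape described above.
my %HoProPos = (snp1 => 1916, snp2 => 40);
my %snpHash  = (PF14_0747 => ['snp1', 'snp2']);

my ($geneName, $epitope, $locBeg, $locEnd) = ('PF14_0747', 26850, 1914, 1917);
for my $snpID (@{ $snpHash{$geneName} }) {
    next unless snp_in_epitope($HoProPos{$snpID}, $locBeg, $locEnd);
    print "SNP $snpID found in epitope $epitope at protein position $HoProPos{$snpID}\n";
}
```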

      As I said, the code works pretty fine right now. The only thing I could not understand is that when Perl threw the warnings, they pointed to the lines with dashes. As far as I understand programming, the dashed lines should never have gone into the if in which the comparison takes place (before or after replacing * with +). So I don't know why Perl pointed to those lines.

      Thank you for your time.
        Dear Hotel,

        I was genuinely trying to help you. Without you explaining what $HoProPos{$snpID} is, I would have no way to understand it.

        I would make it a goal to have the code run "clean" without warnings being produced. At this instant in time, I am unable to provide the "magic answer" to your regex question. I will consider it and let you know if I figure it out.

        You should be aware that my post was meant in a positive way.