ringleader has asked for the wisdom of the Perl Monks concerning the following question:

Sorry about the length of this, but since it's all of a piece... well, here goes.

I am having a problem with (I think) the outer loop in this code snippet.

Description
The point of the script is to loop through a file that has been split into a 2D array:

YAL038W 1.1     2.4     4.1     YCL040W 1.1     1.6     1.8     9.11    0.0402128119838095

such that the identifiers (in this case) are yal038w and ycl040w, and the rest are their 'values' (of which there may be any number). While there are still digits found by my regexp (i.e. we haven't hit the other identifier yet), keep trying to match $pathway_name[$m] (e.g. 1, 2, 26) to the value after the identifier. If matched, go to the next row of identifier/values. After all rows are done, go to the next pathway number and check.

for($m = 0; $m < scalar(@pathway_name); $m++)   # foreach pathway name given by an integer
{
    print "Pathway is $pathway_name[$m]\n";     # e.g. pathway name is 1
    for($j = 0; $j < $rows; $j++)               # for each row in my 2-D converted input file $file[][]
    {
        $n = 1;
        #print "Gene: $file[$j][0], Value: $file[$j][1]\n";
        if($seen{$file[$j][0]})                 # if the indexing element has already been seen
        {
            #print "Seen this Gene: $file[$j][0]\n";   # go on to the next row
        }
        elsif($file[$j][1])                     # if the value exists
        {
            $n = 1;
            VALUE:while($file[$j][$n] =~ /\b\.+\b/gi)  # while there is still a digit there (i.e. not a name)
            {
                #print "Gene: $file[$j][0], Value:$file[$j][$n]\n";
                #print "Entering if statement in while block...\n";
                if($file[$j][$n] =~ /\b\Q$pathway_name[$m]\E\.\d+/)  # if matches the combination of pathwayname.digit.digit
                {
                    print OUTPUT "$file[$j][0]\t$pathway_name[$m]\n";
                    print "Gene: $file[$j][0], Value: $file[$j][$n] printed to output file\n";
                    #print "Going to last value.\n";
                    $seen{$file[$j][0]} = 1;    # add the gene to the %seen hash for this pathway
                    last VALUE;
                }
                else
                {
                    #print "Gene: $file[$j][0], Value: $file[$j][$n] not matched.\n";
                    $n++;                       # go on to the next value
                }
            }                                   # end while
        }                                       # end elsif
        else
        {
            print "Skipped gene $file[$j][0]\n";   # go on to the next line
        }
    }                                           # end while ($j < $row)
    undef %seen;                                # since going on to the next pathway, undefine %seen
}

The Problem
The loop keeps skipping every second $m from the outer for loop. It enters the loop, prints out the pathway name, and then increments $m without apparently checking the if/else statements. For example, in the input line above, it will find 1.1 and 4.1, but not 2.4.

I'm using Perl 5.6.1 for i386-linux if that matters, with strict, diagnostics, and warnings turned on.

<edited for cleanup on aisle three>

janitored by ybiC: Balanced <code> tags around longish codeblock, for less vertical scrolling

Replies are listed 'Best First'.
Re: Loop skipping
by bgreenlee (Friar) on Aug 11, 2004 at 15:08 UTC

    I tried to follow your code to figure out what was going on, but my head started hurting, so instead I'll offer a simpler way to do what I think you're trying to do (which I'm not really clear on, so I could be way off base). Assuming that you're trying to massage lines in this format:

    YAL038W 1.1 2.4 4.1 YCL040W 1.1 1.6 1.8 9.11 0.0402128119838095

    Into a structure like so:

    %hash = ('YAL038W' => [1.1, 2.4, 4.1], 'YCL040W' => [1.1, 1.6, 1.8, 9.11, 0.0402128119838095], ...);

    then I would do something like so with each line:

    my $cur_identifier = '';
    chomp $line;
    foreach my $element (split /\s+/, $line) {
        # anchored so that identifiers containing digits (e.g. YAL038W) are not taken for numbers
        if ($element !~ /^\d+(?:\.\d+)?$/) {    # anything that is not an int or float number
            $cur_identifier = $element;
        }
        else {
            push @{$hash{$cur_identifier}}, $element;
        }
    }

    This splits the line on whitespace, then loops through the resulting list. Whenever it encounters an element that doesn't look like a number, it considers it the start of a new 'identifier' (hash key), and subsequent numbers are pushed onto the array referenced by that key, until the next identifier is reached.
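A minimal, self-contained sketch of the approach described above, run against the sample line from the question. The sub name parse_line is hypothetical, and the numeric test is anchored with ^ and $ as an assumption, so that an identifier containing digits (such as YAL038W) is not mistaken for a number.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one whitespace-separated line into
# identifier => [values] pairs.
sub parse_line {
    my ($line) = @_;
    my %hash;
    my $cur_identifier = '';
    chomp $line;
    foreach my $element (split ' ', $line) {
        if ($element !~ /^\d+(?:\.\d+)?$/) {
            # not a plain int or float: treat it as a new identifier
            $cur_identifier = $element;
        }
        else {
            # a number: file it under the most recent identifier
            push @{ $hash{$cur_identifier} }, $element;
        }
    }
    return %hash;
}

my %hash = parse_line("YAL038W 1.1 2.4 4.1 YCL040W 1.1 1.6 1.8 9.11 0.0402128119838095\n");
for my $id (sort keys %hash) {
    print "$id => @{ $hash{$id} }\n";
}
```

Note that values such as 01.05.04 or negative numbers, which appear in the larger data set, would need a looser pattern than the one assumed here.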

    -b

      Incredibly sorry about the head-hurtingness, yet appreciative of the help :D
      Guess when I stare at my code all day, I don't realise that it makes sense only to me (that, and being primarily a biologist, doesn't help).

      I have many files with tens of thousands of lines like these:
      YAL038W 1.1 2.4 4.1 YCL040W 1.1 1.6 1.8 9.11 0.0402128119838095 YDR132C 99 YDR223W 99 0.0085523710563531 YDL188C 01.05.04 10.03.01 40.01 42.0 43.01.03.05 YGL134W 01.05.04 02.19 -0.0831302979427955

      I have these read into a 2-D array already.
      Now, with the code I posted, I am trying to see whether, for each identifier on the left side of the line, one of its values starts with a number from the list I have stored in an array (@pathway_name: these are simple integers like 1, 2, 3).
      That's where the awful while loop thingy came from. My code only has to work, not be pretty ;)
      The %seen hash refers to whether the identifier has already been seen on the iteration for this pathway value. I'm beginning to think that may be causing the problem, although there are no error messages displayed.

      My problem is that when it comes to going on to the next $m in the outer for loop, it skips every second one. It will enter the for loop, but not the if/elsif/else statements.

      Is this any clearer?
        It would help if you showed us what output you are getting, and what output you are WANTING to get, for the given sample data you have shown us.

        Then we can write your application for you! :) Or at least show you where yours isn't working quite right.

Re: Loop skipping
by Solo (Deacon) on Aug 11, 2004 at 17:39 UTC
    Have you tried adding some print statements showing the values of your variables at each pass? This should help a lot with debugging (unless you are savvy with the debugger--but then why are you asking us! ;p) For instance, at loops and conditional statements, add something like:

    print <<END_DEBUG if $DEBUG;
    \$m = $m
    \$n = $n
    \$j = $j
    \$rows = $rows
    END_DEBUG

    The variables $m and $n are easy to mix up... if you didn't cut and paste your exact code, are you sure you haven't accidentally typed $m++ rather than $n++ somewhere? The debug output will help you locate mistakes that you can't find while staring at the code.

    (Don't forget to put this somewhere near the top of your code.)

    my $DEBUG = 1;

    --Solo
    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
Re: Loop skipping
by Sandy (Curate) on Aug 11, 2004 at 20:22 UTC
    To reiterate previous posts, it would be easier if we could see the actual structure of @file.

    However...

    based on the comments that go along with your code, there are possibly two flaws:

    VALUE:while($file[$j][$n] =~ /\b\.+\b/gi) # while there is still a digit there (i.e. not a name
    This does not do what it says. The condition is satisfied by values of the form "word.word", where the period really is a period! It will be satisfied by "1.2" because "1" is a word and "2" is a word. It will NOT be satisfied by the number "99", so the loop stops checking this gene altogether (even if there are more numbers after it). '99' will never satisfy the while condition!
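As a quick check of the claim above, here is a small sketch that tests the original loop condition (without the g modifier, to keep each test independent) against a few sample values. The sub name satisfies_original_condition is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: does a value satisfy the original while condition?
sub satisfies_original_condition {
    my ($value) = @_;
    # one or more literal periods with a word character on each side
    return $value =~ /\b\.+\b/ ? 1 : 0;
}

for my $value (qw(1.2 word.word 99 YAL038W)) {
    printf "%-10s %s\n", $value,
        satisfies_original_condition($value) ? 'matches' : 'no match';
}
```

Only the first two values match: "99" and "YAL038W" contain no period, so a plain integer ends the while loop just as an identifier does.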

    Secondly, the g modifier, when the match is evaluated in scalar context (as it is in a while condition), makes Perl remember the position in the variable where the last successful match ended.

    So the first time it tries to match $file[1][1] it matches, and puts its position reminder just past the match. When "pathway" is set to "2", and it tries to match $file[1][1] again, it starts from that remembered position, finds nothing more, and returns false, never testing any of the other values. The failed match resets the position, which is why the following pathway works again and only every second one is skipped.
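The effect can be reproduced in isolation. The sketch below (the sub name repeated_matches is hypothetical) applies the same pattern to one value several times, with and without the g modifier, and records whether each attempt succeeds.

```perl
use strict;
use warnings;

# Hypothetical helper: match the same string repeatedly and record
# success (1) or failure (0) of each attempt.
sub repeated_matches {
    my ($value, $attempts, $use_g) = @_;
    my @results;
    for (1 .. $attempts) {
        # With /g in scalar context the search resumes from pos($value);
        # a failed attempt resets pos($value) to the start of the string.
        my $ok = $use_g ? ($value =~ /\b\.+\b/g) : ($value =~ /\b\.+\b/);
        push @results, $ok ? 1 : 0;
    }
    return @results;
}

my @with_g    = repeated_matches('1.1', 4, 1);
my @without_g = repeated_matches('1.1', 4, 0);
print "with /g:    @with_g\n";      # with /g:    1 0 1 0
print "without /g: @without_g\n";   # without /g: 1 1 1 1
```

The alternating result with the g modifier corresponds to the "every second pathway" symptom described in the question.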

    So, with the assumption that @file looks like this (one value per $file[$row][$n]), and note that I also added a few more test conditions to your input data,

    my @file;
    push @{$file[0]}, qw(YAL038W 1.1 2.4 4.1 YCL040W 99.2 1.1 1.6 1.8 9.11 0.040212811);
    push @{$file[1]}, qw(YDR223W 99 4.7 0.0085523710);
    push @{$file[2]}, qw(YDL188C 1.5.4 10.3.1 13.32 YGL134W 01.05.04 02.19 0.083130297);
    $rows = 3;
    my %seen;
    @pathway_name = (1, 2, 4, 99);

    then I changed your VALUE:while... statement to the following (notice I got rid of the gi modifiers):

    VALUE:while($file[$j][$n] =~ /^[\.\d ]+$/)
    My results are (do not expect 99 to satisfy your condition of digit.digit in the later if statement):
    Pathway is 1
    Gene: YAL038W, Value: 1.1 printed to output file
    Gene: YDL188C, Value: 1.5.4 printed to output file
    Pathway is 2
    Gene: YAL038W, Value: 2.4 printed to output file
    Pathway is 4
    Gene: YAL038W, Value: 4.1 printed to output file
    Gene: YDR223W, Value: 4.7 printed to output file
    Pathway is 99
    Hope this helps,

    Sandy

      You're right Sandy.
      I removed the g modifier about five minutes before you posted, and it worked.

      Nice job, lads :-D
      It's always something stupidly small with my code...
      Thanks for all the help; conversely, sorry bout all the headaches!
Re: Loop skipping
by johnnywang (Priest) on Aug 11, 2004 at 19:27 UTC
    I also tried hard to follow the code, and also got a headache. I'll second husker's comment: please give us a running program, i.e., the @pathway_name array and a short "@file" array, and indicate what you expect and what you are getting.

    Another suggestion: please use "use strict", it will catch many errors.