Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I would not normally consider myself a newbie, but based on my current problem, I must have a long way to go. Basically, I'm trying to read two files and sort the columns of the second file according to those of the first. However, I'm having an issue looping through the header line of the two files. If I print the lines with
print "@firstLine1\n"; print "@firstLine2\n";
everything prints just fine. Also, if I use a foreach loop on the arrays individually, everything prints fine. But if I use a nested loop (because I want to search for a value in the second loop), the last item in the array is "missing." Strangely, if I print the value using the debugger, the value is there...hope that makes sense. Here's my code:
open my $fileText, '<', $file or die("Error opening \"$ARGV[0]\": $! ");
my @firstLine1;
my $count = 0;
while(<$fileText>){
    chomp;
    if($count++ == 0){
        # I will eventually read the whole file...
        @firstLine1 = split(/\t/);
        last;
    }
}
open my $fileText2, '<', $ARGV[1] or die("Error opening \"$ARGV[1]\": $! ");
my @firstLine2;
$count = 0;
while(<$fileText2>){
    chomp;
    if($count++ == 0){
        # I will eventually read the whole file...
        @firstLine2 = split(/\t/);
        last;
    }
}
print "@firstLine1\n"; # All values print fine!
print "@firstLine2\n"; # All values print fine!
foreach my $sample1 (@firstLine1){
    foreach my $sample2 (@firstLine2){
        # I get some strange output here.
        # Everything will print fine ->
        # $sample1: <sample1_value> $sample2: <sample2_value>
        # until the last value in the array,
        # everything gets messed up ->
        # sample1:$sample2: <sample2_value>
        print "\$sample1: \"$sample1\"\t\$sample2: \"$sample2\"\n";
    }
    # If I try a compare, it fails
    if($sample1 eq $sample2){
        # doesn't get here on the last value.
    }
}
Any help is appreciated. Thanks in advance.

Replies are listed 'Best First'.
Re: Misunderstood array behavior
by wfsp (Abbot) on Sep 20, 2008 at 07:05 UTC
    If you add use strict; use warnings; near the top of your code perl complains that
    Global symbol "$sample2" requires explicit package name at...
    This is because you are doing the comparison outside the inner loop (where $sample2 is out of scope). Move the if block inside the inner loop and it will run. Fix that first, and if it still doesn't behave as expected, show us what the first line of each file looks like.
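
    A minimal sketch of the fix described above, with the comparison moved inside the inner loop. The header values are hypothetical stand-ins for the first line of each file, not data from the original post.

```perl
use strict;
use warnings;

# Hypothetical header fields standing in for the first line of each file.
my @firstLine1 = ('id', 'alpha', 'beta');
my @firstLine2 = ('beta', 'id', 'alpha');

my @matches;
foreach my $sample1 (@firstLine1) {
    foreach my $sample2 (@firstLine2) {
        # $sample2 is only visible inside this inner block,
        # so the comparison has to live here.
        if ($sample1 eq $sample2) {
            push @matches, $sample1;
        }
    }
}
print "Matched: @matches\n";
```

    With use strict in effect, moving the if block back outside the inner loop makes the script fail to compile, which is the complaint quoted above.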
Re: Misunderstood array behavior
by GrandFather (Saint) on Sep 20, 2008 at 07:12 UTC

    $sample2 is local to the inner loop but is tested outside the inner loop - it doesn't exist there. use strict would have told you about that unless you have another lexical $sample2 whose scope encloses the for loops.

    Generally when you want to perform this sort of matching task in Perl you should first think "hash". Consider:

    use strict;
    use warnings;

    my $file1Data = "1\t2\t3\t4";
    my $file2Data = "5\t6\t7\t4";

    open my $fileText, '<', \$file1Data;
    my @firstLine1 = split /\t/, <$fileText>;
    close $fileText;

    open my $fileText2, '<', \$file2Data;
    my %firstLine2Fields = map {$_ => 1} split /\t/, <$fileText2>;

    foreach my $sample1 (@firstLine1) {
        print "Matched $sample1\n" if exists $firstLine2Fields{$sample1};
    }

    Prints:

    Matched 4

    Perl reduces RSI - it saves typing
Re: Misunderstood array behavior
by AnomalousMonk (Archbishop) on Sep 20, 2008 at 07:29 UTC
    The other Usual Suspect in a split situation is a trailing split character in the input string, possibly with whitespace after it.

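    E.g., if one of the strings you are splitting looks like "foo\tbar\tbaz\t" (note the trailing \t at the end), then you will have an empty string as the final element of the split output if trailing empty fields are kept (a negative LIMIT, as in split /\t/, $str, -1). With the default LIMIT, split drops trailing empty fields, but any whitespace after that last \t still comes through as a final field.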
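
    A short sketch of how split treats a trailing tab, since the details are easy to misremember. The strings are made-up examples; the behavior shown (default split drops trailing empty fields, a negative LIMIT keeps them) is as described in perldoc -f split.

```perl
use strict;
use warnings;

my $line = "foo\tbar\tbaz\t";

# Default split: trailing empty fields are dropped.
my @default = split /\t/, $line;

# A negative limit keeps trailing empty fields.
my @keep = split /\t/, $line, -1;

# Whitespace after the last tab survives as a final field.
my @padded = split /\t/, "foo\tbar\tbaz\t ";

printf "default: %d fields\n", scalar @default;
printf "limit -1: %d fields\n", scalar @keep;
printf "padded: %d fields, last is '%s'\n", scalar @padded, $padded[-1];
```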

    The other suggestion I would make would be to lose the confusing code construct

    while(<$fileText>){
        chomp;
        if($count++ == 0){
            # I will eventually read the whole file...
            @firstLine1 = split(/\t/);
            last;
        }
    }
    in favor of something like
    chomp($_ = <$file_handle>); # read, chomp one line
    my @split_fields = split /\t/;
    and eventually read the whole file separately.
Re: Misunderstood array behavior
by jethro (Monsignor) on Sep 20, 2008 at 11:40 UTC

    May I suggest a change to your testing code:

    # print "@firstLine1\n"; # All values print fine! MAYBE
    # print "@firstLine2\n"; # All values print fine!
    print '##',join('##',@firstLine1),"##\n";
    print '##',join('##',@firstLine2),"##\n";

    Provided there are no '##' in your lines (showing those lines would have helped, since your code is fine apart from the issue mentioned already), this will show you exactly what your arrays look like. Even better is the CPAN module Data::Dumper, especially when your data structures become more complex:

    use Data::Dumper;
    # print "@firstLine1\n"; # All values print fine! MAYBE
    print Dumper(@firstLine1);

    PS: You could test your script with only one file, i.e. yourscript filex filex. If you still get a missing value in the output, the script is to blame; otherwise it is your data.

      Even better is the CPAN module Data::Dumper
      Not only is it a CPAN module, but it is also a "Core Module". This means that it is part of the Perl distribution and does not have to be separately downloaded and installed. It also means that you can use [doc://Data::Dumper] to link to the Perl doc, like so: Data::Dumper. You probably knew all this, but just in case others were unaware...
        Actually, I didn't know this. Since I use Linux distributions that make it easy to add lots of non-core modules to the installed perl at installation, the distinction between core and non-core is in practice replaced by distribution and non-distribution.
Re: Misunderstood array behavior
by Anonymous Monk on Sep 20, 2008 at 14:53 UTC

    Thank you everyone for your suggestions. I need to clarify a little more.

    wfsp and GrandFather: I apologize for misplacing the 'if' statement. I accidentally pasted it outside the inner loop, but in my code it is inside the inner loop. The condition works in every instance, except on the last item in either array.

    GrandFather: I thought about using a hash, but I need to gather the files in order (by column) so that I can correctly order the second one. I cannot think of how to do that with a hash. It seems that a 2D array would be optimal. Do you have a suggestion on how to do it with a hash?

    AnomalousMonk: By the comment about "eventually" reading the whole file, I mean that I will eventually read it into a 2D array during that loop, but I simplified it for the posting. However, it still has the same behavior as is. I kept the loop to maintain what I would do later. Is there a better way to read in each row and column?

    jethro: Thank you for suggesting Dumper, I was not aware of it. It also perfectly shows my problem. When it prints the last item in both arrays, it's all messed up:

    #### BEGIN ####
    $VAR215 = 'MS02-19196-A6-DCIS';
    $VAR216 = 'MS02-19196-A6-INVASIVE';
    $VAR217 = 'MS01-9167-A7-DCIS';
    ';AR218 = 'MS06-1878-D2-DCIS
    #### END ####

    That is exactly how it prints. Also, when it gets to the if condition, the condition fails. However, at that very moment, I can print the value in the debugger with "p $sample1"

    AnomalousMonk suggested that it could be a problem with extra tab(s) at the end of the line, but I have double-checked that. This is really confusing to me. I also tried another file that is totally unrelated to what I'm doing, and it had the same behavior.

    I would like to post the file, but it has 218 columns and I don't see a way to upload it.

    Thanks for your help.

      $VAR217 = 'MS01-9167-A7-DCIS';
      ';AR218 = 'MS06-1878-D2-DCIS

      If this is exactly what you get from data dumper, then there is a carriage return at the end of the line (hex 0D).

      It might mean that you are using an MS-DOS file on Unix and your chomp only removes the line feed and not the carriage return. See the documentation for chomp and its dependence on $/. Setting $/ to "\r\n" would correct that, but then real Unix files would not work. If you need both file types to work, use a regex instead of chomp.
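
      A minimal sketch of the regex approach, assuming tab-separated header lines. strip_eol is a hypothetical helper name chosen for illustration; it removes one trailing line ending whether the file uses "\n" or "\r\n".

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing line ending, either "\n" or "\r\n".
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# The same made-up header line with Unix and DOS line endings.
my $unix = strip_eol("a\tb\tc\n");
my $dos  = strip_eol("a\tb\tc\r\n");

my @unix_fields = split /\t/, $unix;
my @dos_fields  = split /\t/, $dos;

# With the carriage return gone, the last fields compare equal.
print "last fields equal\n" if $unix_fields[-1] eq $dos_fields[-1];
```

      With plain chomp and the default $/, the last DOS field would still end in "\r", so the eq comparison on the final column would fail and the terminal output would look overwritten.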

      About GrandFather's suggestion: Is the ordering of both files important to the result? If not, you might put the second file into a hash instead of the first. But if you want helpful answers to that question, you might open a new thread and tell us exactly what you want to do with those two files.
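
      For what it is worth, here is one way a hash could still help with the column ordering: map each column name in the second file to its position, then walk the first file's header in order. This is only a sketch under assumptions: the data is made up and held in memory, and every name in the first header is assumed to appear in the second.

```perl
use strict;
use warnings;

# Hypothetical in-memory data standing in for the two files.
my @header1 = ('s1', 's2', 's3');
my @header2 = ('s3', 's1', 's2');
my @rows2   = ([30, 10, 20], [31, 11, 21]);

# Map each column name in file 2 to its position.
my %pos2;
@pos2{@header2} = (0 .. $#header2);

# For each column of file 1, in order, look up where it sits in file 2.
# Assumes every name in file 1 also appears in file 2.
my @order = map { $pos2{$_} } @header1;

# An array slice on each row puts the columns into file 1's order.
my @reordered = map { [ @{$_}[@order] ] } @rows2;

print join("\t", @header2[@order]), "\n";
print join("\t", @$_), "\n" for @reordered;
```

      The hash replaces the nested loop with a single lookup per column, while the ordering still comes from the first file's array.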

        For Pete's sake...I never would have suspected that because it seemed to be stomping on memory. I've dealt with these different line endings before, but never ran into that behavior.

        Thanks a ton for everyone who helped. I have to mention that this has been the most pleasant forum I've ever worked with. Thanks!

        By the way, I'm using tchomp (http://cpan.uwinnipeg.ca/htdocs/Text-Chomp/Text/Chomp.pm.html) to solve the problem. Do you see any reason not to always use tchomp in place of chomp?

      #### BEGIN ####
      $VAR215 = 'MS02-19196-A6-DCIS';
      $VAR216 = 'MS02-19196-A6-INVASIVE';
      $VAR217 = 'MS01-9167-A7-DCIS';
      ';AR218 = 'MS06-1878-D2-DCIS
      #### END ####
      This is why I always recommend $Data::Dumper::Useqq in such situations. Putting quotes or some other kind of delimiter around the variables you want to debug is of course a good thing, but if you're dealing with *lines* and have a problem, just use
      use Data::Dumper;
      $Data::Dumper::Useqq = 1; # shows all non-printable characters
      print Dumper \@lines;
      (I also prefer to dump a reference, this avoids the big mess of many $VAR314159...)
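
      Applied to the case in this thread, a small sketch might look like the following. The two array values are copied from the output quoted above, with a carriage return appended by hand to the last one to mimic the DOS line ending; that appended "\r" is an assumption made for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Last two header fields; the "\r" is added by hand to mimic a DOS file.
my @lines = ('MS01-9167-A7-DCIS', "MS06-1878-D2-DCIS\r");

local $Data::Dumper::Useqq = 1;   # print control characters as escapes
my $dump = Dumper \@lines;
print $dump;
```

      With Useqq set, the stray character is printed as a visible \r escape inside double quotes instead of moving the cursor back to the start of the line.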
      edit: I even have a useful mapping for vim on my homenode which lets you debug with only very few keystrokes. (for emacs it looks a bit more complicated)
        Thank you. That is a valuable tip.

        Do you see any reason not to always use tchomp in place of chomp?