scain has asked for the wisdom of the Perl Monks concerning the following question:

Hello again.

On Friday, I wrote this node asking for help in understanding how the <> operator works in scalar context, and I gave test code and output that mimicked the problem I was seeing. However, it now seems that the sample code was too simple, so I've reworked the problem.

The original code was written to traverse several subdirectories, getting one file per directory, parsing some information and writing output to a central file. These subdirectories are usually numbered starting with 0 and running up to 500 or 1000. The test code that I present below is designed to run in the main directory and get the appropriate file in each subdirectory so that it can be opened, parsed, etc.

#!/usr/bin/perl -w
use strict;

my $EXON_IN;
for (my $i=0; $i < 5; $i++){
    $EXON_IN = <$i/*test.txt>;
    print "$i, -$EXON_IN-\n";
}

Some of the comments offered on the previous node on this topic focused on the fact that it was a constant string in the <>, so I have changed the test code to more accurately reflect the real situation. Also, another comment seemed to indicate (though I am honestly not sure I understand) that since this was the "same" fileglob each time through, I was getting the alternating behavior shown by the output below. This fileglob is different in two respects though: it has the subdirectory in it ($i), and it has a wild card. It produces the same sort of output though:

[scott@blast test]$ ./test.pl
0, -0/0_test.txt-
Use of uninitialized value in concatenation (.) or string at ./test.pl line 7.
1, --
2, -2/2_test.txt-
Use of uninitialized value in concatenation (.) or string at ./test.pl line 7.
3, --
4, -4/4_test.txt-

Here is the directory structure:

[scott@blast test]$ ls -R
.:
0  1  2  3  4  test.pl

./0:
0_test.txt

./1:
1_test.txt

./2:
2_test.txt

./3:
3_test.txt

./4:
4_test.txt

Your continued help and guidance in this matter is greatly appreciated.

Scott

Replies are listed 'Best First'.
Re: More Fileglob in scalar context question
by chipmunk (Parson) on Jul 02, 2001 at 19:44 UTC
    Here's a helpful explanation from perlop of this odd behavior of the glob operator:
    A glob evaluates its (embedded) argument only when it is starting a new list. All values must be read before it will start over. In a list context this isn't important, because you automatically get them all anyway. In scalar context, however, the operator returns the next value each time it is called, or a undef value if you've just run out. As for filehandles an automatic defined is generated when the glob occurs in the test part of a while or for - because legal glob returns (e.g. a file called 0) would otherwise terminate the loop. Again, undef is returned only once. So if you're expecting a single value from a glob, it is much better to say ($file) = <blurch*>; than $file = <blurch*>; because the latter will alternate between returning a filename and returning FALSE.
    So, even though you've changed the argument to this glob() in your loop, the glob doesn't notice that the argument has changed until it finishes with the argument it got the first time it was called. And it takes two iterations for the glob to finish that argument: once to get the matching file and once to realize that there are no more files.

    If you use the doc's suggested solution your problem should disappear!
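
The perlop passage can be checked with a small self-contained script. This is only a sketch: it builds a throwaway copy of the 0..4 layout from the question in a temporary directory (the scratch directory and the variable names are my own, not from the original code) and records what the same pattern returns in scalar and in list context on each pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Build a scratch copy of the directory layout from the question
# (assumption: a temp dir stands in for the real "test" directory).
my $orig_dir = getcwd();
my $scratch  = tempdir(CLEANUP => 1);
chdir $scratch or die "cannot chdir to $scratch: $!";
for my $i (0 .. 4) {
    mkdir $i or die "cannot mkdir $i: $!";
    open my $fh, '>', "$i/${i}_test.txt" or die "cannot create file: $!";
    close $fh;
}

my (@scalar_results, @list_results);
for my $i (0 .. 4) {
    # Scalar context: this one glob op keeps its iterator between passes,
    # so every second call only reports "no more files" (undef).
    my $scalar = <$i/*test.txt>;
    push @scalar_results, $scalar;

    # List context: the glob runs to completion on every pass.
    my ($first) = <$i/*test.txt>;
    push @list_results, $first;
}

# Step out of the scratch directory so it can be cleaned up at exit.
chdir $orig_dir or die "cannot chdir back to $orig_dir: $!";

for my $i (0 .. 4) {
    my $s = defined $scalar_results[$i] ? $scalar_results[$i] : '(undef)';
    my $l = defined $list_results[$i]   ? $list_results[$i]   : '(undef)';
    print "$i: scalar=$s list=$l\n";
}
```

The scalar column should come back undefined for directories 1 and 3, matching the output in the question, while the list column is filled in on every pass.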

      chipmunk,

      Thank you. I can't figure out how I missed that in perlop, as I did spend some time trying to figure this out before posting.

      As for how to do this better (i.e., so it works), I posted a follow-up to my node on Friday indicating that I was more comfortable doing this in list context anyway. That is, get all of the matching files before going into the loop and then use foreach to go through each of them.

      Thanks again,
      Scott
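
For what it's worth, the list-context approach described here might look something like the sketch below. The helper name exon_files and the $root argument are made up for illustration. It also re-sorts the matches by the numeric directory name, on the assumption that with directories running up to 500 or 1000 you want 2 before 10 rather than the string order glob returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every */*test.txt under $root, ordered by
# the numeric name of the subdirectory.
# Assumption: $root contains no whitespace, since glob() splits its
# argument on whitespace.
sub exon_files {
    my ($root) = @_;

    # Function-call glob in list context returns all matches at once,
    # so there is no iterator state left over between calls.
    my @files = glob("$root/*/*test.txt");

    # glob returns names in string order ("10" before "2"); remember the
    # numeric directory for each file so we can sort on it.
    my %dir_num;
    for my $file (@files) {
        my ($num) = $file =~ m{/(\d+)/[^/]*$};
        $dir_num{$file} = defined $num ? $num : -1;
    }

    my @sorted = sort { $dir_num{$a} <=> $dir_num{$b} || $a cmp $b } @files;
    return @sorted;
}

# Get all of the files first, then loop over them.
my $root = @ARGV ? $ARGV[0] : '.';
foreach my $file (exon_files($root)) {
    print "$file\n";
    # open, parse, and write to the central output file here
}
```

Run from the main directory with no arguments, this prints one path per subdirectory; the open/parse step is left as a comment because the real parsing code isn't shown in the thread.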

Re: More Fileglob in scalar context question
by runrig (Abbot) on Jul 02, 2001 at 19:39 UTC
    You will get the same alternating effect (because you're not finishing the glob cycle) even though the value in the glob is variable. You need something like:

    my $EXON_IN;
    for my $i (0..4) {
        # Cycle through all of them (including the undef);
        # each match arrives in $_
        while (<$i/*test.txt>) {
            $EXON_IN = $_;
        }

        # OR put in list context
        ($EXON_IN) = <$i/*test.txt>;

        print "$i, -$EXON_IN-\n";
    }

Re: More Fileglob in scalar context question
by kschwab (Vicar) on Jul 02, 2001 at 20:06 UTC
    Another option would be to skip the glob operator altogether, and use (open|read|close)dir instead.

    This would skip the odd glob behavior, and also get around some other glob() limitations you might hit later (like "too many args").

    Here's my shot at converting your short example:

    for (my $i=0; $i < 5; $i++){
        opendir(DIR,$i) or die();
        foreach my $file (readdir(DIR)) {
            # if the filename ends in "test.txt"...
            # adjust the regex as needed
            if ($file =~ /test\.txt$/) {
                print "$i, -${file}-\n";
            }
        }
        closedir(DIR);
    }

    Update: per scain's comment, it looks like perl's glob was updated in 5.6.0 to use an internal routine on most implementations. OTOH, it looks like p5p is saying they will tie it to File::Glob, which sucks in Exporter, etc., probably slowing it down.

    As far as which one is faster, using the short example, readdir() is at least twice as fast on perl 5.6.0 and hundreds of times faster on perl 5.5.003 ( where glob() still calls out to csh). Here's what I tried, after making a few test directories and files:

    use Benchmark;

    timethese(10000, {
        'readdir' => sub {
            my @blah;
            for (my $i=0; $i < 5; $i++){
                opendir(DIR,$i) or die();
                foreach my $file (readdir(DIR)) {
                    if ($file =~ /test.txt$/) {
                        push(@blah,$file);
                    }
                }
                closedir(DIR);
            }
        },
        'glob' => sub {
            my @blah;
            for (my $i=0; $i < 5; $i++){
                for (<$i/*test.txt>) {
                    push(@blah,$_);
                }
            }
        }
    } );

    Results on my perl 5.6.x box:

    Benchmark: timing 10000 iterations of glob, readdir...
          glob: 12 wallclock secs ( 3.24 usr + 2.22 sys = 5.46 CPU) @ 1831.50/s (n=10000)
       readdir: 3 wallclock secs ( 1.54 usr + 1.69 sys = 3.23 CPU) @ 3095.98/s (n=10000)

    on a 5.005_003 box (1000 iterations, because it's so slow):

    Benchmark: timing 1000 iterations of glob, readdir...
          glob: 216 wallclock secs ( 1.83 usr 9.21 sys + 63.22 cusr 107.38 csys = 0.00 CPU)
       readdir: 1 wallclock secs ( 0.43 usr + 0.37 sys = 0.80 CPU)

      kschwab,

      I was under the impression that glob worked in this way now anyway, as opposed to the old way using the shell. If it does work the way I think, the "too many args" thing should be a thing of the past, no? And if that is true, it seems like it might be faster to get all of the files with one glob, and then loop through them.

      Thanks,
      Scott