abhishes has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,
I am facing a strange problem. I wrote a program like
print "opening file 1\n"; myfile(); print "opening file 2\n"; myfile(); print "opening file 3\n"; myfile(); sub myfile { open FILE, "xml.log"; my @lines = <FILE>; close(FILE); }

When I run my code, the output is
opening file 1
opening file 2

Then my program hangs! Why does it hang? I did close the file in my first call to the function. When I call the function the second time I should get a new file handle... shouldn't I?

Please help .... I am confused. I am using ActiveState Perl 5.6.1 on Windows 2000.

regards,
Abhishek.

Replies are listed 'Best First'.
Re: opening a file in a subroutine
by pg (Canon) on Feb 09, 2003 at 17:59 UTC
    The key point to improving performance in this case is to avoid element-by-element array replication. The following solution is straightforward, simple, and easy to understand; more importantly, it is fast and its performance is steady.

    Tested other solutions with 5.8/win98; this seems to be the fastest.
    use strict;

    for (1..10) {
        my $t0 = time();
        my $lines = read_log();
        print "Used ", time() - $t0, "\n";
    }

    sub read_log {
        open(FILE, "<", "test.txt");
        my @lines = <FILE>;
        close(FILE);
        return \@lines;
    }
      Nope.

      This code suffers from the same bug as the other code.

      Notice that he said perl 5.6 on Win2k.


      Updated

      Also tested other solutions; with 5.6 on win98, this yielded the best performance.

      Please do show your benchmarks.

      Benchmark: running list_io, split_slurp, while_io, each for at least 5 CPU seconds...
          list_io:  5 wallclock secs ( 3.89 usr + 1.34 sys = 5.24 CPU) @ 1721.98/s (n=9018)
      split_slurp:  4 wallclock secs ( 3.02 usr + 2.21 sys = 5.23 CPU) @ 2725.38/s (n=14251)
         while_io:  7 wallclock secs ( 3.64 usr + 1.38 sys = 5.02 CPU) @ 1629.26/s (n=8174)

                      Rate    while_io     list_io split_slurp
      while_io      1629/s          --         -5%        -40%
      list_io       1722/s          6%          --        -37%
      split_slurp   2725/s         67%         58%          --
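      For reference, here is a minimal sketch of how a comparison like this could be set up with the core Benchmark module. The sub bodies and the file name are assumptions based on the labels above, not the exact code that produced these numbers:

      use strict;
      use Benchmark qw(timethese cmpthese);

      my $file = "xml.log";    # file name assumed

      cmpthese( timethese( -5, {
          # read the whole file as a list of lines
          list_io => sub {
              open my $fh, "<", $file or die $!;
              my @lines = <$fh>;
              close $fh;
          },
          # slurp into one string, then split on newlines
          split_slurp => sub {
              open my $fh, "<", $file or die $!;
              local $/;
              my @lines = split /\n/, <$fh>;
              close $fh;
          },
          # read line by line in a while loop
          while_io => sub {
              open my $fh, "<", $file or die $!;
              my @lines;
              while (my $line = <$fh>) { push @lines, $line }
              close $fh;
          },
      } ) );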
      Don't be confused into thinking that local $/, reading the file in, and then splitting it would improve performance.

      In light of the evidence I think you will have to reconsider.

      (As I will explain, whether you use scalar context does not have much to do with performance.

      Depends what you mean. Reading in smaller chunks at a time reduces memory overhead and can thus have a significant effect on run time.
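      As an aside, one way to read in fixed-size chunks rather than whole lines is to set $/ to a reference to an integer, which makes the readline operator return fixed-size records. A minimal sketch, with the file name and chunk size assumed:

      use strict;

      open my $fh, "<", "xml.log" or die $!;    # file name assumed
      {
          local $/ = \65536;                    # read 64K records instead of lines
          while (my $chunk = <$fh>) {
              # process $chunk here; only one 64K record is held at a time
          }
      }
      close $fh;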

      On the other hand, keep in mind that split is not free: it has to walk through the whole string. As anyone with a C background would know, string operations hurt performance a lot, especially this kind of operation that walks the string from head to toe.)

      You would think so at first glance. As I said, the evidence contradicts you.

      Perhaps it's due to perl being able to allocate one buffer, sysread the lot, and then walk the string. It may in fact be that this is more efficient than reading whatever the standard-size buffer is for PerlIO, scanning it for newlines, then reading another buffer... (assuming of course memory is available).
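      A rough sketch of that idea, i.e. slurping the whole file into a single buffer with sysread and then walking the string (file name assumed; a short read is simply treated as fatal here):

      use strict;

      open my $fh, "<", "xml.log" or die $!;           # file name assumed
      my $size = -s $fh;                               # one buffer for the whole file
      sysread($fh, my $buf, $size) == $size or die "short read";
      close $fh;

      # walk the string afterwards, e.g. split it into lines
      my @lines = split /\n/, $buf;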

      The first layer, the physical reading layer: Perl reads block by block, no matter whether your code asks for scalar context or list context. This makes sense; it is optimized for Perl's own performance.

      By "block by block" presumably you mean buffer by buffer.

      The second layer, the layer between Perl and your program: Perl presents the data in the context you asked for. This layer doesn't involve physical devices, and is much less relevant to performance than the first layer is.

      You have the return-type part of context mixed up with the actions that the context causes to happen. In list context the IO operator does something different from what it does in scalar context. In list context it causes the entire file to eventually be read into memory and sliced up into chunks as specified by $/.

      In scalar context it reads enough to provide a single chunk at a time. If the chunks are small it may read more than one chunk into memory, but it doesn't necessarily load them all.
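      In other words (a minimal illustration; the file name is assumed):

      use strict;

      open my $fh, "<", "xml.log" or die $!;    # file name assumed

      # List context: the readline operator keeps reading until EOF and
      # returns every $/-delimited chunk at once, so the whole file ends
      # up in memory.
      my @all_lines = <$fh>;
      close $fh;

      open $fh, "<", "xml.log" or die $!;

      # Scalar context: each call returns just the next chunk; only what
      # has been read so far (plus PerlIO's buffer) is held in memory.
      while (my $line = <$fh>) {
          # process one line at a time
      }
      close $fh;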

      I will agree that I was surprised myself about these results. But you can't just say that something works the way it does, and that it should be faster, because you think so. Unless you have pored over the PerlIO code and benchmarked the issue in question rather exhaustively, you have no way to know how fast something is going to run in Perl.

      --- demerphq
      my friends call me, usually because I'm late....

        However, I do clearly see a positive sign here: your idea is changing. It is good for everyone to improve themselves through discussion and to exchange ideas. I do, you do, everyone does; that's why we all love this site!!
Re: opening a file in a subroutine
by demerphq (Chancellor) on Feb 09, 2003 at 14:34 UTC
    I see no reason this should hang. You said "a program like this". Does this exact code hang on the file in question? How big is that file? Have you somehow used up all your memory and the OS is thrashing?

    --- demerphq
    my friends call me, usually because I'm late....

      thanks for your reply.

      The file is only 1.5 MB. The first function call completes within a matter of seconds; then the second function call takes a very long time to complete. I was wrong when I said that the program hangs... the second and third function calls take 5 minutes each to complete. This doesn't make sense to me, because if the first call took just 5-6 seconds, why did the second and third take such a long time?

        My guess is that the code that you are actually using looks more like
        sub read_file {
            my $file = shift;
            open my $fh, $file or die "$file : $!";
            my @lines = <$fh>;
            return \@lines;
        }
        In which case it's storing the data in memory, which is perhaps overflowing your available physical RAM. When that happens the OS starts swapping memory out to disk (obviously a slow operation), which, if it happens enough, leads to a condition called thrashing, where the OS is basically just moving memory back and forth from the disk, and the time taken for the repeated swapping completely overwhelms the time your code takes, making your code look like it hangs. Try using a memory monitoring tool, or do some program analysis to see exactly how much memory you are actually using. Hashes, for instance, consume far more memory than they appear to, as does careless array manipulation. For instance, if you have
        my @array;
        $array[100_000_000] = 1;
        then perl will have to allocate sufficient RAM for all 100 million slots, even though they aren't used.
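        One way to do that kind of analysis from inside the program is the Devel::Size module from CPAN (assuming it is installed); a minimal sketch, with the file name assumed and a smaller sparse array so that it actually runs quickly:

        use strict;
        use Devel::Size qw(total_size);           # CPAN module, assumed installed

        open my $fh, "<", "xml.log" or die $!;    # file name assumed
        my @lines = <$fh>;
        close $fh;

        # Reports the memory taken by the array and everything it holds,
        # which is typically quite a bit more than the file size on disk.
        print "lines: ", total_size(\@lines), " bytes\n";

        my @sparse;
        $sparse[1_000_000] = 1;                   # allocates slots for every index up to here
        print "sparse array: ", total_size(\@sparse), " bytes\n";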

        Without seeing real code (as I said, I don't think what you posted properly reflects what you are doing), it's hard to say for certain.

        Regards,

        UPDATE: Since you are on Win2k you should take advantage of the Task Manager and the Profiler that come with the OS. (Start -> Settings -> Control Panel -> Administrative Tools -> Profiler)

        --- demerphq
        my friends call me, usually because I'm late....

Re: opening a file in a subroutine
by pg (Canon) on Feb 10, 2003 at 12:48 UTC
    It is now 4:11 am here. To make it crystal clear: I am not back for a heated discussion. I am back to clear the water and to demonstrate why slurping a file is not a good practice. The moment I got the idea, I felt obligated to test it out and share the result with the monks.

    We all read the benchmark from demerphq. His data clearly showed us that slurping a file by reading it in as one string and then splitting is much faster than all the other solutions.

    But what does that mean? I have been thinking about that since I saw his result. I couldn't really get it until the moment I told myself: yes, it only means that the reading/slurping ITSELF is fast, but holding the whole file in memory could seriously slow down the application as a whole, and make the speed gained at slurping time not just nothing, but something that will bite us. Then I decided to design a test case to demo this.

    All I need to do is cause PAGING. To do that, I don't really need some huge file; I only need it big enough that, together with the other parts of the application, it causes PAGING.

    This time I decided to come back with SOLID DATA, not just like what I did last time, with blahblah... ;-)

    I first wrote this piece to prepare my testing data:
    test_pre.pl:

    use strict;

    open(DATA, ">", "test.dat");
    foreach (1..$ARGV[0]) {
        print DATA "$_\n";
    }
    close(DATA);
    Then I made up two simple programs: one that slurps with split (test_slurp.pl) and then DOES SOMETHING, while the other reads the file line by line (test.pl) and then does EXACTLY THE SAME THING. The whole point is to DO SOMETHING, not just read; one would NEVER read a file without using it. Also, this SOMETHING has to be simple, straightforward, and something that could happen EVERY day:
    test_slurp.pl:

    use strict;
    use constant SIZE => 10000;

    my $t0 = time();
    my $lines = read_log_s();
    my @data = (0..SIZE - 1);
    foreach my $line (@{$lines}) {
        $line += ($data[$line % SIZE] - $data[($line + 1) % SIZE]);
    }
    print "Used ", time() - $t0, "\n";

    sub read_log_s {
        local $/;
        open(FILE, "<", "test.dat");
        my @lines = split /\n/, <FILE>;
        close(FILE);
        return \@lines;
    }

    test.pl:

    use strict;
    use constant SIZE => 10000;

    my $t0 = time();
    my @data = (0..SIZE - 1);
    open(FILE, "<", "test.dat");
    my $line;
    while ($line = <FILE>) {
        $line += ($data[$line % SIZE] - $data[($line + 1) % SIZE]);
    }
    close(FILE);
    print "Used ", time() - $t0, "\n";
    I tried with
    perl -w test_pre.pl 100000
    
    which created a file test.dat of only 688K. Then I tried both test_slurp.pl and test.pl with this file on my win98 box. test_slurp used 9 seconds, while test used only 2. (Well, my PC is slow, sorry about that ;-)

    The whole test is not complex; as you can see, it is really simple. The only idea is to DO SOMETHING. As we can now see, even though the slurping itself was much faster, it easily caused the whole application to slow down significantly.

    (If you want to repeat my testing, you may need to adjust the numbers to cause PAGING, depending on your PC configuration and OS).

    ...I have to go back to sleep ;-)

    Tested with perl -w test_pre.pl 1000000, file size just under 8 MB: test.pl used 30 seconds, while test_slurp.pl core dumped after running for a while (tried twice). Now I am really leaving... ;-)