There are many ways to read the last line of a file in Perl, but as usual, not all are equal.

#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];
use Tie::File;
use File::ReadBackwards;

for our $file ( qw[ data/500k.dat data/1000k.dat data/2MB.dat ] ) {
    print "\nComparing $file";
    cmpthese( -3, {
        'Tie::File' => q[
            my( @lines, $last );
            tie @lines, 'Tie::File', $file or die $!;
            $last = $lines[ -1 ];
            untie @lines;
            # print "TF:$last";
        ],
        'File::ReadBackwards' => q[
            my $last;
            tie *FILE, 'File::ReadBackwards', $file or die $!;
            $last = <FILE>;
            untie *FILE;
            # print "RB:$last";
        ],
        rawio => q[
            my( $last, $buffer );
            open FILE, '< :raw', $file or die $!;
            sysseek FILE, -1000, 2;
            sysread FILE, $buffer, 1000;
            $last = substr $buffer, 1 + rindex( $buffer, "\n", length( $buffer ) - 2 );
            close FILE;
            # print "raw:$last";
        ],
        readfwd => q[
            my $last;
            open FILE, '<', $file or die $!;
            $last = <FILE> until eof FILE;
            close FILE;
            # print "RF:$last";
        ],
    });
}
__END__
P:\test>354830

Comparing data/500k.dat
                        Rate Tie::File readfwd File::ReadBackwards  rawio
Tie::File             5.12/s        --    -94%                -99%  -100%
readfwd               90.0/s     1657%      --                -88%   -99%
File::ReadBackwards    775/s    15042%    762%                  --   -95%
rawio                14765/s   288372%  16314%               1805%     --

Comparing data/1000k.dat
                        Rate Tie::File readfwd File::ReadBackwards  rawio
Tie::File             2.51/s        --    -95%               -100%  -100%
readfwd               45.9/s     1730%      --                -95%  -100%
File::ReadBackwards    854/s    33931%   1759%                  --   -94%
rawio                15214/s   605971%  33011%               1681%     --

Comparing data/2MB.dat
                        Rate Tie::File readfwd File::ReadBackwards  rawio
Tie::File             1.25/s        --    -95%               -100%  -100%
readfwd               22.9/s     1736%      --                -98%  -100%
File::ReadBackwards   1040/s    83151%   4434%                  --   -93%
rawio                15087/s  1207959%  65694%               1351%     --

Note: The figures are for 3 files (25-character lines), on one OS (Win32/XP) and perl 5.8.3. YMMV.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Reading from the end of a file. (broken)
by tye (Sage) on May 20, 2004 at 15:54 UTC

    Note that your fastest case is also the only one that doesn't handle long lines. I find it a pretty poor benchmark to reimplement a module's logic badly and then wonder if it runs faster. Fix the 'rawio' case to handle things as well as File::ReadBackwards (and the other cases) do(es) and it'll be closer in speed. I'm sure it will still be faster, since it doesn't use tied handles or actually deal with a general purpose problem.
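
    A rough sketch of what such a fix might look like (the helper name is invented, the 1000-byte chunk size is arbitrary, and it still punts on the cross-platform line-ending handling that File::ReadBackwards provides): read backwards in fixed-size chunks so that lines longer than any single buffer are still found.

        use strict;
        use warnings;

        # Return the last line of $file by reading backwards in
        # fixed-size chunks, so the line length is unbounded.
        sub last_line_raw {
            my( $file ) = @_;
            my $chunk = 1000;
            open my $fh, '<:raw', $file or die $!;
            my $pos  = -s $fh;        # byte offset: start at end of file
            my $tail = '';
            while( $pos > 0 ) {
                my $read = $pos >= $chunk ? $chunk : $pos;
                $pos -= $read;
                seek $fh, $pos, 0 or die $!;
                read( $fh, my $buffer, $read ) == $read or die $!;
                $tail = $buffer . $tail;
                # Find the newline before the last line, skipping a
                # trailing "\n" at end-of-file.
                my $nl = rindex( $tail, "\n", length( $tail ) - 2 );
                return substr( $tail, $nl + 1 ) if $nl >= 0;
            }
            return $tail;             # the whole file is a single line
        }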

    But I'd still use File::ReadBackwards or "tail" since I don't find "get last line of a file" to be something I care to heavily optimize on the few occasions when I do it and the code to properly handle all of the boundary cases is not something I care to maintain on its own.
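
    For the simple case, File::ReadBackwards' object interface (rather than the tie interface used in the benchmark) keeps this very short. A minimal example, with a hypothetical filename:

        use File::ReadBackwards;

        # Open the file for backwards reading; the first line returned
        # is the last line of the file.
        my $bw = File::ReadBackwards->new( 'some.log' )
            or die "can't read 'some.log': $!";
        my $last = $bw->readline;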

    - tye        

      S'funny, that's pretty much exactly what I said here. I only included rawio for completeness.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
Re: Reading from the end of a file.
by fireartist (Chaplain) on May 20, 2004 at 13:22 UTC
    <second thought>
    Whenever I've thought of doing any benchmarking involving file I/O, I've always been stumped as to how I could be sure that filesystem caching didn't mess up the results.
    How can I be sure that doing a test X number of times will give results comparable to executing the code once in a real program? With such extremely different methods of reading, I can only imagine that it's even harder to be sure that caching isn't helping / hindering individual tests.
    </second thought>

    <first thought>
    Fairly impressive differences there!
    </first thought>

      Re: Your second thought.

      It's a fair point, but the stats show comparative differences, which means that only the first pass of the file is penalised: the file will then be in the cache for all subsequent passes.

      In the case of the figures shown, the case affected was File::ReadBackwards (by virtue of Benchmark running the test cases in alphabetically sorted order by name). As File::ReadBackwards managed to process the file at least 700 times in the allotted 3 seconds of CPU, regardless of the file size, the effect of the penalty for putting the file into the cache on the first pass is minimal. However, to preclude the possibility of any effect, I added the following line at the top of the for loop:

      ( undef ) = do{ local $/; open my $fh, '< :raw', $file or die $!; <$fh> };

      so as to preload the cache. The results of the re-run were nearly identical--certainly within the bounds of normal variance.

      P:\test>354830

      Comparing data/500k.dat
                              Rate Tie::File readfwd File::ReadBackwards  rawio
      Tie::File             5.15/s        --    -95%                -99%  -100%
      readfwd               94.4/s     1734%      --                -88%   -99%
      File::ReadBackwards    783/s    15101%    729%                  --   -95%
      rawio                15058/s   292394%  15852%               1824%     --

      Comparing data/1000k.dat
                              Rate Tie::File readfwd File::ReadBackwards  rawio
      Tie::File             2.50/s        --    -94%               -100%  -100%
      readfwd               43.7/s     1650%      --                -95%  -100%
      File::ReadBackwards    871/s    34777%   1893%                  --   -94%
      rawio                14917/s   597126%  34023%               1612%     --

      Comparing data/2MB.dat
                              Rate Tie::File readfwd File::ReadBackwards  rawio
      Tie::File             1.24/s        --    -95%               -100%  -100%
      readfwd               23.6/s     1797%      --                -98%  -100%
      File::ReadBackwards   1051/s    84542%   4363%                  --   -93%
      rawio                14894/s  1198858%  63119%               1316%     --

      There's no real magic about why the differences are so great. Tie::File and the readfwd cases have to read the entire file to get the last line. Additionally, Tie::File does a huge amount of work under the covers, buffering the whole file through a limited buffer space and a hash. That extra work is incredibly useful when you are using the module for the purposes for which it was designed, but this is the wrong purpose.
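
      As a small illustration of the purpose Tie::File is designed for, random access and in-place edits (filename hypothetical):

          use strict;
          use Tie::File;

          # Edit the third line of a file in place, without slurping
          # the file or rewriting it by hand.
          tie my @lines, 'Tie::File', 'config.txt' or die $!;
          $lines[ 2 ] =~ s/foo/bar/;
          untie @lines;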

      File::ReadBackwards skips to the end of the file and (unsurprisingly :) reads backwards in a similar fashion to the rawio case, but it carries the overhead of tie. It is also properly coded to handle the IO in a cross-platform manner and to handle any length of line, rather than relying on a hardcoded maximum line length and assuming that "\n" will do the 'right thing', as my crude rawio case does.

      For production work where performance wasn't the ultimate criterion, I would use File::ReadBackwards in preference to trying to fix up the rawio case.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
        There's no need to do any "prerunning" to avoid penalties for a first run. If the first argument to timethese (and hence, to cmpthese) is negative, Benchmark will run the code for at least that number of seconds. But in order to know how many times the code needs to be run, it will first run the code several times to get an indication of how often it needs to run to satisfy the requirement. So any first-run penalties have already been paid.
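
        A minimal sketch of that usage (the case name and data are invented):

            use Benchmark qw[ timethese ];

            my @data = ( 1 .. 1000 );

            # A negative count asks Benchmark to run each case for at
            # least 3 CPU seconds; the calibration runs it makes first
            # mean any one-off cost (such as filling the filesystem
            # cache) is paid before timing starts.
            timethese( -3, {
                grep_evens => sub { my @evens = grep { !( $_ % 2 ) } @data },
            } );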

        Of course, if there's a significant difference between a first run and any subsequent runs, the use of the Benchmark module isn't very useful anyway.

        Abigail