Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I just found this in the FAQ:
How do I count the number of lines in a file?

One fairly efficient way is to count newlines in the file. The following program uses a feature of tr///, as documented in perlop. If your text file doesn't end with a newline, then it's not really a proper text file, so this may report one fewer line than you expect.


$lines = 0;
open(FILE, $filename) or die "Can't open `$filename': $!";
while (sysread FILE, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}
close FILE;

This may well be a stupid question but why might that be preferred to this:
open(X, $filename); while (<X>) {} $lines = $.;

--
“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
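
The FAQ's caveat about files that don't end in a newline is where the two approaches can disagree. The following is only a minimal sketch to illustrate that: the helper names count_newlines and count_records are made up for this example, and it reads from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Count newline characters, as the FAQ's tr/\n// approach does.
sub count_newlines {
    my ($text) = @_;
    return ($text =~ tr/\n//);
}

# Count records the way a while (<FH>) loop and $. do,
# reading from an in-memory filehandle (hypothetical stand-in for a file).
sub count_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    1 while <$fh>;          # read and discard every line
    my $records = $.;       # number of the last record read
    close $fh;
    return $records;
}

my $unterminated = "one\ntwo\nthree";   # no newline after the last line
printf "tr: %d, \$.: %d\n",
    count_newlines($unterminated), count_records($unterminated);
```

For text whose last line has no trailing newline, the tr count comes out one lower than $. because the final, unterminated line is still returned as a record by the read loop.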

Re: How To Count Lines In File?
by jmcnamara (Monsignor) on Jan 27, 2003 at 00:00 UTC

    The FAQ method is more efficient because it is reading in fixed block sizes which it doesn't have to split into lines.

    Here is a benchmark to show the difference. I added wc for comparison:

    $ time wc -l bigfile
    91420 bigfile
    real 0m0.016s
    user 0m0.010s
    sys  0m0.006s

    $ time perl faq.pl bigfile
    91420
    real 0m0.032s
    user 0m0.027s
    sys  0m0.004s

    $ time perl cody.pl bigfile
    91420
    real 0m0.105s
    user 0m0.098s
    sys  0m0.008s

    Here is the way I usually don't do it; it is twice as slow as the slowest method above:

    perl -le 'print $==()=<>' file

    --
    John.

    Update: Added benchmark.

Re: How To Count Lines In File?
by tomhukins (Curate) on Jan 27, 2003 at 00:13 UTC

    Here's an answer to your question, instead of yet another way to count the number of lines in a file. ;-)

    The solution mentioned in the FAQ runs much faster. Run the following code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethese);

    my $filename = '/usr/share/dict/words';

    timethese(100, {
        'read_block' => sub {
            open(FILE, $filename) or die "Can't open file: $!";
            my $lines = 0;
            while (read FILE, my $buffer, 4096) {
                $lines += ($buffer =~ tr/\n//);
            }
            close FILE;
        },
        'read_line' => sub {
            open(FILE, $filename) or die "Can't open file: $!";
            while (<FILE>) {};
            my $lines = $.;
            close FILE;
        }
    });

    So, why does this happen? Well, the read_line approach above has to find every line ending and hand each line back as a separate string, even though the loop body throws it away. The read_block approach just pulls in a fixed-size block at a time and counts the newlines in it with a single tr, so it does far less work per byte inside the Perl process.

    The significance of 4096 is that disk block sizes are usually some multiple of 1024 bytes, so reading complete blocks helps the code run faster than if it were to read partial blocks.

Re: How To Count Lines In File?
by Abigail-II (Bishop) on Jan 27, 2003 at 00:28 UTC
      Very interesting!

      So, the meaning of -p is implemented by textually wrapping the loop around the actual code, so beginning with "}" will close the -p's while loop, and the final "{" will balance the closing brace at the end of the expansion.

      I never thought about that. I always just figured that built-in looping construct was done on a syntactic boundary, like:

      while (<>) { eval $option_e; print }
      only done in the parse-tree level, not by simply generating more source text!

      So... that's not documented or to be relied on, right?

      —John

        `perl --help' or `perldoc perlrun'


        MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
        ** The Third rule of perl club is a statement of fact: pod is sexy.

      Hi,

      Does the code you have here work only on Unix? On Windows I got the contents of the file instead of the number of lines. Please explain whether this is for Unix alone, or why it acts this way on Windows.
        The command shell in Windows can't handle the single quotes (') shown in that one-liner. Change them to double quotes (") and it should work for you.
        perl -lpe "}{*_=*.}{" file


        ryddler
      Is there any gain in using globs over the variables themselves, such as:
      perl -lpe '}{$_=$.' file
      ?
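
      For anyone following along, here is a hand-expanded sketch of roughly what the one-liner just above turns into once -p and -l have wrapped it. It is an approximation written out by hand, not perl's literal generated source; the temporary sample file and the $printed variable are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical three-line sample file, created only for this demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "one\ntwo\nthree\n";
close $tmp or die "Can't close temp file: $!";

@ARGV = ($path);    # stands in for the file named on the command line
my $printed;        # captures what -p's implicit print would emit

LINE: while (<>) {
    chomp;          # roughly what -l adds to the loop
}                   # <- the one-liner's leading "}" ends the loop here
{                   # <- its "{" opens a bare block that runs once
    $_ = $.;        # $. still holds the number of the last line read
    $printed = $_;  # -p would now print $_ (plus a newline, because of -l)
}

print "$printed\n";
```

      Because the loop body is closed early, nothing is printed per line; the single print happens only after the input is exhausted.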

      On a similar note, last semester my C++ students had to write a small C++ program that averages numbers from a flat file (one per line). I wowed them with this little doozy:

      perl -lpe '$s+=$_}{$_=$s/$.' file
      UPDATE
      That's a fair cop! :)

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
        Is there any gain in using globs over the variables themselves

        If you can answer the question "is there any gain in using perl -ple '}{*_=*.}{' file over wc -l file", you can figure out the other question yourself.

        Abigail

Re: How To Count Lines In File?
by Aristotle (Chancellor) on Jan 27, 2003 at 02:55 UTC
    Instead of the FAQ's rather verbose loop, you can achieve the same effect using
    { local ($/, $_) = (\4096); $lines += tr/\n// while <$fh>; }
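
    Spelled out as a self-contained sub, the same record-size trick might look like the sketch below. The name count_lines, the lexical filehandle, and the temporary demo file are assumptions for illustration; the core idea (setting $/ to a reference to an integer so that <$fh> returns fixed-size chunks) is the one used above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count newlines by reading fixed-size records instead of lines.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    local $/ = \4096;               # <$fh> now returns chunks of up to 4096 bytes
    my $lines = 0;
    while (my $chunk = <$fh>) {     # perl adds the defined() test implicitly
        $lines += ($chunk =~ tr/\n//);
    }
    close $fh;
    return $lines;
}

# Hypothetical demo file with 10_000 short lines.
my ($tmp, $demo_path) = tempfile(UNLINK => 1);
print {$tmp} "line\n" x 10_000;
close $tmp or die "Can't close temp file: $!";

print count_lines($demo_path), "\n";
```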

    Makeshifts last the longest.

Re: How To Count Lines In File?
by John M. Dlugosz (Monsignor) on Jan 27, 2003 at 06:01 UTC
    Funny this should come up—I was just thinking about the first time I ever used Perl, and contemplated writing a Meditation on it.

    My first Perl program ever was very much your simple way to count lines. Perhaps I didn't even know about $. and incremented a counter in the body of the loop.

    The interesting part is: I never got a result. It ran so slowly on my PC (a 16 MHz 80386SX, I believe) that I killed it when it was taking too long to finish.

    I was disappointed that it ran so slowly and was not useful. But, when Perl was young and AWK was all the rage, didn't all computers have speeds in that order of magnitude? Perhaps the disk IO was eating it alive due to a poor or immature 32-bit environment that had to trap to real-mode DOS on every file-read call.

    So... when Perl was young and the FAQ was being written, perhaps the buffer-at-a-time approach was significantly better performing. The logic of <FILE> to read up to the next newline might have been primitive in the early days, and changed when memory became cheap and buffers were no big deal.

    —John

Re: How To Count Lines In File?
by broquaint (Abbot) on Jan 27, 2003 at 12:43 UTC
    In the spirit of TIMTOWTDI
    perl -e 'print @{[<>]}.$/' some_file_here

    HTH

    _________
    broquaint

      You slurp the file though - and you still pay the read-line-by-line penalty. Here's one that slurps as well but counts using tr.
      $ perl -lp0777e'$_=tr/\n//' foo

      Makeshifts last the longest.

Re: How To Count Lines In File?
by ibanix (Hermit) on Jan 26, 2003 at 23:16 UTC
    If you're on a Unix box, or a Win32 box with the GNU tools installed, you can always
    my $lines = `wc -l $filename`; $lines =~ /(\d+)\s/; $lines = $1;


    Cheers,
    ibanx

    $ echo '$0 & $0 &' > foo; chmod a+x foo; foo;
      But you shouldn't. You're checking neither whether your attempt to run wc succeeded nor whether the pattern matched. Under these circumstances, $lines may wrongly be assigned whatever was in $1 from a previous match. Don't use $1 blindly. Always guard your matches with an if or assign the captures to a list. See Gilimanjaro's snippet.
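
      A guarded version could look something like the sketch below. It assumes a Unix-like system with wc on the PATH; the sub name wc_lines and the temporary demo file are made up for the example. It uses the list form of a piped open so the filename never passes through a shell, checks wc's exit status via close, and assigns the capture to a list instead of reading $1.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Ask wc for the line count, checking every step (assumes wc is installed).
sub wc_lines {
    my ($filename) = @_;

    # List-form pipe open: no shell is involved, so odd filenames are safe.
    open my $wc, '-|', 'wc', '-l', $filename
        or die "Can't run wc: $!";
    my $output = <$wc>;

    # close returns false if wc exited with a non-zero status.
    close $wc or die "wc failed for '$filename' (status $?)";
    defined $output or die "wc produced no output for '$filename'";

    # Assign the capture to a list; the assignment is false if nothing matched.
    my ($lines) = $output =~ /(\d+)/
        or die "Unexpected wc output: $output";
    return $lines;
}

# Hypothetical three-line demo file.
my ($tmp, $demo_path) = tempfile(UNLINK => 1);
print {$tmp} "one\ntwo\nthree\n";
close $tmp or die "Can't close temp file: $!";

print wc_lines($demo_path), "\n";
```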

      Makeshifts last the longest.

      I think I should point out, before Monks chime in with 347 different ways to count the lines in a file, that my question is, more specifically, "why is this featured in the FAQ as the suggested way?" Is it better, and if so in what way: file I/O, speed, what?


      --
      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

      Or even:

      my ($lines) = `wc -l $filename` =~ /(\d+)/;
Re: How To Count Lines In File?
by Anonymous Monk on Jan 27, 2003 at 12:43 UTC
    sysread of a fixed block (adjusting the 4096 argument to match the block size used by your OS provides an additional optimization, BTW) is more efficient than asking perl to do that in the background and then scan until \n into sub-buffers, returning them one at a time. tr just scans once straight through the whole buffer and returns the count as a side effect.
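
    One way to pick that block size, sketched here under assumptions: field 11 of stat is the filesystem's preferred I/O block size on systems that report it (it can be empty elsewhere, hence the 4096 fallback). The sub name count_lines_fs_block and the temporary demo file are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count newlines with sysread, sized to the filesystem's preferred block size.
sub count_lines_fs_block {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open '$file': $!";
    binmode $fh;

    # (stat)[11] is blksize; fall back to 4096 where it is not reported.
    my $block_size = (stat $fh)[11] || 4096;

    my ($lines, $buffer) = (0, '');
    while (sysread $fh, $buffer, $block_size) {
        $lines += ($buffer =~ tr/\n//);
    }
    close $fh;
    return $lines;
}

# Hypothetical demo file with 5_000 short lines.
my ($tmp, $demo_path) = tempfile(UNLINK => 1);
print {$tmp} "sample\n" x 5_000;
close $tmp or die "Can't close temp file: $!";

print count_lines_fs_block($demo_path), "\n";
```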
Re: How To Count Lines In File?
by Limbic~Region (Chancellor) on Jan 27, 2003 at 18:11 UTC
    Your question was why the FAQ solution was preferred to another method that you described - other people have already answered and given benchmark data to support it.

    The only reason I am contributing to what seems like a complete thread is because I ran into a similar dilemma that had real-world impact. In trying to solve my problem (which is too long to get into here), I decided to check out the Unix Reconstruction Project at Perl Power Tools and found that tcgrep was blazing fast in comparison to anything I was capable of writing.

    I spent several days ripping out lines of code until I found what I was looking for and splicing it into my code with a few more optimizations for my very specific environment and was able to actually beat the compiled Unix grep (albeit in a very specific race).

    So - if you are trying to count lines, words, characters, paragraphs, or a few other things - I would suggest checking out this.

    UPDATE: PPT's port of wc does not use the ultra streamlined version of counting lines in a file, but it does offer all kinds of other support such as UTF support for word counting, etc - that is why I felt it was worth mentioning!

    Cheers - L~R

Re: How To Count Lines In File?
by pg (Canon) on Jan 27, 2003 at 18:00 UTC
    I am thinking that you may want to play with that magic number 4096. Based on the combination of your OS and platform, you might find that a different magic number fits better. You can try something like 4096 * 2, 4096 * 4, 4096 * 8, etc.

    My guess is that, as long as you don't cause paging, a bigger number would improve performance more. But if the number is big enough to cause paging, then it would hurt you instead.
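
    If you want to try this yourself, a rough Benchmark sketch along these lines should do. The sub name count_lines_blockwise, the generated temporary file, and the iteration count of 20 are all assumptions picked for illustration; point it at a large file of your own for meaningful numbers.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Temp qw(tempfile);

# Hypothetical test data: 200_000 short lines in a temporary file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "some sample text\n" x 200_000;
close $tmp or die "Can't close temp file: $!";

# Count newlines using sysread with a caller-supplied block size.
sub count_lines_blockwise {
    my ($file, $block_size) = @_;
    open my $fh, '<', $file or die "Can't open '$file': $!";
    binmode $fh;
    my ($lines, $buffer) = (0, '');
    while (sysread $fh, $buffer, $block_size) {
        $lines += ($buffer =~ tr/\n//);
    }
    close $fh;
    return $lines;
}

# Time the same count with block sizes of 4096 * 1, 2, 4, 8 and 16.
timethese(20, {
    map {
        my $size = 4096 * $_;
        ("block_$size" => sub { count_lines_blockwise($path, $size) });
    } 1, 2, 4, 8, 16
});
```

    Results will depend heavily on the OS, the filesystem cache and the file size, so treat any single run with suspicion.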