Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am opening a file and counting pattern matches. Is this the most efficient way to do it?
open(DATA, "$errlog") || die "Can not open: $!\n"; my @dat = (<DATA>); close(DATA); open(DATA, ">>$errlog") || die "NO GO: $!\n"; foreach (@dat) { if($_ =~ /patternmatchinghere/gi) { $counter++; } } close(DATA);

Re: Opening file and checking for data
by tcf22 (Priest) on Jul 03, 2003 at 13:24 UTC
    Grep would probably be a better way to do this.
    my $count = grep /match/, <DATA>;
    print $count;

    __DATA__
    match123123
    match3123
    nomat
    not34234
    match 4232434
    This outputs 3.
    Or my golfing solution (a little shorter):
    print scalar(grep /match/, <DATA>);

    Update: Fixed my grammar.
Re: Opening file and checking for data
by gellyfish (Monsignor) on Jul 03, 2003 at 13:33 UTC

    Well, of course, benchmarking may prove me wrong, but I would have thought that slurping the contents into an array and then iterating over the array is going to be less efficient than simply reading the file line by line, like:

    ...
    while (<DATA>) {
        ...
    }
    You probably also want to avoid using DATA as a filehandle, as this is a predefined handle, set up when Perl initializes, that points to the stuff after an __END__ or __DATA__ at the end of the program. It doesn't break anything, but it might be confusing to someone reading the program later.
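    A minimal sketch putting both points together (line-by-line reading, with a lexical filehandle in place of DATA; $errlog and the pattern are carried over from the original post):

    my $counter = 0;
    open(my $fh, '<', $errlog) or die "Can not open: $!\n";
    while (<$fh>) {
        $counter++ if /patternmatchinghere/i;   # count lines that match
    }
    close($fh);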

    /J\
    

      I haven't done any benchmarking either, but I've seen benchmarks of this in the past. Slurping is by far faster than processing line-by-line. The reason is that doing I/O in a single operation is faster than doing it a bit at a time, since you don't have to worry about things like resetting the drive head to the correct position.

      Naturally, you have to worry about memory limitations, so there is really only a limited range of cases where slurping is worth it. If your file is small, you won't see much of a speed gain. If it's too large to fit in memory, you'll end up swapping to the hard disk and will thus lose any benefits from slurping.

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      Note: All code is untested, unless otherwise stated

        Perl's I/O is buffered, so it does one I/O for every disk block regardless of which method you use.
        #!/usr/bin/perl
        if ($ARGV[0] eq 'line') {
            print "Line-at-a-time\n";
            while (<STDIN>) {
                print if /perl/;
            }
        }
        else {
            print "All at once\n";
            my @arr = (<STDIN>);
            foreach (@arr) {
                print if /perl/;
            }
        }
        On my system, the block size is 4096 bytes. On an 8K file with 128 lines, we see:
        $ strace -e read /tmp/t29 line </tmp/t29.8192 >/dev/null
        ...
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "", 4096)                       = 0
        
        $ strace -e read /tmp/t29 slurp </tmp/t29.8192 >/dev/null
        ...
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "", 4096)                       = 0
        
        Still, each call to the diamond operator takes some time, so slurping is probably still faster, but not because of I/O.

        And of course the caching algorithms of your hard drive and OS will have a big influence as well. If the file is small enough to be read in one go, slurping will not add any big speed benefit, but it will increase the memory load.

        And if the file is rather large, it may crowd out other items in your cache and slow down other programs: TANSTAAFL, as Heinlein said.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Opening file and checking for data
by halley (Prior) on Jul 03, 2003 at 13:58 UTC
    Nobody so far has mentioned the /g modifier on your regular expression check. If you're just looking for a yes/no on each line, then you can drop the /g. If you want to count multiple occurrences in the same line (such as the 'l's in 'hello wally'), then you will need to do something like capture the results of m/.../g in an array and count its elements.
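    A minimal illustration of that per-line counting, using the example string above:

    my @hits = ('hello wally' =~ /l/g);  # /g in list context returns every match
    print scalar @hits;                  # prints 4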

    --
    [ e d @ h a l l e y . c c ]

Re: Opening file and checking for data
by Zaxo (Archbishop) on Jul 03, 2003 at 14:46 UTC

    To count multiple matches in a line, you can use the fact that m//g returns all the matches in list context. That means you can just say $counter += () = /patternmatchinghere/gi; to do the counting.

    I doubt if your error log is appended to your script after __END__ or __DATA__, so you should avoid the *DATA handle.

    My rewrite:

    my $counter = 0;
    open LOG, "< $errlog" or die 'Can not open: ', $!;
    $counter += () = /patternmatchinghere/gi while <LOG>;
    close LOG;
    I've omitted opening $errlog to append, since I don't see what it does for you in this snippet.

    Update: changed for to while to avoid slurping.

    After Compline,
    Zaxo

      Thanks for all the replies.
Re: Opening file and checking for data
by DBX (Pilgrim) on Jul 03, 2003 at 14:07 UTC
    If you are using a pattern in your regular expression that will not change on each iteration of your loop, consider adding the /o modifier, like so:
    if($_ =~ /patternmatchinghere/gio)
    This tells Perl to compile the regular expression only once, instead of recompiling it on every loop iteration. On a large amount of data, this can speed up your code significantly.
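    Note that /o only makes a difference when the pattern interpolates a variable; a literal pattern such as the one above is compiled just once anyway. A minimal sketch of the case where /o matters ($pattern and $fh are illustrative names):

    my $pattern = 'patternmatchinghere';
    while (<$fh>) {
        $counter++ if /$pattern/io;  # /o: interpolate and compile $pattern only once
    }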

      /o is dead, long live qr//!
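
      A minimal sketch of the qr// equivalent, which compiles the pattern once up front ($re and $fh are illustrative names):

      my $re = qr/patternmatchinghere/i;  # compile once, reuse everywhere
      while (<$fh>) {
          $counter++ if /$re/;
      }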

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      Note: All code is untested, unless otherwise stated

        Good point. I honestly forget about qr// most of the time, but this was a good reminder.
Re: Opening file and checking for data
by hmerrill (Friar) on Jul 03, 2003 at 13:40 UTC
    I can't see anything wrong with your code, and tcf22's example looks fine too. Although I've never used it myself, the Benchmark module might be helpful for comparing which of a few different approaches is most efficient. Do
    perldoc Benchmark
    at a command prompt to see how to use it.

    HTH.
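
    A minimal sketch of a Benchmark comparison of the looping and grep approaches (the sample data and label names are made up for illustration):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @lines = ("match foo\n", "no hit here\n") x 500;

    cmpthese(-2, {    # run each sub for at least 2 CPU seconds
        loop => sub { my $c = 0; for (@lines) { $c++ if /match/ } },
        grep => sub { my $c = grep /match/, @lines },
    });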
      In the Perl Cookbook, recipe 3.9, 'High-Resolution Timers', looks pretty useful for timing code. It uses the Time::HiRes module, which comes standard with Perl 5.8:
      use Time::HiRes qw(gettimeofday);
      $t0 = gettimeofday;
      ### your code here ###
      $t1 = gettimeofday;
      $elapsed = $t1 - $t0;
      # $elapsed is a floating point value, representing the
      # number of seconds between $t0 and $t1
      then do the same thing for the *other* way, and see which one takes the least amount of time.

        If you want to benchmark your code, use Perl's Benchmark module. There's no need to reinvent it. Super Search should find numerous examples of how to use it.

        Update: I see you have already mentioned Benchmark. In future, I must read more carefully before replying. Bad me.

Re: Opening file and checking for data
by msemtd (Scribe) on Jul 04, 2003 at 13:20 UTC
    I'm surprised that nobody has yet mentioned not using Perl at all! What you want to do is possible with GNU grep from the command line...
    grep --count thepattern whateverfile
    Of course, this may not meet your requirements - notably, it returns a count of matching lines per file rather than of individual pattern matches. YMMV
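
    If you do want individual matches rather than matching lines, one way (assuming GNU grep, whose -o flag prints each match on its own line) is to pipe through wc:

    grep -o thepattern whateverfile | wc -l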