vector40 has asked for the wisdom of the Perl Monks concerning the following question:

Hey, folks; new Perl user, just tinkering. This strikes me as something that should be easy, but it's behaving peculiarly.

What I want is a brief script that takes a text file, sifts through it according to a regex, and counts how many times the pattern matches. It prints the total, telling me how many times it found it (I've got it set up so that it reports back after each find, but it doesn't much matter; you could wait until it's done).

Observe: (beware, lots of painfully explicit stuff in here to aid debugging)

    #!/usr/bin/perl -w
    use strict;
    use diagnostics;

    print "Do some regex now, shall we?\n";
    print "Give me a file. Full path, please.\n";
    chomp(my $filepath = <STDIN>);
    print "The filename is $filepath\n";

    open FILE, "$filepath" or die "Couldn't open it. Error: $!\n";

    my $count = 0;    # Initialize the variable, just to be safe
    while (<FILE>) {
        if (/foo/) {
            $count++;
            print "$count so far.\n";
        }
    }
    print "Program exiting\n";

Here's the problem. This sort of works... but only if each instance of the string is on a different line. This makes sense, I suppose; after all, it's taking each line individually. But if there are four occurrences of the string on one line, it counts them all as one, which isn't too helpful.

Thoughts?

Thanks,

- Brandon

Replies are listed 'Best First'.
Re: Counting text strings
by Enlil (Parson) on Mar 04, 2003 at 05:58 UTC
    You could change this code:
        while (<FILE>) {
            if (/foo/) {
                $count++;
                print "$count so far.\n";
            }
        }
    to something like this:
        while (<FILE>) {
            while (/foo/g) {
                $count++;
                print "$count so far.\n";
            }
        }
    update: or something like this:
        while (<FILE>) {
            $count++ while /foo/g;
        }
        print $count;

    -enlil

      Or something slightly different, yet the same:
          while ( <FILE> ) {
              $count += $_ =~ s/foo/foo/g;
              print "$count so far\n";
          }
      If you do $count = $_ =~ s/foo/foo/g; then $count holds the number of replacements made on that line. Instead of using a temp variable to hold it, I simply += it as we go. I should probably benchmark it to see how it compares, but it's late.

      /* And the Creator, against his better judgement, wrote man.c */

      Hey, wow -- cool stuff, works perfectly. Great response time :)

      Can you explain why this made it work? I think I get why the global bit was necessary for the regex (otherwise, it wouldn't check the whole file); however, why did using while instead of if make a difference?

        Another way you could do this:
            while (<FILE>) {
                $count += () = m/foo/g;
            }
            print $count;
        But back to the way this works:
            while (<FILE>) {
                while (/foo/g) {
                    $count++;
                    print "$count so far.\n";
                }
            }
        The inner while keeps evaluating $_ =~ /foo/g (the longer way of writing /foo/g) until there are no more matches. It finds the first foo, stops, updates the count, and then resumes the search from where it left off, adding one to the count for each match until none remain. (I hope I am not confusing you further.)
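
        To watch the "resumes where it left off" behaviour directly, here is a minimal sketch using Perl's built-in pos(), which reports where the last /g match on a string ended. The sub name match_end_offsets and the sample string are made up for illustration; they are not from the original script.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: collect the offset at which each /foo/g match ends.
        sub match_end_offsets {
            my ($line) = @_;
            my @offsets;
            # In scalar context, each evaluation of /foo/g resumes at pos($line)
            # and returns false (resetting pos) once no further match exists.
            while ( $line =~ /foo/g ) {
                push @offsets, pos($line);
            }
            return @offsets;
        }

        my @ends = match_end_offsets('foo bar foo baz foo');
        print scalar(@ends), " matches, ending at offsets @ends\n";
        ```

        For the sample string this should report 3 matches ending at offsets 3, 11 and 19, one per pass through the loop.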

        The if does not work because it only cares whether the expression in the parens returns a true or false value. A line containing foo is true, so it adds one to your count and moves on to the next line (a line without foo is false, of course). You could do something like the following:

            my $count = 0;
            while (<FILE>) {
                if ( my $n = s/foo/foo/g ) {
                    $count += $n;
                    print "$count so far\n";
                }
            }
        Here s/foo/foo/g returns the number of times it replaces foo with itself. That number is stored in $n and added to the count, and it is also the value the if checks for truth (so a line like "foo foo foo" is evaluated as if (3)). Note that testing $count += s/foo/foo/g directly would check the running total instead, which stays true on every line once the first match has been counted.
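
        A small sketch of why the return value of s///g can serve as both a count and a truth test. The sub name count_foo and the sample lines are assumptions for illustration only; the one fact relied on is that s///g returns the number of substitutions made, or a false value when nothing matched.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: count occurrences of foo in one line via s///g.
        sub count_foo {
            my ($line) = @_;    # work on a copy, since s/// modifies its target
            my $n = ( $line =~ s/foo/foo/g );
            return $n || 0;     # s///g yields a false value when nothing matched
        }

        my $count = 0;
        for my $line ( 'foo foo foo', 'no match here', 'one foo' ) {
            if ( my $n = count_foo($line) ) {    # true only when this line matched
                $count += $n;
                print "$count so far\n";
            }
        }
        ```

        With these three sample lines the loop prints a running total only for the first and third lines, since the second line yields a count of zero and the if skips it.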

        Anyhow, I hope I was some help; it is late and I tend to ramble.

        Oh yeah, ++ for asking for an explanation instead of just taking code without understanding it (having to explain it oftentimes makes me wonder whether I understand it myself).

        -enlil

        What he did is perform another loop, with one iteration happening every time the regex matches in the string...
        A la:
            my @test = ('foobarfoo', 'blahbarblah', 'bazbarbarbarbaz');
            my $count = 0;
            for (@test) {
                # $_ holds foobarfoo, then blahbarblah, then bazbarbarbarbaz
                while (/bar/g) {
                    # matched bar once in foobarfoo: inc by 1
                    # matched bar once in blahbarblah: inc by 1
                    # matched bar 3 times in bazbarbarbarbaz: inc by 1, 3 times
                    $count++;
                    print "$count so far\n";
                }
            }
        Edit: Sorry for replicating information; Enlil and I were probably typing at the same time.


        /* And the Creator, against his better judgement, wrote man.c */
Re: Counting text strings
by Hofmator (Curate) on Mar 04, 2003 at 11:19 UTC
    ... if you are really only interested in the number of occurrences and the files you are dealing with aren't extremely large, you could also read in the whole file at once. The following assumes that the filename was given as an argument on the command line.
        use strict;
        use warnings;
        my $file = do { local $/; <> };
        my $count = 0;
        ++$count while $file =~ /foo/g;
        print $count;
    A few explanations: 'do' sets up a block and evaluates to the last statement in this block. Inside it we set the global variable $/ locally to undef (which means its former value is automatically restored when the block is exited). Setting this value to undef puts Perl into slurp mode, i.e. it reads to the end of the file. The magic <> opens the command-line arguments one at a time and (in scalar context) returns one line, or in slurp mode the whole file. Voila, $file now contains the whole contents.

    Enlil already gave a good explanation of the next lines, and this works in the same way for one line and for a whole file. In this context it might be interesting for you to look up the /m and /s modifiers for regexes.
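
    As a hedged illustration of why /m and /s matter once the whole file sits in one string: /m lets ^ and $ match at every line boundary, and /s lets . match a newline. The sample text and the sub names count_line_starts and spans_lines are invented for this sketch.

    ```perl
    use strict;
    use warnings;

    # Stand-in for a slurped file: three lines in a single string.
    my $text = "foo one\nbar foo\nfoo two\n";

    # Count lines that begin with foo. Without /m, ^ matches only at the
    # very start of the string; with /m it also matches after each newline.
    sub count_line_starts {
        my ( $str, $multiline ) = @_;
        my $count = 0;
        if ($multiline) { $count++ while $str =~ /^foo/mg }
        else            { $count++ while $str =~ /^foo/g }
        return $count;
    }

    # Does "one ... two" match across line breaks? Only if . may match \n.
    sub spans_lines {
        my ( $str, $dotall ) = @_;
        return $dotall
            ? scalar( $str =~ /one.*two/s )
            : scalar( $str =~ /one.*two/ );
    }

    print count_line_starts( $text, 0 ), " line start(s) without /m\n";
    print count_line_starts( $text, 1 ), " line start(s) with /m\n";
    print spans_lines( $text, 1 ) ? "spans lines with /s\n" : "no span with /s\n";
    ```

    A plain /foo/g count is unaffected by either modifier; they only come into play when the pattern uses anchors or the dot.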

    -- Hofmator