Counting text strings

vector40 has asked for the wisdom of the Perl Monks concerning the following question:

Hey, folks; new Perl user, just tinkering. This strikes me as something that should be easy, but it's behaving peculiarly.

What I want is a brief script that takes a text file, sifts through it according to a regex string, and counts off how many times it finds said string. It prints this, telling me how many times it finds it (I've got it set up so that it reports back after each find, but it doesn't much matter; you could wait until it's done).

Observe: (beware, lots of painfully explicit stuff in here to aid debugging)

#!/usr/bin/perl -w
use strict;
use diagnostics;

print "Do some regex now, shall we?\n";

print "Give me a file. Full path, please.\n";

chomp(my $filepath = <STDIN>);

print "The filename is $filepath\n";

open FILE, '<', $filepath or die "Couldn't open it. Error: $!\n";

my $count = 0; # Initialize the variable, just to be safe

while (<FILE>) {
    if (/foo/) {
        $count++;
        print "$count so far.\n";
    }
}

print "Program exiting\n";

Here's the problem. This sort of works... but only if each instance of the string is on a different line. This makes sense, I suppose; after all, it's taking each line individually. But if it's got four occurrences of the string on one line, then it will count them all as one, which isn't too helpful.

Thoughts?

Thanks,

- Brandon


Replies are listed 'Best First'.
Re: Counting text strings by Enlil (Parson) on Mar 04, 2003 at 05:58 UTC
You could change this code: `while (<FILE>) { if (/foo/) { $count++; print "$count so far.\n"; } }` to something like this: `while (<FILE>) { while (/foo/g) { $count++; print "$count so far.\n"; } }`

Update: or something like this: `while (<FILE>) { $count++ while /foo/g; } print $count;`

-enlil
Re: Re: Counting text strings by l2kashe (Deacon) on Mar 04, 2003 at 06:30 UTC
Or something slightly different, yet the same: `while ( <FILE> ) { $count += $_ =~ s/foo/foo/g; print "$count so far\n"; }`

If you do `$count = $_ =~ s/foo/foo/g;`, `$count` will hold how many replacements happened on that line. Instead of using a temp variable to hold it, I simply `+=` it as we go. I should probably benchmark it to see how it compares, but it's late.

/* And the Creator, against his better judgement, wrote man.c */
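The `s/foo/foo/g` counting idiom can be tried without a real file. This is only a minimal sketch: the sub name `count_by_subst` and the sample lines are made up for illustration and do not come from the thread. It relies on `s///g` returning the number of substitutions made, or a false value that behaves as 0 in addition when nothing matched.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of "foo" in a list of lines
# by adding up the return value of s/foo/foo/g for each line.
sub count_by_subst {
    my @lines = @_;    # work on copies, because s/// modifies its target
    my $count = 0;
    for my $line (@lines) {
        # s///g returns the number of substitutions made, or a false
        # value (numerically 0) when there was no match.
        $count += $line =~ s/foo/foo/g;
    }
    return $count;
}

# Made-up sample data standing in for the lines of a file.
my @sample = ("foo bar foo\n", "nothing here\n", "foofoofoo\n");
print count_by_subst(@sample), " so far\n";
```

Because the substitution rewrites each line (with identical text), this approach does more work than a plain match; whether that matters in practice would need the benchmark mentioned above.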
Re: Re: Counting text strings by vector40 (Initiate) on Mar 04, 2003 at 06:09 UTC
Hey, wow -- cool stuff, works perfectly. Great response time :) Can you explain why this made it work? I think I get why the global bit was necessary for the regex (otherwise, it wouldn't check the whole file); however, why did using while instead of if make a difference?
Re: Re: Re: Counting text strings by Enlil (Parson) on Mar 04, 2003 at 06:46 UTC
Another way you could do this: `while (<FILE>) { $count += () = m/foo/g } print $count;`

But back to the way this works: `while (<FILE>) { while (/foo/g) { $count++; print "$count so far.\n"; } }` The inner `while` keeps evaluating `$_ =~ /foo/g` (which is the lengthier version of `/foo/g`) until it has no more matches. So it will find the first foo, stop, update the count, and then return to the search beginning where it left off, and so forth until it finds no more matches, adding one to the count each time it does find one. (I hope I am not confusing you further.)

The `if` does not work because it only cares whether the expression in the parens returns a true or false value. Strings with foo are true: add one to your count and on to the next line. (Strings without foo are false, of course.)

You could do something like the following: `my $count = 0; while (<FILE>) { if ( my $n = s/foo/foo/g ) { $count += $n; print "$count so far\n" } }`

Here `s/foo/foo/g` returns the number of times it replaces foo with itself. That number is what the `if` checks for truth (so a line like "foo foo foo" would be evaluated as `if (3)`), and it is then added to the count. Testing `$count += s/foo/foo/g` directly would not do this: an assignment evaluates to the new value of `$count`, so the test would stay true on every line after the first match.

Anyhow, I hope I was some help; it is late and I tend to ramble. Oh yeah, ++ for asking for an explanation instead of just taking code without understanding it (and having to explain it oftentimes makes me wonder if I understand it myself).

-enlil
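The "return to the search where it left off" behaviour described above can be observed with `pos()`, which reports where the last scalar-context `/g` match ended. The following is a rough sketch under assumed names: `match_ends` and `count_matches` are hypothetical helpers, not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset just past each match of "foo",
# showing that every scalar-context /g match resumes where the
# previous one stopped.
sub match_ends {
    my ($line) = @_;
    my @ends;
    while ($line =~ /foo/g) {
        push @ends, pos($line);    # pos() is the offset just past the match
    }
    return @ends;
}

# Hypothetical helper: count all matches at once. The "() =" forces the
# match into list context, and the list assignment read as a scalar
# yields the number of matches.
sub count_matches {
    my ($line) = @_;
    my $n = () = $line =~ /foo/g;
    return $n;
}

my $line = "foo foo foo";
print join(",", match_ends($line)), "\n";    # one offset per "foo" found
print count_matches($line), "\n";            # total number of matches
```

An `if (/foo/)` in place of the `while` would stop after the first match, which is why the original script counted at most one hit per line.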
Re: Re: Re: Re: Counting text strings by vector40 (Initiate) on Mar 04, 2003 at 07:55 UTC
Re: Re: Re: Counting text strings by l2kashe (Deacon) on Mar 04, 2003 at 06:38 UTC
What he did is perform another loop, with each iteration happening every time the regex matched in the string. A la:

my @test = ('foobarfoo', 'blahbarblah', 'bazbarbarbarbaz');
for (@test) {
    # $_ has foobarfoo or blahbarblah or bazbarbarbarbaz
    while (/bar/g) {
        # matched bar once in foobarfoo: inc by 1
        # matched bar once in blahbarblah: inc by 1
        # matched bar 3 times in bazbarbarbarbaz: inc by 1, 3 times
        $count++;
        print "$count so far\n";
    }
}

Edit: Sorry for replicating information; Enlil and I were probably typing at the same time.

/* And the Creator, against his better judgement, wrote man.c */
Re: Counting text strings by Hofmator (Curate) on Mar 04, 2003 at 11:19 UTC
... if you are really only interested in the number of occurrences and the files you are dealing with aren't extremely large, you could also read in the whole file at once. The following assumes that the filename was given as an argument on the command line. `use strict; use warnings; my $file = do { local $/; <> }; my $count = 0; ++$count while $file =~ /foo/g; print $count;`

A few explanations: `do` sets up a block and evaluates to the last statement in this block. Then we set the global variable `$/` locally to undef (that means its former value is automatically restored when the block is exited). Setting this value to undef puts Perl into slurp mode, i.e. it reads to the end of the file. The magic `<>` opens the command line arguments one at a time and (in scalar context) returns one line, or in slurp mode the whole file. Voila, `$file` now contains the whole contents. Enlil already gave a good explanation of the next lines, and this works the same way for one line and for a whole file. In this context it might be interesting for you to look up the /m and /s modifiers for regexes.

-- Hofmator
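The slurp-mode approach can be tried without command-line arguments by opening a filehandle on a string. This is a sketch only: `count_in_handle` and the sample data are assumptions made for illustration, and it reads from an explicit handle rather than the magic `<>` used above.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp everything from an already-open filehandle
# and count the matches of "foo" in the resulting string.
sub count_in_handle {
    my ($fh) = @_;
    # $/ is undef only inside the do block, so the read returns all
    # remaining input in one go (slurp mode).
    my $text = do { local $/; <$fh> };
    $text = '' unless defined $text;    # nothing left to read gives undef
    my $count = 0;
    ++$count while $text =~ /foo/g;
    return $count;
}

# Open a filehandle on a string instead of a real file for the demo.
my $data = "foo bar\nfoo foo\nbaz\n";
open my $fh, '<', \$data or die "Couldn't open in-memory file: $!\n";
print count_in_handle($fh), "\n";
close $fh;
```

Since the whole input is held in one string, matches are counted across line boundaries in a single pass; the trade-off is memory use on very large files.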