joec_ has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a text log file which looks like:
LOGGING: 29-Aug-2008 09:30:17: STARTING FLUSH: khxz766
LOGGING: 29-Aug-2008 09:40:07: CLOSING FLUSH: khxz766

etc...

What I want to be able to do is count the number of times the STARTING lines occur in each month, say from 2008 back to 2006, and output the counts to stdout. It would also be useful to count the number of times each user (e.g. khxz766) started the application (flush) in each month and output that. Any help greatly appreciated. Thanks. Joe

Re: Extracting and counting string occurences
by ikegami (Patriarch) on Aug 29, 2008 at 08:47 UTC
    while (<$fh_log>) {
        my ($month, $user) = /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/
            or next;
        $flushes{$month}{$user}++;
        $flushes{$month}{TOTAL}++;
    }

      This will work fine if the input data can be completely relied on to be in exactly the form you expect.

      Two things can take a chunk out of your posterior: (a) the data is variable (in any way) -- anything generated by humans is suspect (of course), anything generated by systems written by humans may not be perfect (let's face it); (b) you didn't fully understand the problem. More insidious yet, nobody may notice that the output of this program is incorrect until something REALLY BAD happens.

      So, I'd do something like:

      if (/^LOGGING: \S+?-(\S+) \S+: (\S+) [^:]*: (\S+)$/) {
        my ($month, $what, $user) = ($1, $2, $3) ;
        if ($what eq 'STARTING') {
          $flushes{$month}{$user}++ ;
          $flushes{$month}{TOTAL}++ ;
        } ;
      }
      else {
        # Any error reporting you like, eg:
        chomp ;
        print STDERR "Line $.: '$_' ??\n" ;
      } ;
      This is not a lot of extra code. Of course, you could do much more validation of the input and/or be less strict about some things (eg white-space and case)
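A minimal sketch of the "less strict" variant, assuming that extra white space and mixed case are acceptable in this log. The sub name parse_flush_line and the sample lines are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (month, action, user) for a line that looks
# like a log record, or an empty list if the line does not match at all.
sub parse_flush_line {
    my ($line) = @_;

    # \s+ tolerates runs of white space, \s*$ tolerates trailing white
    # space (including the newline), and /i ignores case.
    if ( $line =~ /^\s*LOGGING:\s+\d+-(\w+-\d+)\s+\S+:\s+(\S+)\s+[^:]*:\s+(\S+)\s*$/i ) {
        # Normalise so that 'aug-2008' and 'Aug-2008' count as one month.
        return ( ucfirst( lc($1) ), uc($2), $3 );
    }
    return;
}

my %flushes;

# Made-up sample lines: two clean records, one sloppy one, one bad one.
my @sample = (
    "LOGGING: 29-Aug-2008 09:30:17: STARTING FLUSH: khxz766\n",
    "logging:  29-aug-2008 09:31:02:  Starting  Flush:  abc123 \n",
    "LOGGING: 29-Aug-2008 09:40:07: CLOSING FLUSH: khxz766\n",
    "this line is not a log record\n",
);

for my $line (@sample) {
    my ( $month, $what, $user ) = parse_flush_line($line);
    if ( !defined $month ) {
        # Report anything unrecognised rather than silently skipping it.
        chomp( my $copy = $line );
        print STDERR "Unrecognised line: '$copy'\n";
        next;
    }
    next unless $what eq 'STARTING';
    $flushes{$month}{$user}++;
    $flushes{$month}{TOTAL}++;
}
```

Whether to normalise case or to treat it as an error is a judgment call; the point is only that unmatched lines get reported instead of vanishing.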

      Who knows, you might be relieved when you discover the file contains:

      ....
      LOGGING: 31-Mar-2008 23:59:10: ENDING ON A HIGH: countryjoe
      LOGGING: 1-Apr-2008 00:01:22: STARTING FLUSH: gotcha
      ....
      before the auditors.

      Hi, thanks for the quick response. I am now wondering how I output the data. If I try:
      print $flushes{$month}{$user};
      print $flushes{$month}{TOTAL};

      outside the while loop, it just prints hundreds of random numbers.

      I am really new to Perl :)

      Thanks. Joe.

        Well, I recommend doing some reading ! His/Her grace, ikegami (Archbishop), used a hash (%flushes) to collect the results. Hashes are a wonderful thing, and worth getting to grips with.

        However, his/her grace did use a "two dimensional" hash, which is rather throwing you into the murkier part of the deep end.

        So, let's simplify this, a bit. Your first requirement was to count the number of whatever they were, on a monthly basis. The regular expression extracted the month name and year from the date into the string $month, for example 'Aug-2008'. You could use the hash %total_counts to count the totals for each month, thus: $total_counts{$month}++.

        To extract the totals for each month you could:

        while (my ($month, $total) = each(%total_counts)) {
          print "$month: $total\n" ;
        } ;
        where each(%total_counts) walks the hash and returns key and value pairs -- but in some apparently random order. You could:
        foreach my $month (sort keys %total_counts) {
          print "$month: $total_counts{$month}\n" ;
        } ;
        where keys %total_counts returns a list of all of the keys in %total_counts (in some apparently random order), which is then sorted alphabetically -- so the output is the counts so sorted. You probably want:
        foreach my $month (sort_months(keys %total_counts)) { ....
        where the subroutine sort_months() is left as an exercise -- taking in a list of month strings and returning a list in the required order.
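For anyone who wants to check their answer to the exercise, here is one possible sort_months(), offered as a sketch only. It assumes every key has the form Mon-YYYY with a capitalised three-letter English month, as in the log sample; the helper month_key and the sample counts are invented for the example.

```perl
use strict;
use warnings;

# Month abbreviations as they appear in the log, mapped to 1..12.
my %month_number = (
    Jan => 1, Feb => 2,  Mar => 3,  Apr => 4,
    May => 5, Jun => 6,  Jul => 7,  Aug => 8,
    Sep => 9, Oct => 10, Nov => 11, Dec => 12,
);

# Hypothetical helper: turns 'Aug-2008' into a single number that grows
# with time, so that months can be compared numerically.
sub month_key {
    my ($month) = @_;
    my ( $name, $year ) = split /-/, $month;
    return $year * 12 + $month_number{$name};
}

# Takes a list of month strings and returns them in chronological order.
sub sort_months {
    return sort { month_key($a) <=> month_key($b) } @_;
}

# Made-up totals, just to have something to print.
my %total_counts = ( 'Aug-2008' => 12, 'Jan-2008' => 3, 'Dec-2007' => 9 );

foreach my $month ( sort_months( keys %total_counts ) ) {
    print "$month: $total_counts{$month}\n";
}
```

Swapping $a and $b in the sort block gives newest first, which may suit the "2008 back to 2006" request better.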

        Your second requirement was to count for each user, on a monthly basis. So you want a hash, say %user_counts, with an entry for each user: $user_counts{$user}. Each entry needs to be a separate count for each month... now things are getting tricky.

        A hash entry is a single scalar value; you cannot have hash entries which are arrays or hashes. However, you can have scalars which are references to arrays or hashes. So, where the entries in %total_counts are simple (numeric) scalars, the entries in %user_counts are references to hashes, each hash being similar to %total_counts -- that is, the key is a month string (eg 'Aug-2008') and the value is the count for that month. The shorthand in Perl to refer to a count value in this structure is $user_counts{$user}{$month} -- meaning: (a) get the entry in %user_counts whose key is $user, (b) that entry refers to a hash, so get the entry in that hash whose key is $month.
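A small sketch of that structure being filled in, assuming the (user, month) pairs have already been extracted from the STARTING lines. The pairs in @events are made up for the example.

```perl
use strict;
use warnings;

my %user_counts;

# Made-up (user, month) pairs standing in for what the regular expression
# would extract from the log.
my @events = (
    [ 'khxz766', 'Aug-2008' ],
    [ 'khxz766', 'Aug-2008' ],
    [ 'khxz766', 'Jul-2008' ],
    [ 'abc123',  'Aug-2008' ],
);

for my $event (@events) {
    my ( $user, $month ) = @$event;

    # Perl creates the inner hash on first use (autovivification), so no
    # explicit setup is needed before incrementing.
    $user_counts{$user}{$month}++;
}

# The entry for a user is a scalar holding a reference to a hash.
my $r_counts = $user_counts{'khxz766'};
print ref($r_counts), "\n";

# These three expressions all reach the same count.
print $user_counts{'khxz766'}{'Aug-2008'}, "\n";
print $user_counts{'khxz766'}->{'Aug-2008'}, "\n";
print $r_counts->{'Aug-2008'}, "\n";
```

The arrow between the two subscripts is optional, which is why the first two print statements are equivalent.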

        To extract the per-user data:

        foreach my $user (sort keys %user_counts) {
          my $r_counts = $user_counts{$user} ;
          print "$user:\n" ;
          foreach my $month (sort_months(keys %$r_counts)) {
            my $count = $user_counts{$user}{$month} ;
            print " $month $count\n" ;
          } ;
        } ;
        where $r_counts is a reference to a hash that gives the count per month. keys %$r_counts returns the keys in "the hash" ('%') "referred to by the scalar $r_counts".

        Depending on how you want the results organised, you may want to change how it's printed or even how it's stored -- the essence is that you can extract the keys using keys and then use them in whatever order you want to look up the data you have collected.

        His/Her Grace chose the $month as the primary key, and chose to hold the totals count as the conventional user 'TOTAL' -- you can make up your own mind which order the keys should be in, and whether you think there's a real chance of a real user called 'TOTAL'.

        Given ikegami's code:
        while (<$fh_log>) {
            my ($month, $user) = /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/
                or next;
            $flushes{$month}{$user}++;
            $flushes{$month}{TOTAL}++;
        }
        which produces a data structure similar to what this code does:
        my %flushes = (
            'Aug-2008' => {
                'ces'   => 5,
                'cjc'   => 7,
                'TOTAL' => 12,
            },
            'Jul-2008' => {
                'mhs'   => 1,
                'ces'   => 3,
                'cjc'   => 4,
                'TOTAL' => 8,
            },
        );
        you could do this to get the output:
        for my $month ( keys %flushes ) {
            print $month . "\n";
            print 'Total: ' . $flushes{$month}{TOTAL} . "\n#####\n";
            for my $user ( sort keys %{ $flushes{$month} } ) {
                next if $user eq 'TOTAL';
                printf "%-20s%3d\n", $user, $flushes{$month}{$user};
            }
            print "\n";
        }
        If you want higher marks, you might want to use strictures and warnings. That will require declaring %flushes in the proper scope. You'll need to add opening the file and error checks to his code, anyway.
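A rough sketch of what that might look like with strictures, warnings and a checked open. To keep it self-contained, the "file" here is a made-up string opened as an in-memory filehandle; for a real log you would pass the file name to open instead of \$sample_log. The sub name count_flushes is an invention for the example, wrapped around ikegami's loop.

```perl
use strict;
use warnings;

# Reads log lines from an open filehandle and returns a reference to a
# hash of the form $flushes{$month}{$user}, with a 'TOTAL' entry per month.
sub count_flushes {
    my ($fh_log) = @_;
    my %flushes;    # declared in this scope, as 'use strict' requires

    while ( my $line = <$fh_log> ) {
        my ( $month, $user ) =
            $line =~ /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/
            or next;
        $flushes{$month}{$user}++;
        $flushes{$month}{TOTAL}++;
    }
    return \%flushes;
}

# Made-up sample data standing in for the real log file.
my $sample_log = <<'END_LOG';
LOGGING: 29-Aug-2008 09:30:17: STARTING FLUSH: khxz766
LOGGING: 29-Aug-2008 09:40:07: CLOSING FLUSH: khxz766
LOGGING: 30-Aug-2008 10:02:44: STARTING FLUSH: abc123
LOGGING: 02-Sep-2008 08:15:00: STARTING FLUSH: khxz766
END_LOG

# Always check that open succeeded; $! holds the reason if it did not.
open my $fh_log, '<', \$sample_log
    or die "Cannot open log: $!\n";
my $flushes = count_flushes($fh_log);
close $fh_log;

for my $month ( sort keys %$flushes ) {
    print "$month: $flushes->{$month}{TOTAL}\n";
}
```

Returning a reference from the sub keeps %flushes out of global scope; the printing loops shown above work the same way once %flushes is replaced by %$flushes (or $flushes->{...}).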

        If you'd rather sort by uses (descending) than by their user names, you could try this instead:

        for my $month ( keys %flushes ) {
            print $month . "\n";
            print 'Total: ' . $flushes{$month}{TOTAL} . "\n#####\n";
            for my $user (
                sort { $flushes{$month}{$b} <=> $flushes{$month}{$a} }
                keys %{ $flushes{$month} }
            ) {
                next if $user eq 'TOTAL';
                printf "%-20s%3d\n", $user, $flushes{$month}{$user};
            }
            print "\n";
        }

        Now, is there a particular part of that you're having problems understanding or did you just want someone to write the code for you?

Re: Extracting and counting string occurences
by apl (Monsignor) on Aug 29, 2008 at 11:36 UTC