Extracting and counting string occurences

joec_ has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting and counting string occurences by ikegami (Patriarch) on Aug 29, 2008 at 08:47 UTC
`while (<$fh_log>) { my ($month, $user) = /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/ or next; $flushes{$month}{$user}++; $flushes{$month}{TOTAL}++; }`
Re^2: Extracting and counting string occurences by gone2015 (Deacon) on Aug 29, 2008 at 10:11 UTC
This will work fine if the input data can be completely relied on to be in exactly the form you expect. Two things can take a chunk out of your posterior: (a) the data is variable (in any way) -- anything generated by humans is suspect (of course), anything generated by systems written by humans may not be perfect (let's face it); (b) you didn't fully understand the problem. More insidious yet, nobody may notice that the output of this program is incorrect until something REALLY BAD happens.
So, inside the loop, I'd do something like: `if (/^LOGGING: \S+?-(\S+) \S+: (\S+) [^:]+: (\S+)$/) { my ($month, $what, $user) = ($1, $2, $3) ; if ($what eq 'STARTING') { $flushes{$month}{$user}++ ; $flushes{$month}{TOTAL}++ ; } ; } else { # Any error reporting you like, eg: chomp ; print STDERR "Line $.: '$_' ??\n" ; } ;`
This is not a lot of extra code. Of course, you could do much more validation of the input and/or be less strict about some things (eg white-space and case).
Who knows, you might be relieved when you discover, before the auditors do, that the file contains: `.... LOGGING: 31-Mar-2008 23:59:10: ENDING ON A HIGH: countryjoe LOGGING: 1-Apr-2008 00:01:22: STARTING FLUSH: gotcha ....`
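For anyone who wants to run the stricter check end to end, here is a minimal sketch. The sample lines are made up (modelled on the two log lines quoted above), the log is read from an in-memory filehandle rather than a real file, and `@rejected` is just a name chosen here for collecting the complaints. Note the `[^:]+` in the pattern: the second word of the message (such as FLUSH) is more than one character long.

```perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the two log lines quoted above.
my $log = <<'END';
LOGGING: 31-Mar-2008 23:59:10: ENDING ON A HIGH: countryjoe
LOGGING: 1-Apr-2008 00:01:22: STARTING FLUSH: gotcha
LOGGING: 2-Apr-2008 09:15:00: STARTING FLUSH: gotcha
this line is not in the expected format
END

my (%flushes, @rejected);

# Open the string as if it were a file, so the loop looks like the real thing.
open my $fh_log, '<', \$log or die "Cannot open in-memory log: $!";
while (my $line = <$fh_log>) {
    chomp $line;
    if ($line =~ /^LOGGING: \S+?-(\S+) \S+: (\S+) [^:]+: (\S+)$/) {
        my ($month, $what, $user) = ($1, $2, $3);
        next unless $what eq 'STARTING';    # well-formed, but not a flush start
        $flushes{$month}{$user}++;
        $flushes{$month}{TOTAL}++;
    }
    else {
        # Keep the complaint, with its line number, instead of skipping silently.
        push @rejected, "Line $.: '$line' ??";
    }
}
close $fh_log;

print STDERR "$_\n" for @rejected;
print "Apr-2008 total: $flushes{'Apr-2008'}{TOTAL}\n";
```

With this sample the ENDING line is recognised but not counted, and only the last line is reported as unparseable. Whether an unexpected line should be a warning or a fatal error depends on how much you trust the log.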
Re^2: Extracting and counting string occurences by joec_ (Scribe) on Aug 29, 2008 at 10:27 UTC
Hi, thanks for the quick response. I am wondering now how I output the data. If I try: `print $flushes{$month}{$user}; print $flushes{$month}{TOTAL};` outside the while loop, it just prints hundreds of random numbers. I am really new to Perl :) Thanks. Joe.
Re^3: Extracting and counting string occurences by gone2015 (Deacon) on Aug 29, 2008 at 15:21 UTC
Well, I recommend doing some reading! His/Her grace, ikegami (Archbishop), used a hash (`%flushes`) to collect the results. Hashes are a wonderful thing, and worth getting to grips with. However, his/her grace did use a "two dimensional" hash, which is rather throwing you into the murkier part of the deep end. So, let's simplify this a bit.
Your first requirement was to count the number of whatever they were, on a monthly basis. The regular expression extracted the month name and year from the date into the string `$month`, for example `'Aug-2008'`. You could use the hash `%total_counts` to count the totals for each month, thus: `$total_counts{$month}++`.
To extract the totals for each month you could: `while (my ($month, $total) = each(%total_counts)) { print "$month: $total\n" ; } ;` where `each(%total_counts)` walks the hash and returns key and value pairs -- but in some apparently random order.
You could: `foreach my $month (sort keys %total_counts) { print "$month: $total_counts{$month}\n" ; } ;` where `keys %total_counts` returns a list of all of the keys in `%total_counts` (in some apparently random order), which is then sorted alphabetically -- so the output is the counts so sorted.
You probably want: `foreach my $month (sort_months(keys %total_counts)) { ....` where the subroutine `sort_months()` is left as an exercise -- taking in a list of month strings and returning a list in the required order.
Your second requirement was to count for each user, on a monthly basis. So you want a hash, say `%user_counts`, with an entry for each user: `$user_counts{$user}`. Each entry needs to be a separate count for each month... now things are getting tricky. A hash entry is a single scalar value; you cannot have hash entries which are arrays or hashes. However, you can have scalars which are references to arrays or hashes.
So, where the entries in `%total_counts` are simple (numeric) scalars, the entries in `%user_counts` are references to hashes, each hash being similar to `%total_counts` -- that is, the key is a month string (eg `'Aug-2008'`) and the value is the count for that month. The shorthand in Perl to refer to a count value in this structure is `$user_counts{$user}{$month}` -- meaning: (a) get the entry in `%user_counts` whose key is `$user`, (b) that entry refers to a hash, so get the entry in that hash whose key is `$month`.
To extract the per user data: `foreach my $user (sort keys %user_counts) { my $r_counts = $user_counts{$user} ; print "$user:\n" ; foreach my $month (sort_months(keys %$r_counts)) { my $count = $user_counts{$user}{$month} ; print " $month $count\n" ; } ; } ;` where `$r_counts` is a reference to a hash that gives the count per month. `keys %$r_counts` returns the keys in "the hash" (`'%'`) "referred to by the scalar `$r_counts`".
Depending on how you want the results organised, you may want to change how it's printed or even how it's stored -- the essence is that you can extract the keys using `keys` and then use them in whatever order you want to look up the data you have collected. His/Her Grace chose the `$month` as the primary key, and chose to hold the totals count as the conventional user 'TOTAL' -- you can make up your own mind which order the keys should be in, and whether you think there's a real chance of a real user called 'TOTAL'.
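If the `sort_months()` exercise is a sticking point, here is one possible sketch of it together with the two hashes described above. It assumes every month string looks like `'Aug-2008'` with an English three-letter abbreviation; `month_key`, `%month_number` and the `@events` sample data are names and values invented for this illustration, not anything from the original log.

```perl
use strict;
use warnings;

# Month abbreviations mapped to their position in the year.
my %month_number;
@month_number{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

# Turn 'Aug-2008' into a number such as 200808, which sorts chronologically.
sub month_key {
    my ($month_string) = @_;
    my ($name, $year) = split /-/, $month_string;
    return $year * 100 + $month_number{$name};
}

# Take a list of 'Mon-YYYY' strings and return them oldest first.
sub sort_months {
    return sort { month_key($a) <=> month_key($b) } @_;
}

# Hypothetical (month, user) pairs, as the regular expression would extract them.
my @events = (
    [ 'Aug-2008', 'ces' ], [ 'Jul-2008', 'cjc' ], [ 'Aug-2008', 'cjc' ],
    [ 'Dec-2007', 'ces' ], [ 'Aug-2008', 'ces' ],
);

my (%total_counts, %user_counts);
for my $event (@events) {
    my ($month, $user) = @$event;
    $total_counts{$month}++;
    $user_counts{$user}{$month}++;    # the inner hash is created on first use
}

# Totals per month, in calendar order.
for my $month (sort_months(keys %total_counts)) {
    print "$month: $total_counts{$month}\n";
}

# Counts per user, then per month within each user.
for my $user (sort keys %user_counts) {
    my $r_counts = $user_counts{$user};    # reference to that user's hash
    print "$user:\n";
    for my $month (sort_months(keys %$r_counts)) {
        print "  $month $r_counts->{$month}\n";
    }
}
```

Here `$r_counts->{$month}` and `$user_counts{$user}{$month}` reach the same value; the arrow form just reuses the reference already in hand. A month name that is not in `%month_number` would produce a warning, which is arguably what you want for suspect input.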
Re^3: Extracting and counting string occurences by mr_mischief (Monsignor) on Aug 29, 2008 at 16:21 UTC
Given ikegami's code: `while (<$fh_log>) { my ($month, $user) = /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/ or next; $flushes{$month}{$user}++; $flushes{$month}{TOTAL}++; }` which produces a data structure similar to what this code does: `my %flushes = ( 'Aug-2008' => { 'ces' => 5, 'cjc' => 7, 'TOTAL' => 12, }, 'Jul-2008' => { 'mhs' => 1, 'ces' => 3, 'cjc' => 4, 'TOTAL' => 8, }, );` you could do this to get the output: `for my $month ( keys %flushes ) { print $month . "\n"; print 'Total: ' . $flushes{$month}{TOTAL} . "\n#####\n"; for my $user ( sort keys %{ $flushes{$month} } ) { next if $user eq 'TOTAL'; printf "%-20s%3d\n", $user, $flushes{$month}{$user}; } print "\n"; }`
If you want higher marks, you might want to use strictures and warnings. That will require declaring `%flushes` in the proper scope. You'll need to add opening the file and error checks to his code, anyway.
If you'd rather sort by uses (descending) than by user name, you could try this instead: `for my $month ( keys %flushes ) { print $month . "\n"; print 'Total: ' . $flushes{$month}{TOTAL} . "\n#####\n"; for my $user ( sort { $flushes{$month}{$b} <=> $flushes{$month}{$a} } keys %{ $flushes{$month} } ) { next if $user eq 'TOTAL'; printf "%-20s%3d\n", $user, $flushes{$month}{$user}; } print "\n"; }`
Now, is there a particular part of that you're having problems understanding or did you just want someone to write the code for you?
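To make the remarks about strictures, scope and file handling concrete, here is a rough sketch of how the pieces might fit into one script. So that it runs on its own, it first writes a few made-up log lines to a temporary file (via the core `File::Temp` module); in real use you would drop that part and set `$path` to your actual log file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small, made-up log to a temporary file so the sketch runs on its own.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} <<'END';
LOGGING: 30-Jul-2008 10:00:00: STARTING FLUSH: cjc
LOGGING: 1-Aug-2008 11:30:00: STARTING FLUSH: ces
LOGGING: 2-Aug-2008 12:45:00: STARTING FLUSH: cjc
LOGGING: 2-Aug-2008 13:00:00: STARTING FLUSH: cjc
END
close $tmp_fh or die "Cannot close '$path': $!";

my %flushes;    # declared outside the loop, so it is still there for the report

open my $fh_log, '<', $path or die "Cannot open '$path': $!";
while (my $line = <$fh_log>) {
    # A failed match assigns nothing, so the "or next" skips the line.
    my ($month, $user) =
        $line =~ /^LOGGING: \S+?-(\S+) \S+: STARTING FLUSH: (\S+)$/
        or next;
    $flushes{$month}{$user}++;
    $flushes{$month}{TOTAL}++;
}
close $fh_log or die "Cannot close '$path': $!";

# Report: one block per month, users listed alphabetically.
for my $month (sort keys %flushes) {
    print "$month\nTotal: $flushes{$month}{TOTAL}\n#####\n";
    for my $user (sort keys %{ $flushes{$month} }) {
        next if $user eq 'TOTAL';
        printf "%-20s%3d\n", $user, $flushes{$month}{$user};
    }
    print "\n";
}
```

The months here come out in plain string order (`sort keys`), which is alphabetical rather than chronological; swap in whatever month ordering you prefer.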
Re: Extracting and counting string occurences by apl (Monsignor) on Aug 29, 2008 at 11:36 UTC
I'd strongly suggest reading Pattern Matching, Regular Expressions, and Parsing and Data Type: Hash. I've found it's easier to learn a language by writing in it rather than being given examples of it.