ctaustin has asked for the wisdom of the Perl Monks concerning the following question:

This is what the source data looks like.
Element 1 - date
Element 2 - user
Element 3 - agency
Element 4 - url
Element 5 - garbage
Element 6 - type (pass, fail, new...)
Element 7 - module
Element 8 - more garbage


05/Jun/2003:00:01:23 user1 agency1 url1 garbage pass mod1 more_garbage
05/Jun/2003:00:03:17 user2 null url1 garbage fail mod1 more_garbage
05/Jun/2003:00:03:42 user1 agency1 url1 garbage pass mod1 more_garbage
05/Jun/2003:00:05:03 user6 agnecy2 url1 garbage pass mod1 more_garbage
05/Jun/2003:00:08:34 user3 agnecy2 url1 garbage pass mod1 more_garbage
05/Jun/2003:00:11:59 user4 agency2 url2 garbage fail mod1 more_garbage
05/Jun/2003:00:14:30 user5 agnecy2 url1 garbage new_a mod1 more_garbage
05/Jun/2003:00:15:02 user1 agency1 url1 garbage pass mod1 more_garbage
05/Jun/2003:00:16:56 user7 agency2 url2 garbage pass mod2 more_garbage
05/Jun/2003:00:17:31 user1 agency1 url1 garbage fail mod1 more_garbage
05/Jun/2003:00:17:31 user1 agency1 url1 garbage pass mod2 more_garbage


There are six of these log files generated each day. What I need to get out of them is this:

Agency URL Pass Fail New Unique Module
agency1 url1 3 1 0 1 mod1
agnecy1 url1 1 0 0 1 mod2
agency2 url1 2 1 1 3 mod1
agency2 url2 1 0 0 1 mod2


and so on. As you can see from this example, a failure doesn't count towards the unique number - the unique is only the number of successfully logins. The pass count represents the total number of logins - thus caputuring reoccuring logins and so forth. I am sure that you get what I am getting at. Stats at how and what is being used.

Once I get these lines of stats, I can then throw the results into a database for more advanced querying and storage. This part I am good time go on. I am just having some trouble gathering the stats in the form that I want/need.

I have tried using nested hashes like
my (%agency,(%url,(%module,(%type,$type_count)))); my $pass_count=0; my $fail_count=0; my $new_count=0; while ($line=<FILE>) { ($e1,$e2,$e3,$e4,$e5,$e6,$e7,$e8)=(split/\s+/,$line); $pf=substr($pf,0,4); if ($pf eq "pass") { $agency{$e4,{$e7,{$e6,$pass_count++}}}; if ($pf eq "fail) { $agency{$e4,{$e7,{$e6,$fail_count++}}}; if ($pf eq "new_") { $agency{$e4,{$e7,{$e6,$new_count++}}};
This didn't seem to work. Part of this may be due to the fact that I can't figure out how to print the results, the other is that I am not convinced that the hashes are getting properly populated.

I then tried an object oriented way
package Agency; sub new { my $class={}; $class-> {agency}=undef; $class->{url}=undef; $class->{pass}=undef; $class->{fail}=undef; $class->{new}=undef; $class->{module}=undef; bless $class; return $class; } sub init { my $class=shift; $class->{agency}=shift; $class->{url}=shift; $class->{pass}=shift; $class->{fail}=shift; $class->{new}=shift; $class->{module}=shift; } sub display { my $class=shift; print "Agency: $class->{agency} URL: $class->{url} Pass: $class- +>{pass} Fail: $class{fail} New: $class{new} Module: $class->{modu +le}\n" } package main; my $new_agency=Agency->new(); foreach $file (glob('file_name*.txt')) { open (FILE,$file)||die ("unable to open $file\n"); print "working on $file\n"; while ($line=<FILE>) { ($ts,$w,$agency,$url,$x,$pf,$module,$z)=(split/\s+/,$line); $pf=substr($pf,0,4); if ($pf eq "pass") { $new_agency->init($agency,$url,$pass_count++,$fail_coun +t,$new_count,$module); } elsif ($pf eq "fail") { $new_agency->init($agency,$url,$pass_count,$fail_count+ ++,$new_count,$module); } elsif ($pf eq "new_") { $new_agency->init($agency,$url,$pass_count,$fail_count, +$new_count++,$module); } } #while } # foreach $new_agency->display();

This didn't work either, it just gave me the agency, and module information for the last line in the last file read - with one counter being properly incremented. I tried doing the display inside the loop but this didn't help. I am guessing the problem lies somewhere with the fact that I am using "init" in each case.

Despite the fact that everything else that I have tried has failed, I think that one of these two options seems the most legitimate. So I am now asking for both advice and opinions on what the best method for obtaining this result set would be.

Replies are listed 'Best First'.
Re: nested hashes or object oriented
by BrowserUk (Patriarch) on Jun 11, 2003 at 21:10 UTC

    I think the whole OO -v- procedural debate is mute. The problem remains the same regardless of the way you choose to code it. Personally, this seems like a straight forward hash problem, with the caveat that your 'primary key' is a combination of 3 fields, agency/url/module. You then need secondary keys of pass/fail/new/user. For the first three, you increment a count, for the fourth, create a hash with the user names as keys, and the use number accumulated as your Unique count.

    Hmm. Words make that about as clear as mud. Maybe a little code will get you started.

    #! perl -slw use strict; my %stats; while( <DATA> ) { my ($date, $user, $agency, $url, undef, $type, $module, undef) = s +plit ' '; $stats{"$agency $url $module"}{$type}++; $stats{"$agency $url $module"}{user}{$user}=undef; } for my $key ( sort keys %stats ) { my ($agency, $url, $module) = split ' ', $key; printf "%8s %5s %3d %3d %3d %3d %s\n", $agency, $url, $stats{$key}{pass} ||0, $stats{$key}{fail} ||0, $stats{$key}{new_a} ||0, scalar keys %{ $stats{$key}{user} }, $module; } __DATA__ 05/Jun/2003:00:01:23 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:03:17 user2 null url1 garbage fail mod1 more_garbage 05/Jun/2003:00:03:42 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:05:03 user6 agency2 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:08:34 user3 agency2 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:11:59 user4 agency2 url2 garbage fail mod1 more_garbage 05/Jun/2003:00:14:30 user5 agency2 url1 garbage new_a mod1 more_garbag +e 05/Jun/2003:00:15:02 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:16:56 user7 agency2 url2 garbage pass mod2 more_garbage 05/Jun/2003:00:17:31 user1 agency1 url1 garbage fail mod1 more_garbage 05/Jun/2003:00:17:31 user1 agency1 url1 garbage pass mod2 more_garbage

    output

    D:\Perl\test>test agency1 url1 3 1 0 1 mod1 agency1 url1 1 0 0 1 mod2 agency2 url1 2 0 1 3 mod1 agency2 url2 0 1 0 1 mod1 agency2 url2 1 0 0 1 mod2 null url1 0 1 0 1 mod1

    ps. Please use meaningful variable names. Your table at the top of your post showing what each field was in the input records would have been unnecessary if your code used these names instead of $e1 $e2 etc.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: nested hashes or object oriented
by djantzen (Priest) on Jun 11, 2003 at 20:55 UTC

    First, use strict and use warnings. Next, your initialization of the nested hashes won't work because the buckets of a hash can contain only scalars, not hashes themselves, which means you need to use references. my (%agency,(%url,(%module,(%type,$type_count)))); does not create nested hashes, it just passes nested lists to my. Also, assuming you have a nested structure you cannot access it like $agency{$e4,{$e7,{$e6,$pass_count++}}};, but rather like $agency{e4}{e7}{e6} = $pass_count++;. Note though that $pass_count is a post-incremented number -- not a reference -- and so the number that actually gets assigned to that hash slot will be one less than the value of $pass_count in the enclosing scope after that line. Use a preincrement or a refererence there.

    As for the object attempt, the problem is that you're class represents a single row in the log file, and so every time you loop over the Agency object you overwrite the values from previous iterations. That's why you see only the last entry when you display the results. Double check the counter as well because it will suffer the same flaw as I noted above (post-incremented). You either need to create a collection of objects in an array and iterate over that calling Agency::display on each later in the code, or create a new Agency and display it on every iteration inside the foreach loop.

    A core idea of OO is encapsulating behavior and data behind an interface, which means that your main method ought not to be figuring out which version of Agency::init to use. Rather, pass the data to the constructor and let it do the work of figuring out what the stuff is. The constructor should then call init and hand you back a complete and initialized object. Also, it is customary to refer to the first element in @_ in a method by '$class' if it is a class method (i.e., new) and by '$this' or '$self' if an instance method (i.e., init, display).

    See also: Damian Conway's ten rules for when to use OO, intro to references, References Quick Reference


    "The dead do not recognize context" -- Kai, Lexx
(jeffa) Re: nested hashes or object oriented
by jeffa (Bishop) on Jun 11, 2003 at 21:34 UTC

    "Once I get these lines of stats, I can then throw the results into a database for more advanced querying and storage."

    Seems to me that this is advanced querying. :) SQL has some handy tools to group and count. Anyways, since you are going to use a database later on, sounds like this script is a throw away. If so, OO is waaay overkill. Just get the job done procedurally and move on.

    My first thoughts were to use DBD::AnyData, but it doesn't support "advanced" SQL commands such as HAVING and GROUP BY ... shame, but here is my take on this problem with that module. Note that i don't quite fully understand what you are exactly trying to do with your counting (i didn't bother with 'new' and 'unique' hits), but this should serve as a starting point should you wish to explore DBD::AnyData. Also note that i changed your file from space delimeted to tab delimted and named the file data.txt and replaced all instances of agnecy to agency - was that typo intentional? You also had once agency listed as null - occording to your desired output, i guesstimated that the agency was number 2.

    use strict; use warnings; use DBI; use Data::Dumper; my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):'); $dbh->func( qw(log Tab data.txt), { col_names => 'date,user,agency,url,garbage,success,module' }, 'ad_catalog', ); my %pass = fetch_rows('pass'); my %fail = fetch_rows('fail'); print join("\t",qw(Agency Url Pass Fail Module)),"\n"; # loop thru @pass, try to fetch from @fail # i'll let you decide how to sort ;) for my $key (keys %pass) { my ($agency,$url,$module) = split(':',$key); my $pass = $pass{$key}; my $fail = $fail{$key} || 0; print join("\t",$agency,$url,$pass,$fail,$module),"\n"; } # trick here is to append agency, url, and module # so they are treated as one unique entity # too bad you can't use unique with AnyDBD ... # watch out, i use a colon as the delimter - YMMV sub fetch_rows { my $success = shift; my $sth = $dbh->prepare(' SELECT agency, url, module FROM log WHERE success = ? '); $sth->execute($success); my %hash; $hash{ join(':', $_->{agency}, $_->{url}, $_->{module}) }++ while $_ = $sth->fetchrow_hashref; return %hash; }
    And here is the output on your data file (with said 'corrections'):
    Agency  Url     Pass    Fail    Module
    agency2 url1    2       1       mod1
    agency1 url1    3       1       mod1
    agency1 url1    1       0       mod2
    agency2 url2    1       0       mod2
    
    UPDATE: on second thought ... just listen to BrowserUk. :) The important concept is getting the right unique rows, and DBD::AnyDB is probably overkill for this problem, only because it doesn't handle COUNT and GROUP BY. (It's still a fabulous module for data converstions though.)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: nested hashes or object oriented
by pzbagel (Chaplain) on Jun 11, 2003 at 20:48 UTC

    Use Data::Dumper to print the hashes out so you can see what gets populated.

    use Data::Dumper; print Dumper(\%agency);

    That is also the weirdest nested hash syntax I have ever seen. I think it's supposed to look more like this:

    $agency{$e4}->{$e7}->{$e6}++;

    That change starts to create an interesting hash once it is printed with Data::Dumper. Give it a shot.

    Addendum:

    Other things I've noticed about your code.

  • In the nested hash example, where are you getting $pf from? Bad cut and paste from the OO version? Did you mean $e6?
  • You are not storing the agency in the first level of the hash(stored in $e3). You may need:
    $agency{$e3}->{$e4}->{$e7}->{$e6}++;
  • The way you are incrementing the counters is incorrect, you want to check those more closely so you get the correct results. Notice that you are storing the current value+1. What you probably wanted was to increment the value in the hash by one AND increment the cumulative count(which I assume the *_count variables are for).
      You can also omit superfluous arrows, so
      $agency{$e3}->{$e4}->{$e7}->{$e6}++;
      can be
      $agency{$e3}{$e4}{$e7}{$e6}++;
Re: nested hashes or object oriented
by artist (Parson) on Jun 11, 2003 at 21:35 UTC
    The way I understand your logic, the following code should work even for the cases like
    Pass Fail New
    p1   
    -     p1 
    p1
    p1
    p2
    -     p3
    -     -    p4
    
    Your Unique will be 3 in this case. (for p1,p2,p4).

    I have assumed that for the output table, your primary key consists of agency, url and module as figured out from the results.

    use strict; use warnings; use Data::Dumper; my $hash; while(<DATA>){ chomp; my @fields = split /\s+/; my ($agency,$url,$mod) = @fields[2,3,6]; my ($user,$state) = @fields[1,5]; my $item = $hash->{$agency}{$url}{$mod}; if($state eq 'pass'){ $hash->{$agency}{$url}{$mod}{pass} += 1; }elsif($state eq 'fail'){ $hash->{$agency}{$url}{$mod}{fail} += 1; } elsif($state =~ /new_/){ $hash->{$agency}{$url}{$mod}{new} += 1; } $hash->{$agency}{$url}{$mod}{users}{$user} += 1 if $state ne 'fail +'; } print join "\t" => qw(Agency URL Pass Fail New Unique Module),"\n"; foreach my $agency (keys %{$hash}){ foreach my $url (keys %{$hash->{$agency}}){ foreach my $mod (keys %{$hash->{$agency}{$url}}){ my $item = $hash->{$agency}{$url}{$mod}; my $pass = $item->{pass} || 0; my $fail =$item->{fail} || 0; my $new = $item->{new} || 0; my $unique =scalar keys %{$item->{users}}; print join "\t" => ($agency,$url,$pass,$fail,$new,$unique,$mod +),"\n"; } } } __DATA__ 05/Jun/2003:00:01:23 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:03:17 user2 null url1 garbage fail mod1 more_garbage 05/Jun/2003:00:03:42 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:05:03 user6 agnecy2 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:08:34 user3 agnecy2 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:11:59 user4 agency2 url2 garbage fail mod1 more_garbage 05/Jun/2003:00:14:30 user5 agnecy2 url1 garbage new_a mod1 more_garbag +e 05/Jun/2003:00:15:02 user1 agency1 url1 garbage pass mod1 more_garbage 05/Jun/2003:00:16:56 user7 agency2 url2 garbage pass mod2 more_garbage 05/Jun/2003:00:17:31 user1 agency1 url1 garbage fail mod1 more_garbage 05/Jun/2003:00:17:31 user1 agency1 url1 garbage pass mod2 more_garbage
    Agency	URL	Pass	Fail	New	Unique	Module	
    null	url1	0	1	0	0	mod1	
    agency1	url1	3	1	0	1	mod1	
    agency1	url1	1	0	0	1	mod2	
    agency2	url2	0	1	0	0	mod1	
    agency2	url2	1	0	0	1	mod2	
    agnecy2	url1	2	0	1	3	mod1
    
    artist
      I appreciated everyones feedback. I knew that I was in over my head and it is clear that I have a seemingly unmanageable path of learning and discover ahead.

      I will work on implementing some of these ideas and see what I come up with.

      Thanks again.