vili has asked for the wisdom of the Perl Monks concerning the following question:

Greetings once again,
Extracting valuable information from custom apache logs generates some 30MB files on a daily basis. What I need to do
is come up with weekly/monthly/etc. summaries for visits, category visits, returning users, and so on, on the fly (i.e. the next day).
The format of the daily files is as follows:
UniqueUser LastVisitTime PrimaryCategory SecondaryCategory PageViews MerchantClicks Sessions MarketingMode ZIP
0ce46e475f94ecb9 01/Jan/2003:16:00:08 Computers Computers 1 0 1 no_mode 00000
188c0530ac92475a 01/Jan/2003:16:00:02 Computers Computers 1 1 1 no_mode 44614
189a4a75d0cbad03 01/Jan/2003:16:00:01 No_category No_category 1 0 1 no_mode 00000
189e45678964fcf6 01/Jan/2003:16:00:07 Electronics Electronics 1 0 1 no_mode 00000
18a416ba3d3c7a8d 01/Jan/2003:16:00:12 No_category No_category 2 0 1 no_mode 00000
18aa11982e30e1ef 01/Jan/2003:16:00:07 No_category No_category 1 0 1 no_mode 00000


the daily files are the output of:
print OUTFILE "$ut\t${$users{$ut}}[0]\t${$users{$ut}}[1]\t${$users{$ut}}[2]\t${$users{$ut}}[3]\t${$users{$ut}}[4]\t${$users{$ut}}[5]\t${$users{$ut}}[6]\t${$users{$ut}}[7]\n";
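(For what it's worth, that print reads a little easier as a join over an array slice; an equivalent one-liner, assuming @{$users{$ut}} holds exactly those eight fields:)

print OUTFILE join("\t", $ut, @{ $users{$ut} }[0 .. 7]), "\n";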

As I'm going through the daily files, I'd like to be able, for a particular unique "ut", to keep it as a key, replace the date with the
last seen date, and add up all the numeric values such as page views, clicks, and sessions.
So here I am trying to "consolidate" all of this information into a giant hash. I can already sense the disapproval;
this, however, is the only thing that comes to mind. If anyone has any suggestions on this, they would be greatly
appreciated. And if there is another way to do it, which by definition there is, I'd love to find out about it.

Thank You in advance,
~vili
Addicted to sniffing 802.11



Re: Merging Hashes adding multiple values for keys, and replacing others
by perrin (Chancellor) on Aug 21, 2003 at 18:48 UTC
    That's too much data for an in-memory hash, but you could probably make it work with a dbm file and something like MLDBM.
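    (A minimal sketch of what that could look like; the file name and the read-modify-write loop here are assumed, not from the thread. The main MLDBM caveat is that nested values can't be updated in place:)

    use MLDBM qw(DB_File Storable);   # DBM-backed hash with serialized values
    use Fcntl;

    tie my %users, 'MLDBM', 'users.dbm', O_CREAT | O_RDWR, 0640
        or die "Cannot tie users.dbm: $!";

    while (<>) {
        my ($ut, $time, $primary, $secondary, $views,
            $clicks, $sessions, $mode, $zip) = split;

        # MLDBM can't update a nested value in place: fetch the whole
        # record, modify the copy, then store it back.
        my $rec = $users{$ut} || {};
        $rec->{LastVisitTime}   = $time;
        $rec->{PageViews}      += $views;
        $rec->{MerchantClicks} += $clicks;
        $rec->{Sessions}       += $sessions;
        $users{$ut} = $rec;
    }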
Re: Merging Hashes adding multiple values for keys, and replacing others
by CombatSquirrel (Hermit) on Aug 21, 2003 at 18:18 UTC
    How about the following code:
    #!perl
    use strict;
    use warnings;
    use Data::Dumper;

    my %monthLookup = ( Jan => 1, Feb => 2, Mar => 3, Apr => 4,
                        May => 5, Jun => 6, Jul => 7, Aug => 8,
                        Sep => 9, Oct => 10, Nov => 11, Dec => 12 );

    sub CompDates($$) {
        my ($first, $second) = @_;
        $first  = [($first  =~ m!(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})!)];
        $second = [($second =~ m!(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})!)];
        $first->[1]  = $monthLookup{$first->[1]};
        $second->[1] = $monthLookup{$second->[1]};
        # compare year first, then month, day, and time of day
        return (($first->[2] <=> $second->[2]) ||
                ($first->[1] <=> $second->[1]) ||
                ($first->[0] <=> $second->[0]) ||
                ($first->[3] <=> $second->[3]) ||
                ($first->[4] <=> $second->[4]) ||
                ($first->[5] <=> $second->[5]));
    }

    my %users;
    while (<DATA>) {
        my ($user, $time, $primary, $secondary, $views,
            $clicks, $sessions, $mode, $zip) = split;
        if ($users{$user}) {
            if (CompDates($users{$user}->{LastVisitTime}, $time) < 0) {
                $users{$user}->{LastVisitTime} = $time;
            }
        }
        else {
            $users{$user}->{LastVisitTime}     = $time;
            $users{$user}->{PrimaryCategory}   = $primary;
            $users{$user}->{SecondaryCategory} = $secondary;
            $users{$user}->{MarketingMode}     = $mode;
            $users{$user}->{ZIP}               = $zip;
        }
        $users{$user}->{PageViews}      += $views;
        $users{$user}->{MerchantClicks} += $clicks;
        $users{$user}->{Sessions}       += $sessions;
    }
    print Dumper(\%users);

    __DATA__
    0ce46e475f94ecb9 01/Jan/2003:16:00:08 Computers Computers 1 0 1 no_mode 00000
    188c0530ac92475a 01/Jan/2003:16:00:02 Computers Computers 1 1 1 no_mode 44614
    189a4a75d0cbad03 01/Jan/2003:16:00:01 No_category No_category 1 0 1 no_mode 00000
    189e45678964fcf6 01/Jan/2003:16:00:07 Electronics Electronics 1 0 1 no_mode 00000
    18a416ba3d3c7a8d 01/Jan/2003:16:00:12 No_category No_category 2 0 1 no_mode 00000
    18aa11982e30e1ef 01/Jan/2003:16:00:07 No_category No_category 1 0 1 no_mode 00000
    This should do it, as far as I understood your problem. Post a reply if it does not.
    Cheers, CombatSquirrel.
      Thanks a lot CombatSquirrel, that was very close to what I'm looking for. I did
      rewrite the date part, as I have decided to use a unix timestamp, which simplifies things some.
      And may I say you RoCk! Here's the updated code:
      #!/usr/bin/perl
      use warnings;
      use strict;
      use Data::Dumper;
      use POSIX qw(mktime);

      sub apache2epoch {
          my ($datetime) = @_;
          my $i = 0;
          my %months = map { $_ => $i++ }
                       qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
          $datetime =~ m((\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+));
          # mktime takes sec, min, hour, mday, mon (0-based), year - 1900
          return mktime($6, $5, $4, $1, $months{$2}, $3 - 1900, 0, 0, -1);
      }

      my %users;
      while (<DATA>) {
          my ($user, $time, $primary, $secondary, $views,
              $clicks, $sessions, $mode, $zip) = split;
          if ($users{$user}) {
              # only move the last-visit time forward if more than
              # 30 minutes (1800 seconds) have passed
              if (apache2epoch($time) - apache2epoch($users{$user}->{LastVisitTime}) > 1800) {
                  $users{$user}->{LastVisitTime} = $time;
              }
          }
          else {
              $users{$user}->{LastVisitTime}     = $time;
              $users{$user}->{PrimaryCategory}   = $primary;
              $users{$user}->{SecondaryCategory} = $secondary;
              $users{$user}->{MarketingMode}     = $mode;
              $users{$user}->{ZIP}               = $zip;
          }
          $users{$user}->{PageViews}      += $views;
          $users{$user}->{MerchantClicks} += $clicks;
          $users{$user}->{Sessions}       += $sessions;
      }
      print Dumper(\%users);

      __DATA__
      188c0530ac92475a 01/Jan/2003:16:00:02 Computers Computers 1 1 1 no_mode 44614
      0ce46e475f94ecb9 01/Jan/2003:16:00:08 Computers Computers 1 0 1 no_mode 00000
      189e45678964fcf6 01/Jan/2003:16:00:07 Electronics Electronics 1 0 1 no_mode 00000
      189a4a75d0cbad03 01/Jan/2003:16:00:01 No_category No_category 1 0 1 no_mode 00000
      18a416ba3d3c7a8d 01/Jan/2003:16:00:12 No_category No_category 2 0 1 no_mode 00000
      18aa11982e30e1ef 01/Jan/2003:16:00:07 No_category No_category 1 0 1 no_mode 00000
      0ce46e475f94ecb9 01/Jan/2003:17:00:08 Computers Computers 1 0 1 no_mode 00000
      189a4a75d0cbad03 01/Jan/2003:17:00:01 No_category No_category 1 0 1 no_mode 00
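      (Side note: the core Time::Local module does the same conversion as POSIX::mktime and takes the year as-is; a minimal sketch of the sub under that assumption:)

      use Time::Local;

      my $i = 0;
      my %months = map { $_ => $i++ }
                   qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

      sub apache2epoch {
          my ($dt) = @_;
          my ($mday, $mon, $year, $h, $m, $s) =
              $dt =~ m{(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
          return timelocal($s, $m, $h, $mday, $months{$mon}, $year);
      }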
      btw, with the CompDates sub, if I encounter the same $user I get errors like:
      Use of uninitialized value in numeric comparison (<=>) at ./overall.pl line 28, <DATA> line 7.
      thanks again CombatSquirrel, and Cheers to you too

      ~vili
      sniff sniff 802.11
Re: Merging Hashes adding multiple values for keys, and replacing others
by eric256 (Parson) on Aug 21, 2003 at 17:57 UTC

    I'd think this would be a good case for a database, although I can't really grasp your problem. With a database you should be able to group by whichever column you want. Perhaps not the best solution, but the first one to come to mind.

    Maybe apache can log straight into a database; then you could just work with the data.
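    (A rough sketch of the sort of summary query that makes easy; the connection details and a visits table mirroring the daily-file columns are hypothetical:)

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:weblogs', 'user', 'password',
                           { RaiseError => 1 });
    # weekly per-category summary in one statement
    my $summary = $dbh->selectall_arrayref(q{
        SELECT primary_category,
               COUNT(DISTINCT unique_user) AS visitors,
               SUM(page_views)             AS views,
               SUM(merchant_clicks)        AS clicks
        FROM   visits
        WHERE  visit_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
        GROUP  BY primary_category
    });
    printf "%-15s %6d %6d %6d\n", @$_ for @$summary;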

    ___________
    Eric Hodges
      The problem with apache-to-mysql logging is that apache generates 7 million+ log entries per day
      across 7 servers. That is a huge database, and thus my reluctance to go that way. I guess I can turn
      to grep/awk/sort for some of the things... I was, however, hoping for a perl solution.
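      (Covering the grep/awk/sort ground in perl is a one-liner; a sketch, with daily.log standing in for a real file name:)

      # sum page views (field 5) per unique user (field 1), streaming
      perl -lane '$v{$F[0]} += $F[4]; END { print "$_\t$v{$_}" for sort keys %v }' daily.log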
      ~vili
        perl hashes take at least 1.5x as much space in memory as the original data -- so are you saying it makes more sense to you to build a huge hash to operate on in RAM than to use a real database (on disk), because the database on disk would be too large? I don't get it. With proper normalization the database should be smaller than the original data.

        -Waswas

        The huge database won't be any bigger than your log files (at least it shouldn't be).

        Also, you gain considerable speed and power in removing old entries on the fly. Just my two cents though. It's definitely possible that file handling is better for you.

        ___________
        Eric Hodges
Re: Merging Hashes adding multiple values for keys, and replacing others
by tadman (Prior) on Aug 21, 2003 at 18:00 UTC
    You might just want to use an RDBMS like MySQL instead of rolling your own via some wacky hash. At least, this is what comes to mind first. What you're talking about here is rather murky.

    Once in a database like that, you can perform all kinds of queries on your data.
      Murky indeed, log analysis that is.
      ~vili