hallikpapa has asked for the wisdom of the Perl Monks concerning the following question:

I have two sets of Apache logs, and I am trying to compare the Date, IP, and User Agent data between them to find matches. I am using Apache::LogRegex to load all of the data into a hash. I am pretty rusty on my Perl and am trying to find the best way.

I have 3 log files that have GET requests in them, and 3 log files that have POST requests in them. I want to do the compare against the fields I listed above (Date, IP, or UserAgent) between the two hashes that contain GET & POST logs to find matches.

So I can read each line and access the data easily, but I am stuck on how to get the contents of all 6 files into two separate hashes. Each line comes as a hash, so perhaps a hash of hashes for both GET & POST logs, or an array of hashes? Is my push statement below the best way to do it, or do I need to set up some kind of keys so they don't overwrite each other?

As a side note, what do you think would be the best way to compare fields between two arrays of hashes? I could come up with some hacky, really process-expensive way by doing lots of loops, but I am assuming there is some faster way.

Please help with a little direction on the best way to accomplish this.
#!/usr/bin/perl -w
use Apache::LogRegex;

my $lr;
my $log_format = '"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""';

eval { $lr = Apache::LogRegex->new($log_format) };
die "Unable to parse log line: $@" if ($@);

my $get_logs = ("march-logs/march-bannat.txt",
                "march-logs/march-logs-web2/march-bannat.txt",
                "march-logs/march-logs-web3/march-bannat.txt");
my $post_logs = ("march-logs/march-post.txt",
                 "march-logs/march-logs-web2/march-post.txt",
                 "march-logs/march-logs-web3/march-post.txt");

my %data;
my %getRecords;
my $postRecords;

foreach ($get_logs) {
    my @array = &logToHash($_);
}

sub logToHash {
    my $file = $_;
    my %hash;
    my @AoH;
    open LOG, $file or die $!;
    while ( my $line_from_logfile = <LOG> ) {
        eval { %data = $lr->parse($line_from_logfile); };
        if (%data) {
            push @AoH, %data;
        }
    }
    return @AoH;
}
I noticed that when I print Dumper(\@array) after the subroutine returns, it gets a bunch of data, but it prints them like this, key on top of value, instead of like $key => $value. Is this correct? Am I pushing the data incorrectly?
'"%h',
'access_log.9.gz:XX.XX.XX.XX',
'%{Referer}i',
'-',
'%t',
'[19/Mar/2009:02:03:46 -0500]',
'%r',
'GET /02230909 HTTP/1.1',
'%{User-Agent}i\\""',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; GTB5; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'%b',
'20',
'%l',
'-',
'%u',
'-',
'%>s',
'302',

Replies are listed 'Best First'.
Re: Adding data to hashes & comparing
by moritz (Cardinal) on Mar 30, 2009 at 20:26 UTC
    push @AoH, %data

    This flattens the hash %data, which is not what you want. Instead push @AoH, \%data, which pushes a (not flat) reference.

    Please read perlreftut on how you can use references to build complex data structures.
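
    As a minimal sketch of the difference (the two sample records below are made up and simply stand in for what $lr->parse() returns), compare pushing the hash itself with pushing a reference to it:

```perl
use strict;
use warnings;

# Two hypothetical parsed records standing in for $lr->parse() results.
my @parsed = (
    { '%h' => '10.0.0.1', '%>s' => '200' },
    { '%h' => '10.0.0.2', '%>s' => '302' },
);

my (@flat, @AoH);
for my $record (@parsed) {
    my %data = %$record;
    push @flat, %data;        # flattens: adds key, value, key, value, ...
    push @AoH,  { %data };    # adds ONE element: a reference to a fresh copy
}

printf "flat has %d elements, AoH has %d\n", scalar @flat, scalar @AoH;
print "second host: " . $AoH[1]{'%h'} . "\n";
```

    One caveat: if %data is declared once outside the loop, as in the original script, \%data is the same reference on every iteration, so every array element would end up showing the last record. Copying with { %data }, or declaring my %data inside the loop, avoids that.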

      Ah. That makes sense. I was wondering what happened. So I have two arrays, @get_array and @post_array, which are arrays of hashes. What is a more efficient way than doing nested foreach loops when I do my comparing? That seems like a bunch of unneeded processing, and there might be some kind of seek, or....?
Re: Adding data to hashes & comparing
by ELISHEVA (Prior) on Mar 30, 2009 at 20:41 UTC

    Some quick observations about your code so far:

    • Unless you are using a very old perl, you don't need the & before function calls: logToHash($_) will do just fine.
    • You will find your debugging life much easier if you add use strict; use warnings; immediately after the shebang line.

    As to your question about data structures and comparisons, perhaps you could be more specific about how you want to compare fields - are you counting the GET/POST requests for each user agent? each IP? each date? some combination of the three?

    Your plans will affect your choice of data structure. For instance, if you want to count all HTTP requests with the same user agent, you will need a HoAoH: a hash where each key is a user agent and each value is an array containing all of the request hashes sharing that user agent.

    You can build this hash most efficiently while you are reading in the data. Depending on your goals, you may be able to eliminate the AoH array entirely and rely solely on the HoAoH hashes.
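
    A rough sketch of building such a HoAoH as records arrive. The records and the field names ip and agent are invented for illustration; they are not the keys Apache::LogRegex returns:

```perl
use strict;
use warnings;

# Hypothetical records; 'ip' and 'agent' are illustrative field names only.
my @records = (
    { ip => '10.0.0.1', agent => 'Mozilla/4.0' },
    { ip => '10.0.0.2', agent => 'Opera/9.0'   },
    { ip => '10.0.0.3', agent => 'Mozilla/4.0' },
);

my %by_agent;    # user agent => [ request hashes ]
for my $rec (@records) {
    # The inner array is autovivified the first time an agent is seen.
    push @{ $by_agent{ $rec->{agent} } }, $rec;
}

for my $agent (sort keys %by_agent) {
    printf "%-12s %d request(s)\n", $agent, scalar @{ $by_agent{$agent} };
}
```

    Counting the requests for an agent is then just the size of the array stored under that key.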

    Best, beth

      For instance, I want to compare IP's in the get and post arrays to see if there are any matches.

      Then I will look at the User Agent to see if those match, and finally will also check the time stamps to see if they fall within a certain allotted time.

      Basically for a few days something wasn't being tracked, so I would like to go back and get the numerical data out of the GET request and associate it to the correct post request (which will have some user data in it). So I need to track timestamps, user agent, and the IP address to get as close as possible.
        If I am understanding you correctly, you could store each record in a HoHoHoA. For each record you read in, you would need the following pseudo code:

        my $IP        = ...;    # extract from %data
        my $userAgent = ...;    # extract from %data
        my $date      = ...;    # extract from %data
        # Push onto the hash slot itself so the new array is stored in %hRequests;
        # { %data } pushes a copy, so a reused %data does not overwrite earlier records.
        push @{ $hRequests{$IP}{$userAgent}{$date} }, { %data };

        Then after you have read in all the requests, loop through %hRequests using the hash keys to select the requests that interest you. The pseudo code would look something like this:

        while (my ($IP, $hUserAgents) = each(%hRequests)) {
            # next if IP is boring
            while (my ($userAgent, $hDates) = each(%$hUserAgents)) {
                # next if user agent is boring
                while (my ($date, $aRequests) = each(%$hDates)) {
                    # do something if date is in range
                    # wanted for $IP, $userAgent
                }
            }
        }

        There's a lot of work navigating references to hashes and arrays here. As moritz said previously, studying perldata, perlref (or perlreftut) and perldsc might be well worth your time.
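
        For the "date is in range" step, one possible sketch is to turn the %t value into epoch seconds and compare the difference against a tolerance. The helper name apache_time_to_epoch and the 60-second window are assumptions made for illustration; only the core module Time::Local is used:

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Hypothetical helper: convert an Apache %t value such as
# "[19/Mar/2009:02:03:46 -0500]" to epoch seconds. Returns nothing on bad input.
sub apache_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $mon, $year, $h, $m, $s, $sign, $oh, $om) = $stamp =~
        m{^\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})\]\z}
        or return;
    return unless exists $MONTH{$mon};
    # Offset of local time from UTC, in seconds (negative west of Greenwich).
    my $offset = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return timegm($s, $m, $h, $day, $MONTH{$mon}, $year) - $offset;
}

my $window    = 60;    # assumed tolerance in seconds
my $get_time  = apache_time_to_epoch('[19/Mar/2009:02:03:46 -0500]');
my $post_time = apache_time_to_epoch('[19/Mar/2009:02:04:16 -0500]');
my $close     = abs($post_time - $get_time) <= $window;

print "requests are ", ($close ? "within" : "outside"), " the window\n";
```

        Working in epoch seconds also means that requests logged with different time zone offsets still compare correctly.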

        Best, beth

        For instance, I want to compare IP's in the get and post arrays to see if there are any matches

        Then it would make sense to store it in a hash of hashes, with the IP as the key. Then searching for common IPs is as simple as iterating over the keys of the first hash and looking them up in the second hash (which doesn't require another iteration). See for example perlfaq4, "How can I get the unique keys from two hashes?" for inspiration.
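
        A small sketch of that lookup, using made-up sample data. Here each IP maps to an array of records rather than a single hash, on the assumption that one IP can make several requests:

```perl
use strict;
use warnings;

# Hypothetical sample data: each hash is keyed by IP, with a list of records.
my %get_by_ip = (
    '10.0.0.1' => [ { request => 'GET /a HTTP/1.1' } ],
    '10.0.0.2' => [ { request => 'GET /b HTTP/1.1' } ],
);
my %post_by_ip = (
    '10.0.0.2' => [ { request => 'POST /form HTTP/1.1' } ],
    '10.0.0.3' => [ { request => 'POST /form HTTP/1.1' } ],
);

# One pass over the GET keys; each check against %post_by_ip is a hash
# lookup, not another loop.
my @common = grep { exists $post_by_ip{$_} } sort keys %get_by_ip;

for my $ip (@common) {
    printf "%s: %d GET, %d POST\n", $ip,
        scalar @{ $get_by_ip{$ip} }, scalar @{ $post_by_ip{$ip} };
}
```

        Only the records under the common IPs then need the more detailed user agent and timestamp checks.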

Re: Adding data to hashes & comparing
by pileofrogs (Priest) on Mar 30, 2009 at 21:11 UTC

    I'm not entirely sure, but I have the feeling you could do better with fewer lists and more hashes. In my experience, the only time a list/array is better than a hash is if the order of the items is important and you really don't have a usable key OR you're making a hash where multiple items have the same key, in which case you might want to use a hash of lists. Certainly, if you're going to search for items in the data structure, use a hash. If you want to make a list of things and later check if something is in that list, don't make it with an array, make it with a hash where all the values are just 1, or something like that. It's easier and faster. I'm sure that's a massive overgeneralization.

    I don't know how many times I've said "Okay, this should really be a list" and then later turned it into a hash.

    Instead of making a list of all your hashes, make a hash and use the value you want to compare as the key. Then you can just do something like if ( $hash_1{$foo} eq $hash_2{$foo} )....
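
    A quick sketch of the "hash where all the values are just 1" idea, with invented IP lists:

```perl
use strict;
use warnings;

# Hypothetical IP lists; duplicates in the source list are harmless.
my @post_ips = qw(10.0.0.2 10.0.0.3 10.0.0.2);
my @get_ips  = qw(10.0.0.1 10.0.0.2);

# Build the lookup once: every IP seen in the POST logs maps to 1.
my %in_post = map { $_ => 1 } @post_ips;

# Membership is now a single hash lookup per GET IP.
my @matched = grep { $in_post{$_} } @get_ips;

print "matched: @matched\n";
```

    When comparing values stored under a key in two hashes with eq, checking exists on both first avoids "uninitialized value" warnings for keys that are present in only one of them.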

      Thanks for the tips. I believe I am close, but not there yet. This section of code doesn't seem to be operating as I expect?
      if (%data) {
          # We have data to process
          while( my ($key, $value) = each(%data) ) {
              if($key =~ '%{User-Agent}i\""') {
                  $userAgent = $value;
              }
              if($key =~ '%t') {
                  $date = $value;
              }
          }
          $aRequests = $hRequests{$date}{$userAgent};
          push @$aRequests, \%data;
      }


      I am using eclipse and even though I breakpoint and see the hash keys %{User-Agent}i\"" & %t, those if statements are never satisfied. It loops through the entire hash, but the $key never changes from %{Referer}i.

      What am I doing wrong?

        Are you sure it's not that the pattern doesn't match?
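
        A sketch of one way around the pattern question, assuming the keys are exactly as shown in the Dumper output earlier in the thread (the values are shortened samples). With =~ the string is treated as a regex, so \" in the pattern stands for a plain double quote and will not match a key that contains a literal backslash. An exact hash lookup sidesteps both the regex and the loop:

```perl
use strict;
use warnings;

# Keys copied from the Dumper output in the question; values are shortened samples.
my %data = (
    '%t'                => '[19/Mar/2009:02:03:46 -0500]',
    '%{Referer}i'       => '-',
    '%{User-Agent}i\""' => 'Mozilla/4.0',
);

# Exact key lookup: no loop and no pattern matching needed.
my $ua_key    = '%{User-Agent}i\""';
my $userAgent = $data{$ua_key};
my $date      = $data{'%t'};

# If a pattern match is really wanted, \Q...\E escapes the metacharacters
# so the key is matched literally.
my @hits = grep { /\Q$ua_key\E/ } keys %data;

print "agent=$userAgent date=$date hits=", scalar(@hits), "\n";
```

        Changing the log format string so the keys come out as plain %{User-Agent}i would make the lookups simpler still.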

Re: Adding data to hashes & comparing
by planetscape (Chancellor) on Mar 31, 2009 at 07:48 UTC