Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello brothers, I have a massive squid log file (400MB). I need to count the number of occurrences of each URL string within that log. The log format is basically a series of lines; each line has a URL, timestamp, bytes, etc., separated by spaces. Parsing that is no problem, but what would be the most efficient method of counting the occurrences of each URL? I take it arrays are out of the question, but how would this work using temp files without making a mess? Please let me know... my job is on the line! Thanks -Burhan

Replies are listed 'Best First'.
Re: string occurrences
by lhoward (Vicar) on Jun 12, 2001 at 21:13 UTC
    Personally, I'd use an in-memory hash. A 400MB logfile, where only part of each line is a URL and there are a fair number of duplicates, will probably not use more than 200MB of RAM as a hash... but (as you suspected) if you don't have much RAM to spare, you'll need some sort of disk-based storage (a hash tied to a DBM file, etc.). Make the URL the hash key, and the hash value the # of occurrences. That way you'll only have to read the logfile once.
    open F, "<squid_logfile" or die "$!";
    my %counts;   # tie the hash to a DBM file or something similar if memory is a concern
    while (<F>) {
        my $url = ....;   # extract the url from a line of data and put it in $url
        $counts{$url} = 0 if !defined $counts{$url};
        $counts{$url}++;
    }
    close F;
    # do something with %counts to produce your report.
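
    A minimal sketch of the disk-backed variant, assuming the standard DB_File module is available (the filename url_counts.db is made up for illustration):

    use DB_File;
    use Fcntl;   # supplies the O_CREAT and O_RDWR flags

    # Tie %counts to an on-disk hash so the counts never have to fit in RAM.
    my %counts;
    tie %counts, 'DB_File', 'url_counts.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie url_counts.db: $!";

    # ... fill %counts exactly as above, then:
    untie %counts;   # flushes everything back to disk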
    -----------------------

    added later

    Since I always run with warnings and strict on, I can't get away with the "an undefined hash value is treated as 0 numerically" trick.

    Also, because of the way I am using the hash, the "defined" check is good enough: there will never be a hash entry whose value is undef.

      $counts{$url}=0 if !defined $counts{$url};
      That line is unnecessary. Autoincrement on an undefined hash key autovivifies the key and sets the value to 1.
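
      A quick demonstration (a minimal sketch; the URL is made up):

      use strict;
      use warnings;

      my %counts;
      $counts{'http://example.com/'}++;             # ++ treats the undefined value as 0,
      print $counts{'http://example.com/'}, "\n";   # so this prints 1, with no
                                                    # "uninitialized" warning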
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
Re: string occurrences
by Sifmole (Chaplain) on Jun 12, 2001 at 21:30 UTC
    Are you working on a Unix system? If so, you might want to use some of the available Unix tools. You could use "cut" to slice out the URL and feed the results to a file, which you could then "sort". Once that sort is complete, you can count the occurrences of each URL without having to store a large number of lines or create many temp files. After the sort you could do something like:
    # Untested
    my $current = '';
    my $count   = 0;
    while (<>) {
        chomp;
        if ( $_ ne $current ) {
            print "$current :: $count\n" if $current ne '';   # report the finished group
            $current = $_;
            $count   = 0;
        }
        $count++;
    }
    print "$current :: $count\n" if $current ne '';   # don't forget the final group
    You would invoke it at the command line as ./foo.pl < sorted.file > file.count

    Since the file is already sorted for you and contains only the URLs, all copies of each URL will be grouped together. Therefore, once the URL changes, you know that you are done counting that particular URL. There is no need to keep in memory any more than the current URL and the current count; once the URL changes, you dump out the count and move on to the next one.
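
    For example, the two preparation commands might look like this (a sketch assuming the URL is the third space-separated field of a file named access.log; adjust both to your setup):

    cut -d' ' -f3 access.log | sort > sorted.file
    ./foo.pl < sorted.file > file.count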

      As long as you're going with a shell solution, you could just use:

      cat <file(s)> | cut ... | sort | uniq -c

        Thanks! I now know about "uniq -c"; I did not before.
      PERFECT! This is exactly what I was looking for, thank you very much. I did not want to use arrays, as they began thrashing my VM. This seems the best way. -burhan
Re: string occurrences
by ckohl1 (Hermit) on Jun 12, 2001 at 22:45 UTC
    I think that this might be close to what you need:

    #!f:/perl/bin/perl.exe
    use strict;

    my $LogFile = 'var/log/squid/access.log';
    my %urls;
    my ( $internal_ip, $link_visited, $site_visited, $link_date,
         $human_date, $minute, $hour, $cached_line );

    open( FH, "<$LogFile" ) or die "Could not open $LogFile";
    while (<FH>) {
        if ( !/^#/ ) {
            ( $internal_ip, $link_visited, $site_visited, $link_date,
              $human_date, $minute, $hour, $cached_line ) = split;
            if ( defined( $urls{$link_visited} ) ) {
                $urls{$link_visited}++;
            }
            else {
                $urls{$link_visited} = 1;   # first occurrence counts as 1, not 0
            }
        }
    }
    close(FH);

    foreach my $key ( sort( keys(%urls) ) ) {
        print "$key:\t$urls{$key}\n";
    }
    exit;

    Update: Of course, right after I posted, I saw that my thoughts were shared by others (lhoward, ...).


    Chris
    'You can't get there from here.'
Re: string occurrences
by tigervamp (Friar) on Jun 13, 2001 at 03:45 UTC
    If you are running Unix, Linux, etc., and the URL is in, say, the 3rd field (delimited by spaces, as you said), then you can use simple shell tools, such as:
    cat logfile | awk '{print $3}' | sort | uniq -c > counts.txt

    This is the best approach; however, if your OS does not provide such great tools, or you just want a Perl solution, this will work:

    $occurrences{ (split)[2] }++ while (<>);
    foreach $url ( keys %occurrences ) {
        print "$occurrences{$url}\t$url\n";
    }
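
    If you would rather see the most frequent URLs first (mirroring what uniq -c | sort -rn gives you), a small variant of the report loop:

    # Variant: sort the report by descending count rather than hash order
    foreach $url ( sort { $occurrences{$b} <=> $occurrences{$a} } keys %occurrences ) {
        print "$occurrences{$url}\t$url\n";
    }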
    Change the subscript accordingly, then run the program:
    <prompt> perl -w programname logfiles
    tigervamp