in reply to string occurences

Personally, I'd use an in-memory hash. For a 400MB logfile where only part of each line is a URL, and with a fair number of duplicates, the hash will probably not use more than 200MB of RAM... but (as you suspected) if you don't have much RAM to spare you'll need some sort of disk-based storage (a hash tied to a DBM file, etc.). Make the URL the hash key and the number of occurrences the hash value. That way you'll only have to read the logfile once.
open F,"<squid_logfile" or die "$!"; my %counts; #tie the file to a DB hash or something similar if memory is a concern while(<F>){ my $url=.... #extract url from a line of data and put it in $url $counts{$url}=0 if !defined $counts{$url}; $counts{$url}++; } close F; #do something with %counts to produce your report.
-----------------------

added later

Since I always run with warnings and strict on, I can't get away with the "an undefined hash value is treated as 0 numerically" trick.

Also, because of the way I am using the hash, the defined check is good enough: no entry is ever stored with an undef value, so checking defined is equivalent to checking that the key exists.
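A quick sketch of that guarded pattern under strict and warnings, with made-up example URLs:

    use strict;
    use warnings;

    my %counts;
    for my $url (qw(http://a.example/ http://b.example/ http://a.example/)) {
        $counts{$url} = 0 if !defined $counts{$url};   # explicit initialization
        $counts{$url}++;
    }
    print "$_ => $counts{$_}\n" for sort keys %counts;
    # prints:
    # http://a.example/ => 2
    # http://b.example/ => 1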

Replies are listed 'Best First'.
Re: Re: string occurences
by MeowChow (Vicar) on Jun 12, 2001 at 21:20 UTC
    $counts{$url}=0 if !defined $counts{$url};
    That line is unnecessary. Autoincrement on an undefined hash key autovivifies the key and sets the value to 1.
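    For example (a minimal check; as far as I can tell, ++ is one of the operators exempt from "uninitialized" warnings, so this holds even under strict and warnings):

        use strict;
        use warnings;

        my %counts;
        $counts{'http://example.com/'}++;             # autovivifies the key, sets value to 1
        print "$counts{'http://example.com/'}\n";     # prints 1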
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print