in reply to string occurences

Personally, I'd use an in-memory hash. For a 400MB logfile where only part of each line is a URL, and with a fair number of duplicates, the hash will probably not use more than 200MB of RAM... but (as you suspected) if you don't have much RAM to spare you'll need some sort of disk-based storage (a hash tied to a DBM file, etc.). Make the URL the hash key and the number of occurrences the hash value. That way you'll only have to read the logfile once.
open F,"<squid_logfile" or die "$!"; my %counts; #tie the file to a DB hash or something similar if memory is a concern while(<F>){ my $url=.... #extract url from a line of data and put it in $url $counts{$url}=0 if !defined $counts{$url}; $counts{$url}++; } close F; #do something with %counts to produce your report.
-----------------------

added later

Since I always run with warnings and strict on, I can't get away with the "an undefined hash value is treated as 0 numerically" trick.

Also, because of the way I am using the hash, the defined check is good enough: no entry is ever stored with an undef value, so checking defined is equivalent to checking that the key exists.
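A quick sketch of that guarded pattern under strict and warnings, with made-up example URLs:

    use strict;
    use warnings;

    my %counts;
    for my $url (qw(http://a.example/ http://b.example/ http://a.example/)) {
        $counts{$url} = 0 if !defined $counts{$url};   # explicit initialization
        $counts{$url}++;
    }
    print "$_ => $counts{$_}\n" for sort keys %counts;
    # prints:
    # http://a.example/ => 2
    # http://b.example/ => 1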

Replies are listed 'Best First'.
Re: Re: string occurences
by MeowChow (Vicar) on Jun 12, 2001 at 21:20 UTC
    $counts{$url}=0 if !defined $counts{$url};
    That line is unnecessary. Autoincrement on an undefined hash key autovivifies the key and sets the value to 1.
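    For example (a minimal check; as far as I can tell, ++ is one of the operators exempt from "uninitialized" warnings, so this holds even under strict and warnings):

        use strict;
        use warnings;

        my %counts;
        $counts{'http://example.com/'}++;             # autovivifies the key, sets value to 1
        print "$counts{'http://example.com/'}\n";     # prints 1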
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print