Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello brothers, I have a massive squid log file (400MB). I need to count the number of occurrences of each URL string within that log. The log format is basically a series of lines; each line has a URL, timestamp, bytes, etc., separated by spaces. Parsing that is no problem, but what would be the most efficient method of counting the occurrences of each URL? I take it arrays are out of the question, but how would this work using temp files without making a mess? Please let me know... my job is on the line! Thanks -Burhan

Replies are listed 'Best First'.
Re: string occurrences
by lhoward (Vicar) on Jun 12, 2001 at 21:13 UTC
    Personally, I'd use an in-memory hash. A 400MB logfile, where only part of each line is a URL and there are a fair number of duplicates, will probably not use more than 200MB of RAM as a hash... but (as you suspected) if you don't have much RAM to spare, you'll need some sort of disk-based storage (a hash tied to a DBM file, etc.). Make the URL the hash key, and the hash value the # of occurrences. That way you'll only have to read the logfile once.
    open F, "<squid_logfile" or die "$!";
    my %counts;   # tie the hash to a DBM file or something similar if memory is a concern
    while (<F>) {
        my $url = ....;   # extract the url from a line of data and put it in $url
        $counts{$url} = 0 if !defined $counts{$url};
        $counts{$url}++;
    }
    close F;
    # do something with %counts to produce your report.
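
    A minimal sketch of the disk-backed variant, assuming the standard DB_File module is available (the filename url_counts.db is made up for illustration):

    use DB_File;
    use Fcntl;   # supplies the O_CREAT and O_RDWR flags

    # Tie %counts to an on-disk hash so the counts never have to fit in RAM.
    my %counts;
    tie %counts, 'DB_File', 'url_counts.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie url_counts.db: $!";

    # ... fill %counts exactly as above, then:
    untie %counts;   # flushes everything back to disk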
    -----------------------

    added later

    Since I always run with warnings and strict on, I can't get away with the "an undefined hash value is treated as 0 numerically" trick.

    Also, because of the way I am using the hash, the "defined" check is good enough: there will never be a hash entry whose value is undef.

      $counts{$url}=0 if !defined $counts{$url};
      That line is unnecessary. Autoincrement on an undefined hash key autovivifies the key and sets the value to 1.
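
      A quick demonstration (a minimal sketch; the URL is made up):

      use strict;
      use warnings;

      my %counts;
      $counts{'http://example.com/'}++;             # ++ treats the undefined value as 0,
      print $counts{'http://example.com/'}, "\n";   # so this prints 1, with no
                                                    # "uninitialized" warning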
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
Re: string occurrences
by Sifmole (Chaplain) on Jun 12, 2001 at 21:30 UTC
    Are you working on a Unix system? If so, you might want to use some of the available Unix tools. You could use "cut" to slice out the URL and feed the results to a file, which you could then "sort". Once that sort is complete, you can count the occurrences of each URL without having to store a large number of lines or create many temp files. After the sort you could do something like:
    # Untested
    my $current = '';
    my $count   = 0;
    while (<>) {
        chomp;
        if ( $_ ne $current ) {
            print "$current :: $count\n" if $current ne '';   # report the finished group
            $current = $_;
            $count   = 0;
        }
        $count++;
    }
    print "$current :: $count\n" if $current ne '';   # don't forget the final group
    You would invoke it at the command line as ./foo.pl < sorted.file > file.count

    Since the file is already sorted for you and contains only the URLs, all copies of each URL will be grouped together. Therefore, once the URL changes, you know that you are done counting that particular URL. There is no need to keep in memory any more than the current URL and the current count; once the URL changes, you dump out the count and move on to the next one.
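
    For example, the two preparation commands might look like this (a sketch assuming the URL is the third space-separated field of a file named access.log; adjust both to your setup):

    cut -d' ' -f3 access.log | sort > sorted.file
    ./foo.pl < sorted.file > file.count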

      As long as you're going with a shell solution, you could just use:

      cat <file(s)> | cut ... | sort | uniq -c

        Thanks! I now know about "uniq -c"; I did not before.
      PERFECT! This is exactly what I was looking for, thank you very much. I did not want to use arrays, as they began thrashing my VM. This seems the best way. -burhan
Re: string occurrences
by ckohl1 (Hermit) on Jun 12, 2001 at 22:45 UTC
    I think that this might be close to what you need:

    #!f:/perl/bin/perl.exe
    use strict;

    my $LogFile = 'var/log/squid/access.log';
    my %urls;
    my ( $internal_ip, $link_visited, $site_visited, $link_date,
         $human_date, $minute, $hour, $cached_line );

    open( FH, "<$LogFile" ) or die "Could not open $LogFile";
    while (<FH>) {
        if ( !/^#/ ) {
            ( $internal_ip, $link_visited, $site_visited, $link_date,
              $human_date, $minute, $hour, $cached_line ) = split;
            if ( defined( $urls{$link_visited} ) ) {
                $urls{$link_visited}++;
            }
            else {
                $urls{$link_visited} = 1;   # first occurrence counts as 1, not 0
            }
        }
    }
    close(FH);

    foreach my $key ( sort( keys(%urls) ) ) {
        print "$key:\t$urls{$key}\n";
    }
    exit;

    Update: Of course, right after I posted, I saw that my thoughts were shared by others (lhoward, ...).


    Chris
    'You can't get there from here.'
Re: string occurrences
by tigervamp (Friar) on Jun 13, 2001 at 03:45 UTC
    If you are running Unix, Linux, etc., and the URL is in, say, the 3rd field (delimited by spaces, as you said), then you can use simple shell tools, such as:
    cat logfile | awk '{print $3}' | sort | uniq -c > counts.txt

    This is the best approach; however, if your OS does not provide such great tools, or you just want a Perl solution, this will work:

    $occurrences{ (split)[2] }++ while (<>);
    foreach $url ( keys %occurrences ) {
        print "$occurrences{$url}\t$url\n";
    }
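
    If you would rather see the most frequent URLs first (mirroring what uniq -c | sort -rn gives you), a small variant of the report loop:

    # Variant: sort the report by descending count rather than hash order
    foreach $url ( sort { $occurrences{$b} <=> $occurrences{$a} } keys %occurrences ) {
        print "$occurrences{$url}\t$url\n";
    }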
    Change the subscript accordingly, then run the program:
    <prompt> perl -w programname logfiles
    tigervamp