in reply to Help update the Phalanx 100
I worked on this over the holidays. I got something kludged together that generated results in less than 4 minutes using only the ip id's that had been used for 100 or fewer gets. I achieved this by munging the cpan-gets logfile and turning it into a database dump file suitable for import into SQLite or PostgreSQL. Since the problem is fundamentally about munging out the significant data and querying it, using a structured querying language seemed a natural fit. The issue of counting core modules and recursive dependencies is outside the domain of what I chose to focus on.
I've attached my script. It takes the name of the log file as a parameter and generates a phalanx.dump file in the current working directory. You can then use a command like:
cat phalanx.dump|sqlite3 phalanx.db
To create the database. It isn't particularly elegant, but it gets the job done.
Do you realize that 277,373 of the cpan-gets records are duplicates?
I broke each record out into separate vars for timestamp, ipaddr id, module, module version, and useragent. While doing this for each record, I generated id's for each unique module, module/version, and useragent string. Finally, I generated a unique get id using all of the above as the hash key. I.e. something like:
while (...) {
...
if ($g{$ts, $ip_id, $mod_id, $modver_id, $agent_id}) {
$duplicates++;
} else {
$g{$ts, $ip_id, $mod_id, $modver_id, $agent_id} = ++$gid;
}
}
After processing cpan-gets, printing the value of $duplicates gives: 277373. I highly suspect cpan-gets has duplicate data. It isn't likely after all that the exact same version of a module was downloaded by the exact same IP and UserAgent at the exact same time...
Here's an SQL query to get the top 100:
SELECT Module.Name, COUNT(Gets.Module_id) FROM Module INNER JOIN Gets ON Module.ID = Gets.Module_id INNER JOIN ( SELECT IPAddr_id, COUNT(*) as Tally FROM Gets GROUP BY IPAddr_id HAVING Tally <= 100 ) as gip ON Gets.IPAddr_id = gip.IPAddr_id GROUP BY Module.Name ORDER BY COUNT(Gets.Module_id) DESC;
I'd never used SQLite directly before, so I wasn't sure how best to twiddle the indexes to give it reasonable performance. I was very disappointed with the performance of queries using simple joins. Without indexes, the performance is awful. Using the sub-select makes it churn long enough for the powersaving setting on my laptop to kick into hibernation. Probably yanking the sub-select out into a temporary table and putting indexes on that would give better results, but I ran out of time. So, if you're using SQLite, you might want to run the simpified version that follows.
SELECT Module.Name, COUNT(Gets.Module_id) FROM Module INNER JOIN Gets ON Module.ID = Gets.Module_id GROUP BY Module.Name ORDER BY COUNT(Gets.Module_id) DESC;
The nice thing about putting the data into a database, is that it then becomes simple for people to tinker with and generate their own queries...
Code follows...
#!/usr/bin/perl -s -w
our (%month);
@month{qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} = (1..12);
my ($id, $ts, $path, $agent, $ext, $module, $version);
my (%ip, %a, %m, %mv, %g);
my ($aid, $mid, $mvid, $gid);
$aid=$mid=$mvid=$gid=0;
my $count=0;
my $dups=0;
while (<>) {
next unless m/^(\S+) (\S+) (\S+) "(.+)"$/;
($id,$ts,$path,$agent) = ($1,$2,$3,$4);
$agent=~s/'/''/g;
next unless $path =~ m/\.(tar\.gz|par|tgz|zip|pm|pm\.gz)$/;
$ext = $1;
# DD/MMM/YYYY:HH:mm:SS => YYYY/MM/DD HH:mm:SS
$ts =~ s/(\d{2})\/(\w{3})\/(\d{4}):(.+)$/$3\/$month{$2}\/$1 $4/;
# /authors/id/S/SP/SPACEMAN/WebJay.pm
($module) = $path =~ m/(^\/+)\Q.$ext\E$/;
($version) = $module =~ m/(?:-(^-+))$/;
$module =~ s/(?:-(^-+))$//;
$module =~ s/\.pm$//;
$version ||= 'undef';
$ip{$id}||=$id;
$a{$agent}||=++$aid;
$m{$module}||=++$mid;
$mv{$m{$module},$version}||=++$mvid;
if ($g{$ts,$id,$mid,$mvid,$aid}) {
$dups++;
} else {
$g{$ts,$id,$mid,$mvid,$aid}=++$gid;
}
if ($count++>=1000) {
$count=0;
print '.';
}
}
print "$dups duplicates exist\n";
open OUT, '>phalanx.db';
print OUT <<EOF;
BEGIN TRANSACTION;
CREATE TABLE IPAddr (ID INTEGER PRIMARY KEY);
EOF
print OUT <<"EOF" for sort {$a<=>$b} keys %ip;
INSERT INTO "IPAddr" VALUES($_);
EOF
print OUT "CREATE TABLE Agent (ID INTEGER PRIMARY KEY, Desc VARCHAR(256));\n";
print OUT <<"EOF" for sort {$a{$a}<=>$a{$b}} keys %a;
INSERT INTO "Agent" VALUES($a{$_}, '$_');
EOF
print OUT "CREATE TABLE Module (ID INTEGER PRIMARY KEY, Name VARCHAR(256));\n";
print OUT <<"EOF" for sort {$m{$a}<=>$m{$b}} keys %m;
INSERT INTO "Module" VALUES($m{$_}, '$_');
EOF
my $chr28=chr(28);
print OUT "CREATE TABLE ModVer (ID INTEGER PRIMARY KEY, Module_id INTEGER,\n Version VARCHAR(16));\n";
for my $key (sort {$mv{$a}<=>$mv{$b}} keys %mv) {
my ($mid,$version) = split /$chr28/, $key;
my ($mvid) = $mv{$key};
print OUT <<"EOF";
INSERT INTO "ModVer" VALUES($mvid, $mid, '$version');
EOF
}
# $g{$ts,$id,$mid,$mvid,$aid}||=++$gid;
print OUT "CREATE TABLE Gets (ID INTEGER PRIMARY KEY, ts TIMESTAMP, IPAddr_id INTEGER,\n Module_id INTEGER, ModVer_id INTEGER, Agent_id INTEGER);\n";
for my $key (sort {$g{$a}<=>$g{$b}} keys %g) {
my ($ts,$id,$mid,$mvid,$aid) = split /$chr28/, $key;
my ($gid) = $g{$key};
print OUT <<"EOF";
INSERT INTO "Gets" VALUES($gid, '$ts', $id, $mid, $mvid, $aid);
EOF
}
print OUT <<EOF;
CREATE INDEX fk_module_id ON Gets (Module_id);
COMMIT;
EOF
__END__
SELECT Module.Name, COUNT(Gets.Module_ID)
FROM Module
INNER JOIN Gets
ON Module.ID = Gets.Module_id
INNER JOIN (
SELECT IPAddr_id, COUNT(*) as Tally FROM Gets
GROUP BY IPAddr_id HAVING Tally <= 100
) as gip
ON Gets.IPAddr_id = gip.IPAddr_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_ID) ASC;
SELECT IPAddr_ID, COUNT(*) FROM Gets GROUP BY IPAddr_ID ORDER BY IPAddr_id ASC;
|
|---|