Put it in a database...

I worked on this over the holidays. I got something kludged together that generated results in less than 4 minutes using only the ip id's that had been used for 100 or fewer gets. I achieved this by munging the cpan-gets logfile and turning it into a database dump file suitable for import into SQLite or PostgreSQL. Since the problem is fundamentally about munging out the significant data and querying it, using a structured querying language seemed a natural fit. The issue of counting core modules and recursive dependencies is outside the domain of what I chose to focus on.

I've attached my script. It takes the name of the log file as a parameter and generates a phalanx.dump file in the current working directory. You can then use a command like:

  cat phalanx.dump|sqlite3 phalanx.db

To create the database. It isn't particularly elegant, but it gets the job done.

Do you realize that 277,373 of the cpan-gets records are duplicates?

I broke each record out into separate vars for timestamp, ipaddr id, module, module version, and useragent. While doing this for each record, I generated id's for each unique module, module/version, and useragent string. Finally, I generated a unique get id using all of the above as the hash key. I.e. something like:

while (...) {
  ...
  if ($g{$ts, $ip_id, $mod_id, $modver_id, $agent_id}) {
    $duplicates++;
  } else {
    $g{$ts, $ip_id, $mod_id, $modver_id, $agent_id} = ++$gid;
  }
}

After processing cpan-gets, printing the value of $duplicates gives: 277373. I highly suspect cpan-gets has duplicate data. It isn't likely after all that the exact same version of a module was downloaded by the exact same IP and UserAgent at the exact same time...

Here's an SQL query to get the top 100:

SELECT Module.Name, COUNT(Gets.Module_id)
FROM Module
INNER JOIN Gets
   ON Module.ID = Gets.Module_id
INNER JOIN (
  SELECT IPAddr_id, COUNT(*) as Tally FROM Gets
  GROUP BY IPAddr_id HAVING Tally <= 100
) as gip
   ON Gets.IPAddr_id = gip.IPAddr_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_id) DESC;

I'd never used SQLite directly before, so I wasn't sure how best to twiddle the indexes to give it reasonable performance. I was very disappointed with the performance of queries using simple joins. Without indexes, the performance is awful. Using the sub-select makes it churn long enough for the powersaving setting on my laptop to kick into hibernation. Probably yanking the sub-select out into a temporary table and putting indexes on that would give better results, but I ran out of time. So, if you're using SQLite, you might want to run the simpified version that follows.

SELECT Module.Name, COUNT(Gets.Module_id)
FROM Module
INNER JOIN Gets
   ON Module.ID = Gets.Module_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_id) DESC;

The nice thing about putting the data into a database, is that it then becomes simple for people to tinker with and generate their own queries...

Code follows...

#!/usr/bin/perl -s -w

our (%month);
@month{qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} = (1..12);
my ($id, $ts, $path, $agent, $ext, $module, $version);

my (%ip, %a, %m, %mv, %g);
my ($aid, $mid, $mvid, $gid);
$aid=$mid=$mvid=$gid=0;

my $count=0;
my $dups=0;
while (<>) {
    next unless m/^(\S+) (\S+) (\S+) "(.+)"$/;
    ($id,$ts,$path,$agent) = ($1,$2,$3,$4);

    $agent=~s/'/''/g;
    next unless $path =~ m/\.(tar\.gz|par|tgz|zip|pm|pm\.gz)$/;
    $ext = $1;

    # DD/MMM/YYYY:HH:mm:SS => YYYY/MM/DD HH:mm:SS
    $ts =~ s/(\d{2})\/(\w{3})\/(\d{4}):(.+)$/$3\/$month{$2}\/$1 $4/;

    # /authors/id/S/SP/SPACEMAN/WebJay.pm
    ($module) = $path =~ m/(^\/+)\Q.$ext\E$/;
    ($version) = $module =~ m/(?:-(^-+))$/;
    $module =~ s/(?:-(^-+))$//;
    $module =~ s/\.pm$//;
    $version ||= 'undef';

    $ip{$id}||=$id;
    $a{$agent}||=++$aid;
    $m{$module}||=++$mid;
    $mv{$m{$module},$version}||=++$mvid;
    if ($g{$ts,$id,$mid,$mvid,$aid}) {
       $dups++;
    } else {
      $g{$ts,$id,$mid,$mvid,$aid}=++$gid;
    }

    if ($count++>=1000) {
	$count=0;
	print '.';
    }
}
       print "$dups duplicates exist\n";

open OUT, '>phalanx.db';

print OUT <<EOF;
BEGIN TRANSACTION;
CREATE TABLE IPAddr (ID INTEGER PRIMARY KEY);
EOF
print OUT <<"EOF"  for sort {$a<=>$b} keys %ip;
INSERT INTO "IPAddr" VALUES($_);
EOF

print OUT "CREATE TABLE Agent (ID INTEGER PRIMARY KEY, Desc VARCHAR(256));\n";
print OUT <<"EOF"  for sort {$a{$a}<=>$a{$b}} keys %a;
INSERT INTO "Agent" VALUES($a{$_}, '$_');
EOF

print OUT "CREATE TABLE Module (ID INTEGER PRIMARY KEY, Name VARCHAR(256));\n";
print OUT <<"EOF"  for sort {$m{$a}<=>$m{$b}} keys %m;
INSERT INTO "Module" VALUES($m{$_}, '$_');
EOF

my $chr28=chr(28);

print OUT "CREATE TABLE ModVer (ID INTEGER PRIMARY KEY, Module_id INTEGER,\n     Version VARCHAR(16));\n";
for my $key (sort {$mv{$a}<=>$mv{$b}} keys %mv) {
    my ($mid,$version) = split /$chr28/, $key;
    my ($mvid) = $mv{$key};
    print OUT <<"EOF";
INSERT INTO "ModVer" VALUES($mvid, $mid, '$version');
EOF
}

#    $g{$ts,$id,$mid,$mvid,$aid}||=++$gid;
print OUT "CREATE TABLE Gets (ID INTEGER PRIMARY KEY, ts TIMESTAMP, IPAddr_id INTEGER,\n   Module_id INTEGER, ModVer_id INTEGER, Agent_id INTEGER);\n";
for my $key (sort {$g{$a}<=>$g{$b}} keys %g) {
    my ($ts,$id,$mid,$mvid,$aid) = split /$chr28/, $key;
    my ($gid) = $g{$key};
    print OUT <<"EOF";
INSERT INTO "Gets" VALUES($gid, '$ts', $id, $mid, $mvid, $aid);
EOF
}

print OUT <<EOF;
CREATE INDEX fk_module_id ON Gets (Module_id);
COMMIT;
EOF

__END__

SELECT Module.Name, COUNT(Gets.Module_ID)
FROM Module
INNER JOIN Gets
   ON Module.ID = Gets.Module_id
INNER JOIN (
  SELECT IPAddr_id, COUNT(*) as Tally FROM Gets
  GROUP BY IPAddr_id HAVING Tally <= 100
) as gip
   ON Gets.IPAddr_id = gip.IPAddr_id
GROUP BY Module.Name
ORDER BY COUNT(Gets.Module_ID) ASC;

SELECT IPAddr_ID, COUNT(*) FROM Gets GROUP BY IPAddr_ID ORDER BY IPAddr_id ASC;

Comment on Put it in a database...