
Greetings :)

This is a description of a problem that I have already solved, so I'm not looking for an answer so much as for ways to optimise the solution. Not because I necessarily need to (after all, the problem is already solved), but because I'm not sure whether the approach I used was a "sensible" one ;)

I have been participating in a game of idlerpg on the OzOrg IRC Network for about 3 years now. Over the recent Christmas break, the bot died :(

It was several days before one of the admins was able to fix it, and when he did, he restored from a backup that was about 12 months old. This was mildly annoying, because it meant that all the "hard work" we'd all done over the past 12 months was down the drain :/

However, one of the participants in the game has been taking dumps of the database every 5 minutes for several months now, and we have a backup of the data from right before the bot died. This means that rather than everybody having to go backwards 12 months, we can resume the game from where we were. The challenge is to find the correct dump to restore from.

All the dumps are tab-delimited flat files, and they all reside in a single directory. They are named dump.X, where X is the unix timestamp of the time the dump was taken. There are currently around 57,000 files. Within each file there is one line per player, and on each line there are several "fields" representing information about that player.
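Given the dump.X naming scheme, the newest-to-oldest ordering can be done numerically on X rather than on the whole name as a string. The following is only a minimal sketch: `newest_first` is a hypothetical helper name, and the sample names are made up for illustration. It also skips anything that does not look like dump.<digits>, such as the '.' and '..' entries that readdir returns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the dump file names from a list, newest first.
# Names that are not of the form dump.<digits> (e.g. '.' and '..') are dropped.
sub newest_first {
    my @names = @_;
    return map  { $_->[1] }                          # keep only the name
           sort { $b->[0] <=> $a->[0] }              # numeric, descending timestamp
           map  { /^dump\.(\d+)$/ ? [ $1, $_ ] : () } # pair timestamp with name
           @names;
}

# Made-up sample names, not real dumps.
my @sorted = newest_first('.', '..', 'dump.999999999',
                          'dump.1167332700', 'dump.1167332400');
print "$_\n" for @sorted;
```

A plain string sort gives the same order as long as every timestamp has the same number of digits; the numeric comparison only matters if that ever stops being true.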

To determine the correct file, I used the following information:

- my player name is McDarren
- prior to the crash, I was on Level 71
- after the game was restored from old data, I was back on Level 68
- Each player's nick is the 1st field on each line, and the player's level is the 4th

So to determine the correct file to restore from, it's just a matter of sorting the files from newest to oldest, picking my level from each file, and continuing until I find the first file where I was on Level 71.

Here is what I used:

#!/usr/bin/perl -wl
use strict;

my $dir = '/home/idlerpg/graphdump';

opendir(DIR, $dir) or die "Cannot open $dir:$!";

my @files = reverse sort readdir(DIR);
my $currlevel = 68;
my $numfiles;
my $numlines;
my $totalfiles = scalar @files;

FILE:
for my $file (@files) {
    open IN, '<', "$dir/$file" or die "Ack!:$!";
    $numfiles++;
    LINE:
    while (<IN>) {
        $numlines++;
        chomp();
        my ($user,$level) = (split /\t/)[0,3];
        next LINE if $user ne 'McDarren';
        next FILE if $level <= $currlevel;
        print "$file $user $level";
        print "Processed $numlines lines in $numfiles files (total files:$totalfiles)";
        exit;
    }
}

The output of the above is:

$ time ./foo.pl
dump.1167332700 McDarren 71
Processed 844181 lines in 9030 files (total files:57844)
       48.85 real        10.10 user         0.66 sys

So now to my question: I was a bit surprised that it took so long to run. So how could I optimise that to run faster?

(This is on a FreeBSD system, P4 2.4GHz with 1GB RAM)
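For what it's worth, the timing above shows roughly 11 seconds of CPU (user + sys) against about 49 seconds of real time, which suggests most of the wait is for the disk rather than for Perl, so changes to the code can probably only trim the smaller part. One possible tweak on the CPU side, sketched under assumptions: test each line for the nick with a cheap prefix check, and only split the one line that matches. `level_in_file` is a hypothetical helper name, and the sketch assumes a nick appears at most once per file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the level of $nick in one dump file,
# or undef if the nick is not present in that file.
# Assumes tab-delimited lines with the nick in field 1 and the level in field 4.
sub level_in_file {
    my ($path, $nick) = @_;
    my $prefix = "$nick\t";
    open my $in, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        # Cheap prefix test first; skip lines for other players without splitting.
        next unless index($line, $prefix) == 0;
        chomp $line;
        # A limit of 5 stops split from breaking up fields past the 4th.
        my $level = (split /\t/, $line, 5)[3];
        close $in;
        return $level;
    }
    close $in;
    return;
}
```

Inside the main loop this would be called as `level_in_file("$dir/$file", 'McDarren')`, moving on to the next file whenever the result is undefined or not above the current level.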

Thanks,
Darren :)

In reply to Optimising a search of several thousand files by McDarren
