McDarren has asked for the wisdom of the Perl Monks concerning the following question:

Greetings :)

This is a description of a problem that I have already solved, so I'm not looking so much for an answer - but rather ways to optimise the solution. Not because I necessarily need to - after all the problem is already solved - but more because I'm not sure if the approach I used was a "sensible" one ;)

I have been participating in a game of idlerpg on the OzOrg IRC Network for about 3 years now. Over the recent Christmas break, the bot died :(

It was several days before one of the admins was able to fix it, and when he did he restored from a backup that was about 12 months old. This was mildly annoying, because it meant that all the "hard work" we'd all done over the past 12 months was down the drain :/

However, one of the participants of the game has been taking 5-minute dumps of the database for several months now, and we have a backup of the data from the time right before the bot died. This means that, rather than everybody having to go back 12 months, we can resume the game from where we were. The challenge is to find the correct dump to restore from.

All the dumps are tab-delimited flat files, and they all reside in a single directory. They are named dump.X, where X is the Unix timestamp of the time the dump was taken. There are currently around 57,000 files. Within each file there is one line per player, and each line contains several "fields" of information about that player.

To determine the correct file, I used the following information: at the time the bot died I was on Level 71, whereas after the restore from the 12-month-old backup I'm back on Level 68.

So it's just a matter of sorting the files from newest to oldest, checking my level in each file, and continuing until I find the first file where I was on Level 71 - that's the one to restore from.

Here is what I used:

#!/usr/bin/perl -wl
use strict;

my $dir = qw(/home/idlerpg/graphdump);
opendir(DIR, $dir) or die "Cannot open $dir:$!";
my @files = reverse sort readdir(DIR);

my $currlevel  = 68;
my $numfiles;
my $numlines;
my $totalfiles = scalar @files;

FILE: for my $file (@files) {
    open IN, '<', "$dir/$file" or die "Ack!:$!";
    $numfiles++;
    LINE: while (<IN>) {
        $numlines++;
        chomp();
        my ($user, $level) = (split /\t/)[0, 3];
        next LINE if $user ne 'McDarren';
        next FILE if $level <= $currlevel;
        print "$file $user $level";
        print "Processed $numlines lines in $numfiles files (total files:$totalfiles)";
        exit;
    }
}
The output of the above is:
$ time ./foo.pl
dump.1167332700 McDarren 71
Processed 844181 lines in 9030 files (total files:57844)
48.85 real 10.10 user 0.66 sys
Now to my question: I was a bit surprised that it took so long to run. How could I optimise it to run faster?

(This is on a FreeBSD system, P4 2.4GHz with 1GB RAM)

Thanks,
Darren :)

Re: Optimising a search of several thousand files
by bobf (Monsignor) on Jan 29, 2007 at 04:35 UTC

    I'd suggest using a binary search on @files to determine when your level change occurred, rather than reading backwards through time by 5 minute increments.

    Alternatively, you could start by jumping backwards in time by 1 week intervals, then when you find the level change you could move forward in 1 day intervals, then backwards in 1 hour intervals, then forward in 5 minute intervals.

    You could combine a date filter with either approach. Since you know the bot died around Christmas, it seems silly to test the approximately 8000 files between today and Jan 1, 2007 (or those created prior to about Dec 15, 2006). If you narrow your search window to a week or two, I'd be surprised if any method took very long. :-)
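
    A rough sketch of the binary-search idea (untested). It assumes @files holds only dumps taken before the restore, sorted oldest to newest, so the level never decreases across the window ($dir and @files otherwise as in the script above); level_in() is a hypothetical helper that returns McDarren's level in one dump:

        # Hypothetical helper: McDarren's level in a single dump file
        sub level_in {
            my ($path) = @_;
            open my $fh, '<', $path or die "Cannot open $path: $!";
            while (<$fh>) {
                chomp;
                my ($user, $level) = (split /\t/)[0, 3];
                return $level if defined $user and $user eq 'McDarren';
            }
            return 0;    # player not found in this dump
        }

        # Classic lower-bound search: first dump in which the level is 71
        my ($lo, $hi) = (0, $#files);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if (level_in("$dir/$files[$mid]") >= 71) {
                $hi = $mid;        # level 71 reached here or earlier
            } else {
                $lo = $mid + 1;    # still below 71, look at later dumps
            }
        }
        print "$files[$lo]\n";     # ~16 file reads instead of thousands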

      heh, yeah....

      Simply by skipping any files that are less than 20 days old, I get an answer within 5 seconds :)

      $ time ./bobf.pl
      dump.1167332700 McDarren 71
      Processed 367155 lines in 3298 files (total files:57872)
      4.41 real 4.20 user 0.14 sys
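
      For reference, the skip was just a timestamp test at the top of the FILE loop, something like this (the 20-day cutoff is simply the value I picked):

          my $now    = time;
          my $cutoff = 20 * 24 * 60 * 60;            # 20 days, in seconds

          FILE: for my $file (@files) {
              # also weeds out the '.' and '..' entries from readdir
              my ($stamp) = $file =~ /^dump\.(\d+)$/ or next FILE;
              next FILE if $stamp > $now - $cutoff;  # too new - skip it
              # ... rest of the loop as before
          }
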
Re: Optimising a search of several thousand files
by GrandFather (Saint) on Jan 29, 2007 at 04:18 UTC

    If the individual files are of modest size (for some suitable definition of "modest") then slurping the file and using index to search for 'McDarren' may provide a useful speed improvement.


    DWIM is Perl's answer to Gödel
      Thanks, I gave that a go...
      #!/usr/bin/perl -wl
      use strict;

      my $dir = qw(/home/idlerpg/graphdump);
      opendir(DIR, $dir) or die "Cannot open $dir:$!";
      my @files = reverse sort readdir(DIR);

      my $currlevel  = 68;
      my $numfiles;
      my $totalfiles = scalar @files;

      FILE: for my $file (@files) {
          open IN, '<', "$dir/$file" or die "Ack!:$!";
          $numfiles++;
          undef $/;
          my $data = <IN>;
          my $pos  = index($data, 'McDarren');
          $/ = "\n";
          next FILE if $pos == -1;
          seek(IN, $pos, 0);
          chomp(my $line = <IN>);
          my ($user, $level) = (split /\t/, $line)[0, 3];
          next FILE if $level <= $currlevel;
          print "$file $user $level";
          print "Processed $numfiles files (total files:$totalfiles)";
          exit;
      }
      Which produced:
      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9054 files (total files:57868)
      39.15 real 0.84 user 0.75 sys
      A slight improvement, but not a great deal. But I had to perldoc for index and seek as I've not used either function before, so my implementation may be a bit wonky ;)

      However, it gave me another approach to take, which was really the whole point of my post in the first place.

      (Note that the number of files processed is slightly more, as the dumps continue to accumulate every 5 mins)

        Urk! What is that seek doing in there? You already have the line in $data and the start point in $pos. (split /\t/, substr $data, $pos)[0,3] ought to do the job. It may be faster to constrain split to just finding the first 4 elements, but I'd have to benchmark that.
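
        i.e. something like this (untested; the limit of 5 is just enough to reach the fourth field, assuming each player line has at least five tab-separated fields):

            my $pos = index($data, 'McDarren');
            next FILE if $pos == -1;

            # split only the tail of the slurped file, starting at the match;
            # the limit stops split from chewing through the whole remainder
            my ($user, $level) = (split /\t/, substr($data, $pos), 5)[0, 3];
            next FILE if $level <= $currlevel;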


        DWIM is Perl's answer to Gödel
Re: Optimising a search of several thousand files
by Trizor (Pilgrim) on Jan 29, 2007 at 04:23 UTC
    I don't see why you had to go through all that work. Were these full dumps or simply deltas? If they were full dumps, and they were all on the same system whose clock was never changed, then why couldn't you simply reverse sort the directory and pick the newest? If they are deltas, then wouldn't it be easier to simply apply the deltas from oldest to newest?
      They are full dumps. Picking the most recent file doesn't work because the dumps continued to be taken after the initial crash and restoration. So what I needed to do was find the last dump that was taken before the crash.
Re: Optimising a search of several thousand files
by ysth (Canon) on Jan 29, 2007 at 05:59 UTC
    For each file, set $/ to "\nMcDarren\t", read a line, set $/ back to "\n", read another line, and decide if this is the right file?

    Though I'd probably actually do something like: ls -t|xargs egrep '^McDarren.*71'|head -1

      I gave that a go.
      FILE: for my $file (@files) {
          my ($stamp) = $file =~ /dump.(.*)/;
          next FILE if $stamp > ($now - $start);
          $numfiles++;
          $/ = "\nMcDarren\t";
          open IN, '<', "$dir/$file" or die "Ack!:$!";
          my $data = <IN>;
          $/ = "\n";
          $data = <IN>;
          my ($level) = (split /\t/, $data)[2];
          next FILE if $level <= $currlevel;
          print "$file - $level";
          print "Processed $numfiles files (total files:$totalfiles)";
          exit;
      }
      Gives:
      $ time perl ysth.pl
      dump.1167332700 - 71
      Processed 8764 files (total files:58154)
      39.55 real 2.35 user 0.80 sys
      Note that I needed to make it skip the most recent 2 days of files as the game was restored yesterday.

      Regarding the xargs/egrep solution - that also works (with a bit of minor tweaking), but it isn't any faster than any of the Perl solutions.

      Thanks,
      Darren :)

        Regarding the xargs/egrep solution - that also works (with a bit of minor tweaking), but it isn't any faster than any of the Perl solutions.
        Except in coding time :)

      That's clever but now you've committed to having everything prior to \nMcDarren\t in RAM. Perhaps that's an issue?

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Optimising a search of several thousand files
by toma (Vicar) on Jan 29, 2007 at 16:51 UTC
    Most optimization opportunities are in the inner loop. Exit the loop as quickly as possible for most cases.

    split() and the array-slice cleverness are a little bit slow, so this skips them most of the time:

    while (<IN>) {
        $numlines++;               # Remove for more speed
        next if not /^McDarren\t/;
        chomp();
        my ($user, $level) = (split /\t/)[0, 3];
        next FILE if $level <= $currlevel;
        print "$file $user $level";
        print "Processed $numlines lines in " .
              "$numfiles files (total files:$totalfiles)";
        exit;
    }
    It should work perfectly the first time! - toma
Re: Optimising a search of several thousand files
by glasswalk3r (Friar) on Jan 29, 2007 at 13:22 UTC

    Try running this:

    #!/usr/bin/perl -wl
    use strict;

    my $dir = qw(/home/idlerpg/graphdump);
    opendir( DIR, $dir ) or die "Cannot open $dir:$!";
    my @files = reverse sort readdir(DIR);

    my $currlevel  = 68;
    my $numfiles;
    my $numlines;
    my $totalfiles = scalar @files;
    my $user;
    my $level;

    for my $file (@files) {
        open IN, '<', "$dir/$file" or die "Ack!: $!";
        $numfiles++;
        while (<IN>) {
            $user  = '';
            $level = '';
            $numlines++;
            chomp();
            ( $user, $level ) = ( split /\t/o )[ 0, 3 ];
            next if $user ne 'McDarren';
            last if $level <= $currlevel;
            print $file, $user, $level;
        }
        close(IN);
    }
    print "Processed $numlines lines in $numfiles files (total files: $totalfiles)";

    Of course, I didn't test it, so I'm not sure it will work as you expect.

    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
      As far as I can see, the only real change you made there from my original code was to add the /o modifier to the pattern match within the split. This doesn't seem to make any noticeable difference.

        Did you test it with Benchmark? Some Perl monks believe that qr// works better than using /o in regex matching (though I never tried comparing the two). Anyway, if you use one of them you might get better results, since the regex will not be compiled again and again.

        Updated: sorry, /o will do absolutely nothing, as described below:

        It should be immediately apparent that using /o on a regex with *no* interpolation has zero effect. You could use it on a regex that uses interpolation but then you get delayed runtime effects. In general I'd much prefer all of my compilation to occur at compilation time. As a matter of practice string evals are restricted, this works similarly. If you use qr// expressions then you've shunted the regex compilation and syntax checking off to the normal script compilation time. The alternative is that you discover a syntax error at runtime. Yuck.

        More information about that is here: /o is dead, long live qr//!. On the other hand, qr// should work.

        Another suggestion is to play around with unpack and see if you can parse the line without a regular expression.

        Other things you could try: load the entire file into memory and use foreach instead of while (of course, this will depend on the file size and the memory you have available). Also, instead of print $file, $user, $level, you could push this information onto an array and print it at the end of the script, to avoid repeated system calls.

        And as a last resort (since these may be a lot of work for little gain), you may want to remove use strict and -w, as you won't need them once you consider the code stable enough. perlcc -B may also help, especially because your program is not importing any modules.

        One thing I didn't understand in your code is why it doesn't close the filehandles. The block labels were also not necessary, and the last print statement could be moved outside the loop.
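
        A rough sketch of the qr// and output-buffering suggestions together (untested, keeping the structure of the code above):

            my $tab = qr/\t/;    # compile the pattern once, reuse it for every line
            my @report;          # buffer the results instead of printing in the loop

            for my $file (@files) {
                open IN, '<', "$dir/$file" or die "Ack!: $!";
                $numfiles++;
                while (<IN>) {
                    $numlines++;
                    chomp();
                    my ( $user, $level ) = ( split $tab, $_ )[ 0, 3 ];
                    next if !defined $user or $user ne 'McDarren';
                    last if $level <= $currlevel;
                    push @report, "$file $user $level";
                }
                close(IN);
            }

            print for @report;   # a single burst of output at the end
            print "Processed $numlines lines in $numfiles files (total files: $totalfiles)";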

        Alceu Rodrigues de Freitas Junior
        ---------------------------------
        "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
Re: Optimising a search of several thousand files
by blahblah (Friar) on Jan 29, 2007 at 16:50 UTC
    The solution for the next 12 months:
    s/$level/250000/ if ($user eq "McDarren");
    ;)