in reply to Re: Optimising a search of several thousand files
in thread Optimising a search of several thousand files

Thanks, I gave that a go...
#!/usr/bin/perl -wl
use strict;

my $dir = '/home/idlerpg/graphdump';
opendir(DIR, $dir) or die "Cannot open $dir: $!";
my @files = reverse sort readdir(DIR);
my $currlevel  = 68;
my $numfiles;
my $totalfiles = scalar @files;

FILE: for my $file (@files) {
    open IN, '<', "$dir/$file" or die "Ack!: $!";
    $numfiles++;
    undef $/;                 # slurp the whole file
    my $data = <IN>;
    my $pos  = index($data, 'McDarren');
    $/ = "\n";
    next FILE if $pos == -1;
    seek(IN, $pos, 0);
    chomp(my $line = <IN>);
    my ($user, $level) = (split /\t/, $line)[0, 3];
    next FILE if $level <= $currlevel;
    print "$file $user $level";
    print "Processed $numfiles files (total files:$totalfiles)";
    exit;
}
Which produced:
$ time ./gfather.pl
dump.1167332700 McDarren 71
Processed 9054 files (total files:57868)
       39.15 real         0.84 user         0.75 sys
A slight improvement, but not a great deal. But I had to perldoc for index and seek as I've not used either function before, so my implementation may be a bit wonky ;)
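For what it's worth, here's a minimal self-contained sketch (with invented file contents) of how index and seek interact in the script above: seek takes a byte offset, so the readline that follows resumes mid-stream at the matched name rather than at the start of its line. It happens to work here because the name is the first field of the line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented sample data mimicking tab-separated dump lines.
my ($fh, $tmp) = tempfile();
print $fh "other\ta\tb\t10\nMcDarren\tonline\t0\t71\n";
close $fh;

open my $in, '<', $tmp or die "Cannot open $tmp: $!";
my $data = do { local $/; <$in> };     # slurp
my $pos  = index($data, 'McDarren');

# seek() positions the handle at a byte offset; the next readline
# therefore starts at 'McDarren', not at the beginning of that line.
seek($in, $pos, 0);
chomp(my $line = <$in>);
print "$line\n";   # the matched line, starting at the name
```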

However, it gave me another approach to take, which was really the whole point of my post in the first place.

(Note that the number of files processed is slightly more, as the dumps continue to accumulate every 5 mins)

Re^3: Optimising a search of several thousand files
by GrandFather (Saint) on Jan 29, 2007 at 08:50 UTC

    Urk! What is that seek doing in there? You already have the line in $data and the start point in $pos. (split /\t/, substr $data, $pos)[0,3] ought to do the job. It may be faster to constrain split to just finding the first 4 elements, but I'd have to benchmark that.
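    A minimal sketch of that suggestion, with invented sample data in place of the real dump format: split directly from the match position in the slurped buffer, and cap split at 5 fields so it stops scanning once fields [0] and [3] are available.

```perl
use strict;
use warnings;

# Invented tab-separated sample; real dump lines may differ.
my $data = "someuser\tfoo\tbar\t10\nMcDarren\tonline\t0\t71\tmore\n";
my $pos  = index($data, 'McDarren');

# No seek needed: substr starts at the match, and the limit of 5
# means split quits after producing the four fields we care about
# plus one remainder chunk.
my ($user, $level) = (split /\t/, substr($data, $pos), 5)[0, 3];
print "$user $level\n";   # McDarren 71
```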


    DWIM is Perl's answer to Gödel
      heh, well I did say that my implementation may have been a bit wonky.

      Anyway, I re-worked it as you suggested, and interestingly it is now significantly slower...

      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9098 files (total files:57912)
             56.07 real         4.10 user         0.97 sys
      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9098 files (total files:57912)
             51.58 real         4.15 user         0.91 sys
      The re-worked section of the code looks like so:
      ...
      undef $/;
      my $data = <IN>;
      my $pos  = index($data, 'McDarren');
      $/ = "\n";
      next FILE if $pos == -1;
      # seek(IN, $pos, 0);
      # chomp(my $line = <IN>);
      my ($user, $level) = (split /\t/, substr $data, $pos)[0, 3];
      ...
      Adding a limit to the split seems to improve things slightly...
      my ($user, $level) = (split /\t/, (substr $data, $pos), 5)[0, 3];

      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9100 files (total files:57914)
             47.50 real         0.79 user         0.80 sys
      Not a proper benchmark, I realise. Actually, how would I go about benchmarking this?
        Not a proper benchmark, I realise. Actually, how would I go about benchmarking this?

        See Benchmark and Devel::DProf.
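        A rough sketch of the kind of comparison the core Benchmark module makes easy, here pitting the limited split against the unlimited one (the sample line is invented, not the real dump format; the iteration count is arbitrary):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented wide sample line so the limit on split has work to skip.
my $data = join "\t", 'McDarren', 'x', 'y', 71, ('filler') x 20;
my $pos  = 0;

# cmpthese runs each sub the given number of times and prints a
# rate table comparing them.
cmpthese(50_000, {
    unlimited => sub { my @f = (split /\t/, substr $data, $pos)[0, 3] },
    limited   => sub { my @f = (split /\t/, substr($data, $pos), 5)[0, 3] },
});
```

        For per-function timings across the whole script rather than micro-benchmarks, a profiler such as Devel::DProf (mentioned above) gives a different view of where the time actually goes.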

        Alceu Rodrigues de Freitas Junior
        ---------------------------------
        "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill