in reply to Optimising a search of several thousand files

If the individual files are of modest size (for some suitable definition of "modest") then slurping the file and using index to search for 'McDarren' may provide a useful speed improvement.
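A minimal sketch of the slurp-and-index idea (the sample file name and its contents here are made up purely so the snippet is self-contained):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a small sample file just for this demo.
my $file = 'sample_dump.txt';
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "someuser\ta\tb\t12\nMcDarren\tc\td\t71\n";
close $out;

# Slurp the whole file in one read by localising $/,
# then search it with index() instead of scanning line by line.
open my $in, '<', $file or die "Cannot open $file: $!";
my $data = do { local $/; <$in> };
close $in;

my $pos = index $data, 'McDarren';
print $pos >= 0 ? "found at offset $pos\n" : "not found\n";
unlink $file;
```

Because index() is a plain substring search with no regex machinery, it is often noticeably cheaper than a per-line match when the files fit comfortably in memory.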


DWIM is Perl's answer to Gödel

Re^2: Optimising a search of several thousand files
by McDarren (Abbot) on Jan 29, 2007 at 05:46 UTC
    Thanks, I gave that a go...
    #!/usr/bin/perl -wl
    use strict;

    my $dir = qw(/home/idlerpg/graphdump);
    opendir(DIR, $dir) or die "Cannot open $dir:$!";
    my @files = reverse sort readdir(DIR);
    my $currlevel = 68;
    my $numfiles;
    my $totalfiles = scalar @files;

    FILE:
    for my $file (@files) {
        open IN, '<', "$dir/$file" or die "Ack!:$!";
        $numfiles++;
        undef $/;
        my $data = <IN>;
        my $pos = index($data, 'McDarren');
        $/ = "\n";
        next FILE if $pos == -1;
        seek(IN, $pos, 0);
        chomp(my $line = <IN>);
        my ($user, $level) = (split /\t/, $line)[0, 3];
        next FILE if $level <= $currlevel;
        print "$file $user $level";
        print "Processed $numfiles files (total files:$totalfiles)";
        exit;
    }
    Which produced:
    $ time ./gfather.pl
    dump.1167332700 McDarren 71
    Processed 9054 files (total files:57868)
       39.15 real  0.84 user  0.75 sys
    A slight improvement, but not a great deal. But I had to perldoc index and seek, as I'd not used either function before, so my implementation may be a bit wonky ;)

    However, it gave me another approach to take, which was really the whole point of my post in the first place.

    (Note that the number of files processed is slightly more, as the dumps continue to accumulate every 5 mins)

      Urk! What is that seek doing in there? You already have the line in $data and the start point in $pos. (split /\t/, substr $data, $pos)[0,3] ought to do the job. It may be faster to constrain split to just finding the first 4 elements, but I'd have to benchmark that.


        heh, well I did say that my implementation may have been a bit wonky.

        Anyway, I re-worked it as you suggested, and interestingly it is now significantly slower...

        $ time ./gfather.pl
        dump.1167332700 McDarren 71
        Processed 9098 files (total files:57912)
           56.07 real  4.10 user  0.97 sys
        $ time ./gfather.pl
        dump.1167332700 McDarren 71
        Processed 9098 files (total files:57912)
           51.58 real  4.15 user  0.91 sys
        The re-worked section of the code looks like so:
        ...
        undef $/;
        my $data = <IN>;
        my $pos = index($data, 'McDarren');
        $/ = "\n";
        next FILE if $pos == -1;
        # seek(IN, $pos, 0);
        # chomp(my $line = <IN>);
        my ($user, $level) = (split /\t/, substr $data, $pos)[0, 3];
        ...
        Adding a limit to the split seems to improve things slightly:
        my ($user, $level) = (split /\t/, (substr $data, $pos), 5)[0, 3];

        $ time ./gfather.pl
        dump.1167332700 McDarren 71
        Processed 9100 files (total files:57914)
           47.50 real  0.79 user  0.80 sys
        Not a proper benchmark, I realise. Actually, how would I go about benchmarking this?
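        For comparisons like this, the core Benchmark module is the usual tool: cmpthese() runs each coderef repeatedly and prints a rate table. A minimal sketch comparing the two split variants (the sample line and its field layout are assumed from the posted code, with the user in field 0 and the level in field 3):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical one-line sample in the tab-separated layout
# the posted code expects.
my $data = "McDarren\tx\ty\t71\tz\tw\n";
my $pos  = index $data, 'McDarren';

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese( -1, {
    full_split    => sub { ( split /\t/, substr $data, $pos )[ 0, 3 ] },
    limited_split => sub { ( split /\t/, (substr $data, $pos), 5 )[ 0, 3 ] },
} );
```

        Benchmark only measures the split itself here; with your real script, most of the wall-clock time is likely file I/O, so timing whole runs with time(1), as you've been doing, is still a fair sanity check.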