in reply to Re: Optimising a search of several thousand files
in thread Optimising a search of several thousand files

Thanks, I gave that a go...
#!/usr/bin/perl -wl
use strict;

my $dir = '/home/idlerpg/graphdump';
opendir(DIR, $dir) or die "Cannot open $dir: $!";
my @files = reverse sort readdir(DIR);
my $currlevel  = 68;
my $numfiles;
my $totalfiles = scalar @files;

FILE: for my $file (@files) {
    open IN, '<', "$dir/$file" or die "Ack!: $!";
    $numfiles++;
    undef $/;                 # slurp the whole file
    my $data = <IN>;
    my $pos  = index($data, 'McDarren');
    $/ = "\n";
    next FILE if $pos == -1;
    seek(IN, $pos, 0);
    chomp(my $line = <IN>);
    my ($user, $level) = (split /\t/, $line)[0, 3];
    next FILE if $level <= $currlevel;
    print "$file $user $level";
    print "Processed $numfiles files (total files:$totalfiles)";
    exit;
}
Which produced:
$ time ./gfather.pl
dump.1167332700 McDarren 71
Processed 9054 files (total files:57868)
       39.15 real         0.84 user         0.75 sys
A slight improvement, but not a great deal. But I had to perldoc for index and seek as I've not used either function before, so my implementation may be a bit wonky ;)
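For what it's worth, here's a minimal self-contained sketch (with invented file contents) of how index and seek interact in the script above: seek takes a byte offset, so the readline that follows resumes mid-stream at the matched name rather than at the start of its line. It happens to work here because the name is the first field of the line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented sample data mimicking tab-separated dump lines.
my ($fh, $tmp) = tempfile();
print $fh "other\ta\tb\t10\nMcDarren\tonline\t0\t71\n";
close $fh;

open my $in, '<', $tmp or die "Cannot open $tmp: $!";
my $data = do { local $/; <$in> };     # slurp
my $pos  = index($data, 'McDarren');

# seek() positions the handle at a byte offset; the next readline
# therefore starts at 'McDarren', not at the beginning of that line.
seek($in, $pos, 0);
chomp(my $line = <$in>);
print "$line\n";   # the matched line, starting at the name
```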

However, it gave me another approach to take, which was really the whole point of my post in the first place.

(Note that the number of files processed is slightly more, as the dumps continue to accumulate every 5 mins)

Re^3: Optimising a search of several thousand files
by GrandFather (Saint) on Jan 29, 2007 at 08:50 UTC

    Urk! What is that seek doing in there? You already have the line in $data and the start point in $pos. (split /\t/, substr $data, $pos)[0,3] ought to do the job. It may be faster to constrain split to just finding the first 4 elements, but I'd have to benchmark that.
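    A minimal sketch of that suggestion, with invented sample data in place of the real dump format: split directly from the match position in the slurped buffer, and cap split at 5 fields so it stops scanning once fields [0] and [3] are available.

```perl
use strict;
use warnings;

# Invented tab-separated sample; real dump lines may differ.
my $data = "someuser\tfoo\tbar\t10\nMcDarren\tonline\t0\t71\tmore\n";
my $pos  = index($data, 'McDarren');

# No seek needed: substr starts at the match, and the limit of 5
# means split quits after producing the four fields we care about
# plus one remainder chunk.
my ($user, $level) = (split /\t/, substr($data, $pos), 5)[0, 3];
print "$user $level\n";   # McDarren 71
```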


    DWIM is Perl's answer to Gödel
      heh, well I did say that my implementation may have been a bit wonky.

      Anyway, I re-worked it as you suggested, and interestingly it is now significantly slower...

      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9098 files (total files:57912)
             56.07 real         4.10 user         0.97 sys
      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9098 files (total files:57912)
             51.58 real         4.15 user         0.91 sys
      The re-worked section of the code looks like so:
      ...
      undef $/;
      my $data = <IN>;
      my $pos  = index($data, 'McDarren');
      $/ = "\n";
      next FILE if $pos == -1;
      # seek(IN, $pos, 0);
      # chomp(my $line = <IN>);
      my ($user, $level) = (split /\t/, substr $data, $pos)[0, 3];
      ...
      Adding a limit to the split seems to improve things slightly...
      my ($user, $level) = (split /\t/, (substr $data, $pos), 5)[0, 3];

      $ time ./gfather.pl
      dump.1167332700 McDarren 71
      Processed 9100 files (total files:57914)
             47.50 real         0.79 user         0.80 sys
      Not a proper benchmark, I realise. Actually, how would I go about benchmarking this?
        Not a proper benchmark, I realise. Actually, how would I go about benchmarking this?

        See Benchmark and Devel::DProf.
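        A rough sketch of the kind of comparison the core Benchmark module makes easy, here pitting the limited split against the unlimited one (the sample line is invented, not the real dump format; the iteration count is arbitrary):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented wide sample line so the limit on split has work to skip.
my $data = join "\t", 'McDarren', 'x', 'y', 71, ('filler') x 20;
my $pos  = 0;

# cmpthese runs each sub the given number of times and prints a
# rate table comparing them.
cmpthese(50_000, {
    unlimited => sub { my @f = (split /\t/, substr $data, $pos)[0, 3] },
    limited   => sub { my @f = (split /\t/, substr($data, $pos), 5)[0, 3] },
});
```

        For per-function timings across the whole script rather than micro-benchmarks, a profiler such as Devel::DProf (mentioned above) gives a different view of where the time actually goes.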

        Alceu Rodrigues de Freitas Junior
        ---------------------------------
        "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill