Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have this little query script and it's very very slow!
Is there a mistake or is there a better way to make the searching faster?

@id = <ID_LIST>;
chomp(@id);
while ($data = <DB>) {
    chomp($data);
    foreach $id (@id) {
        if ($data =~ /$id/) {
            print OUT "$data\n";
        }
    }
}

Sample contents of the ID_LIST:

user45-16.dept12
user35-18.dept19
user05-14.dept10

Sample contents of the DB (3 MB of text):

home--user45-16.dept12-ph#435
home66--user05-17.dept19-ddfg#5;v6.0234
home4--user35-18.dept19-hj#5
home87--user05-14.dept10-dfg#7;v2.0

Replies are listed 'Best First'.
Re: need more efficient way to write this query script
by Ovid (Cardinal) on Aug 15, 2002 at 23:42 UTC

    Just off the top of my head (read: untested), I'd use a hash. If your IDs are truly that simple to parse, try this:

    chomp( my @id = <ID_LIST> );
    my %id;
    @id{@id} = (1) x @id;

    # dropped the chomp on $data because you added the \n back in
    while ( my $data = <DB> ) {
        my ($curr_id) = $data =~ /(user\d\d-\d\d\.dept\d\d)/;
        if ( $curr_id and exists $id{ $curr_id } ) {
            print OUT $data;
        }
    }

    The problem you were having is that nested loops tend to not scale well. If your while loop iterates over 10 elements and you have 10 elements in @id, then you have a total of 100 (10 x 10) iterations. Bump that up to 100 items in the while loop and 100 elements in @id and you have 100 x 100 iterations: 10,000. Whenever you see something like that, the trick is to figure out how to remove one of the loops. Even if my single lookup on the hash key were horribly inefficient (for example, if my regex to extract the id had much backtracking), that would still probably scale better than a nested loop.
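    To make the scaling argument concrete, here is a self-contained sketch of the hash-lookup approach using the sample data from the original post (the `@id` and `@db` arrays stand in for the ID_LIST and DB filehandles):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the thread; in the real script these would be
# read from the ID_LIST and DB filehandles.
my @id = qw(user45-16.dept12 user35-18.dept19 user05-14.dept10);
my @db = (
    'home--user45-16.dept12-ph#435',
    'home66--user05-17.dept19-ddfg#5;v6.0234',
    'home4--user35-18.dept19-hj#5',
    'home87--user05-14.dept10-dfg#7;v2.0',
);

# Build the lookup hash once: a single pass over @id.
my %id;
@id{@id} = (1) x @id;

# A single pass over the data, with an O(1) hash lookup per line
# instead of an inner loop over every ID.
my @matches;
for my $data (@db) {
    my ($curr_id) = $data =~ /(user\d\d-\d\d\.dept\d\d)/;
    push @matches, $data if $curr_id and exists $id{$curr_id};
}
print "$_\n" for @matches;
```

    On the sample data this prints the three lines whose embedded ID appears in the ID_LIST, and skips the `user05-17.dept19` line, which does not.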

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: need more efficient way to write this query script
by RMGir (Prior) on Aug 15, 2002 at 23:38 UTC
    I'd suggest using a *DBM of some kind as a tied hash for your DB; once you've built it, you get almost free lookups for your values, which would save you your inner loop.

    You'd have to encode multiple values for each key, but you could do that by comma separating them and then doing a split on the value the DB returns. It'd still be MUCH faster than looping over a flat file.

    Check out the documentation for AnyDBM_File; it's a bit odd to use tied hashes at first, but you get used to it.
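    A minimal sketch of that idea, assuming a made-up DBM file name (`db_index`) and comma-separated multiple values per key as described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use AnyDBM_File;

# Tie a hash to a DBM file; 'db_index' is a hypothetical file name.
tie my %db, 'AnyDBM_File', 'db_index', O_RDWR | O_CREAT, 0644
    or die "can't tie DBM file: $!";

# Build step: store each record under its ID, comma-separating
# multiple records that share a key.
sub add_record {
    my ($id, $record) = @_;
    $db{$id} = exists $db{$id} ? "$db{$id},$record" : $record;
}

add_record('user45-16.dept12', 'home--user45-16.dept12-ph#435');
add_record('user45-16.dept12', 'home99--user45-16.dept12-xy#1');  # invented 2nd record

# Lookup step: one near-free hash access, then split the stored value.
my $val = $db{'user45-16.dept12'};
my @records = defined $val ? split(/,/, $val) : ();
print "$_\n" for @records;

untie %db;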
    --
    Mike

Re: need more efficient way to write this query script
by grep (Monsignor) on Aug 15, 2002 at 23:26 UTC
    My quick suggestion would be to use index instead of this regex $data =~ /$id/

    if (index($data,$id)+1){

    UPDATE: Ovid's and RMGir's solutions are much more scalable. My suggestion is simply to use index instead of regexes.
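    For anyone unfamiliar with the `+1` idiom: index returns the 0-based position of the substring, or -1 when it is absent, so adding 1 yields a false value (0) exactly on a miss. A small illustration using a line from the sample data:

```perl
use strict;
use warnings;

my $data = 'home--user45-16.dept12-ph#435';

# index returns the 0-based position, or -1 if the substring is absent.
my $pos_hit  = index($data, 'user45-16.dept12');   # 6
my $pos_miss = index($data, 'user99-99.dept99');   # -1

# Adding 1 turns -1 into 0 (false) and any hit into a true value.
print "found\n"     if index($data, 'user45-16.dept12') + 1;
print "not found\n" unless index($data, 'user99-99.dept99') + 1;
```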

    Benchmark

    Benchmark: timing 10000 iterations of index, regex...
         index:  3 wallclock secs ( 3.52 usr +  0.00 sys =  3.52 CPU) @ 2840.91/s (n=10000)
         regex: 92 wallclock secs (91.44 usr +  0.00 sys = 91.44 CPU) @ 109.36/s (n=10000)

    Code for benchmark

    sub regex {
        foreach my $data (@db) {
            foreach my $id (@id) {
                if ($data =~ /$id/) { 1 }
            }
        }
    }

    sub index {
        foreach my $data (@db) {
            foreach my $id (@id) {
                if (index($data, $id) + 1) { 1 }
            }
        }
    }

    UPDATE 2: To reiterate Ovid's point - the benchmark of my nested-loop solution converges with Ovid's with just a hundred entries in each of your data sets.



    grep
    <hand_wave> These are not the monks you're looking for </hand_wave>
      Geek .sig critique...

      Confucius say: have nerve to create fancy-dan "funny" .sig regarding sacred A New Hope - better get quote right.

      grep say "These are not the monks you're looking for"

      Obi-Wan Kenobi actually say "These aren't the monks you're looking for"

      Moral of story: grep must actually use actual Obi-Wan quote before constructing fancy-dan HTML .sig.
        Gosh, I remember when we used to call that "Star Wars".

        "These aren't the pants you're looking for."

Re: need more efficient way to write this query script
by dws (Chancellor) on Aug 16, 2002 at 02:11 UTC
    is there a better way to make the searching faster?

    Indeed there is. You can do away with the inner foreach loop (and a lot of regex recompiling) by constructing a regex and compiling it once. Change

    while ($data = <DB>) {
        chomp($data);
        foreach $id (@id) {
            if ($data =~ /$id/) {
                print OUT "$data\n";
            }
        }
    }
    to
    my $regex = '(?:' . join('|', @id) . ')';

    while ( $data = <DB> ) {
        print OUT $data if $data =~ /$regex/o;
    }
    The (?: ... ) surrounding the regex is to avoid setting $1 as a side effect. This saves a tiny bit of work.

    The /o modifier on the regex invocation says to compile the regex once (a big win). Bye bye, inner loop.

    Note that you also do away with the chomp(), since it doesn't affect the regex, and you'll need a newline on the end of the string anyway.
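    A runnable sketch of this approach on the sample data, with two small additions not in the post above: `quotemeta` to escape the literal dots in the IDs (otherwise `.` matches any character), and `qr//` to precompile the pattern, which serves the same purpose as `/o` here:

```perl
use strict;
use warnings;

my @id = qw(user45-16.dept12 user35-18.dept19 user05-14.dept10);
my @db = (
    'home--user45-16.dept12-ph#435',
    'home66--user05-17.dept19-ddfg#5;v6.0234',
    'home4--user35-18.dept19-hj#5',
    'home87--user05-14.dept10-dfg#7;v2.0',
);

# Build one alternation from all IDs; quotemeta escapes the '.'
# so it matches a literal dot rather than any character.
my $pattern = '(?:' . join('|', map { quotemeta } @id) . ')';
my $regex   = qr/$pattern/;    # compiled once, reused every line

# One pass over the data, one regex match per line - no inner loop.
my @matches = grep { /$regex/ } @db;
print "$_\n" for @matches;
```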

      Eek! While I suspect that would be faster, it's also likely to involve a huge amount of backtracking if you have many IDs. Further, if the IDs are similar, this will decrease the performance even more. A straight hash lookup is going to be faster.

      Of course, showing people how to dynamically build regexes (even simple ones) is always a nice trick :)

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

        While I suspect that would be faster, it's also likely to involve a huge amount of backtracking if you have many IDs.

        True, but the target strings shown are pretty small. I'd benchmark this one before making assumptions about where the cutover point between using a regex and using a hash is. I suspect that for a small number of keys, the regex wins.


        Hm... the original post doesn't give us much guidance about the cardinality of the key file. Given that, I'd probably go with a hash after all.