Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have this little query script and it's very very slow!
Is there a mistake or is there a better way to make the searching faster?

@id = <ID_LIST>;
chomp(@id);
while ($data = <DB>) {
    chomp($data);
    foreach $id (@id) {
        if ($data =~ /$id/) {
            print OUT "$data\n";
        }
    }
}

Sample contents of the ID_LIST:

user45-16.dept12
user35-18.dept19
user05-14.dept10

Sample contents of the DB (3 MB of text):

home--user45-16.dept12-ph#435
home66--user05-17.dept19-ddfg#5;v6.0234
home4--user35-18.dept19-hj#5
home87--user05-14.dept10-dfg#7;v2.0

Replies are listed 'Best First'.
Re: need more efficient way to write this query script
by Ovid (Cardinal) on Aug 15, 2002 at 23:42 UTC

    Just off the top of my head (read: untested), I'd use a hash. If your IDs are truly that simple to parse, try this:

    chomp( my @id = <ID_LIST> );
    my %id;
    @id{@id} = (1) x @id;

    # dropped the chomp on $data because you added the \n back in
    while ( my $data = <DB> ) {
        my ($curr_id) = $data =~ /(user\d\d-\d\d\.dept\d\d)/;
        if ( $curr_id and exists $id{ $curr_id } ) {
            print OUT $data;
        }
    }

    The problem you were having is that nested loops tend to not scale well. If your while loop iterates over 10 elements and you have 10 elements in @id, then you have a total of 100 (10 x 10) iterations. Bump that up to 100 items in the while loop and 100 elements in @id and you have 100 x 100 iterations: 10,000. Whenever you see something like that, the trick is to figure out how to remove one of the loops. Even if my single lookup on the hash key were horribly inefficient (for example, if my regex to extract the id had much backtracking), that would still probably scale better than a nested loop.
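    To make the scaling argument concrete, here is a self-contained sketch of the hash-lookup approach using the sample data from the original post (the `@id` and `@db` arrays stand in for the ID_LIST and DB filehandles):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the thread; in the real script these would be
# read from the ID_LIST and DB filehandles.
my @id = qw(user45-16.dept12 user35-18.dept19 user05-14.dept10);
my @db = (
    'home--user45-16.dept12-ph#435',
    'home66--user05-17.dept19-ddfg#5;v6.0234',
    'home4--user35-18.dept19-hj#5',
    'home87--user05-14.dept10-dfg#7;v2.0',
);

# Build the lookup hash once: a single pass over @id.
my %id;
@id{@id} = (1) x @id;

# A single pass over the data, with an O(1) hash lookup per line
# instead of an inner loop over every ID.
my @matches;
for my $data (@db) {
    my ($curr_id) = $data =~ /(user\d\d-\d\d\.dept\d\d)/;
    push @matches, $data if $curr_id and exists $id{$curr_id};
}
print "$_\n" for @matches;
```

    On the sample data this prints the three lines whose embedded ID appears in the ID_LIST, and skips the `user05-17.dept19` line, which does not.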

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: need more efficient way to write this query script
by RMGir (Prior) on Aug 15, 2002 at 23:38 UTC
    I'd suggest using a *DBM of some kind as a tied hash for your DB; once you've built it, you get almost free lookups for your values, which would save you your inner loop.

    You'd have to encode multiple values for each key, but you could do that by comma separating them and then doing a split on the value the DB returns. It'd still be MUCH faster than looping over a flat file.

    Check out the documentation for AnyDBM_File; it's a bit odd to use tied hashes at first, but you get used to it.
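    A minimal sketch of that idea, assuming a made-up DBM file name (`db_index`) and comma-separated multiple values per key as described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use AnyDBM_File;

# Tie a hash to a DBM file; 'db_index' is a hypothetical file name.
tie my %db, 'AnyDBM_File', 'db_index', O_RDWR | O_CREAT, 0644
    or die "can't tie DBM file: $!";

# Build step: store each record under its ID, comma-separating
# multiple records that share a key.
sub add_record {
    my ($id, $record) = @_;
    $db{$id} = exists $db{$id} ? "$db{$id},$record" : $record;
}

add_record('user45-16.dept12', 'home--user45-16.dept12-ph#435');
add_record('user45-16.dept12', 'home99--user45-16.dept12-xy#1');  # invented 2nd record

# Lookup step: one near-free hash access, then split the stored value.
my $val = $db{'user45-16.dept12'};
my @records = defined $val ? split(/,/, $val) : ();
print "$_\n" for @records;

untie %db;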
    --
    Mike

Re: need more efficient way to write this query script
by grep (Monsignor) on Aug 15, 2002 at 23:26 UTC
    My quick suggestion would be to use index instead of this regex $data =~ /$id/

    if (index($data,$id)+1){

    UPDATE: Ovid's and RMGir's solutions are much more scalable. My suggestion is simply to use index instead of regexes.
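    For anyone unfamiliar with the `+1` idiom: index returns the 0-based position of the substring, or -1 when it is absent, so adding 1 yields a false value (0) exactly on a miss. A small illustration using a line from the sample data:

```perl
use strict;
use warnings;

my $data = 'home--user45-16.dept12-ph#435';

# index returns the 0-based position, or -1 if the substring is absent.
my $pos_hit  = index($data, 'user45-16.dept12');   # 6
my $pos_miss = index($data, 'user99-99.dept99');   # -1

# Adding 1 turns -1 into 0 (false) and any hit into a true value.
print "found\n"     if index($data, 'user45-16.dept12') + 1;
print "not found\n" unless index($data, 'user99-99.dept99') + 1;
```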

    Benchmark

    Benchmark: timing 10000 iterations of index, regex...
         index:  3 wallclock secs ( 3.52 usr +  0.00 sys =  3.52 CPU) @ 2840.91/s (n=10000)
         regex: 92 wallclock secs (91.44 usr +  0.00 sys = 91.44 CPU) @ 109.36/s (n=10000)

    Code for benchmark

    sub regex {
        foreach my $data (@db) {
            foreach my $id (@id) {
                if ($data =~ /$id/) { 1 }
            }
        }
    }

    sub index {
        foreach my $data (@db) {
            foreach my $id (@id) {
                if (index($data, $id) + 1) { 1 }
            }
        }
    }

    UPDATE 2: To reiterate Ovid's point - the benchmark of my nested-loop solution converges with Ovid's with just a hundred entries in each of your data sets.



    grep
    <hand_wave> These are not the monks you're looking for </hand_wave>
      Geek .sig critique...

      Confucius say: have nerve to create fancy-dan "funny" .sig regarding sacred A New Hope - better get quote right.

      grep say "These are not the monks you're looking for"

      Obi-Wan Kenobi actually say "These aren't the monks you're looking for"

      Moral of story: grep must actually use actual Obi-Wan quote before constructing fancy-dan HTML .sig.
        Gosh, I remember when we used to call that "Star Wars".

        "These aren't the pants you're looking for."

Re: need more efficient way to write this query script
by dws (Chancellor) on Aug 16, 2002 at 02:11 UTC
    is there a better way to make the searching faster?

    Indeed there is. You can do away with the inner foreach loop (and a lot of regex recompiling) by constructing a regex and compiling it once. Change

    while ($data = <DB>) {
        chomp($data);
        foreach $id (@id) {
            if ($data =~ /$id/) {
                print OUT "$data\n";
            }
        }
    }
    to
    my $regex = '(?:' . join('|', @id) . ')';

    while ( $data = <DB> ) {
        print OUT $data if $data =~ /$regex/o;
    }
    The (?: ... ) surrounding the regex is to avoid setting $1 as a side effect. This saves a tiny bit of work.

    The /o modifier on the regex invocation says to compile the regex once (a big win). Bye bye, inner loop.

    Note that you also do away with the chomp(), since it doesn't affect the regex, and you'll need a newline on the end of the string anyway.
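    A runnable sketch of this approach on the sample data, with two small additions not in the post above: `quotemeta` to escape the literal dots in the IDs (otherwise `.` matches any character), and `qr//` to precompile the pattern, which serves the same purpose as `/o` here:

```perl
use strict;
use warnings;

my @id = qw(user45-16.dept12 user35-18.dept19 user05-14.dept10);
my @db = (
    'home--user45-16.dept12-ph#435',
    'home66--user05-17.dept19-ddfg#5;v6.0234',
    'home4--user35-18.dept19-hj#5',
    'home87--user05-14.dept10-dfg#7;v2.0',
);

# Build one alternation from all IDs; quotemeta escapes the '.'
# so it matches a literal dot rather than any character.
my $pattern = '(?:' . join('|', map { quotemeta } @id) . ')';
my $regex   = qr/$pattern/;    # compiled once, reused every line

# One pass over the data, one regex match per line - no inner loop.
my @matches = grep { /$regex/ } @db;
print "$_\n" for @matches;
```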

      Eek! While I suspect that would be faster, it's also likely to involve a huge amount of backtracking if you have many IDs. Further, if the IDs are similar, this will decrease the performance even more. A straight hash lookup is going to be faster.

      Of course, showing people how to dynamically build regexes (even simple ones) is always a nice trick :)

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

        While I suspect that would be faster, it's also likely to involve a huge amount of backtracking if you have many IDs.

        True, but the target strings shown are pretty small. I'd benchmark this one before making assumptions about where the cutover point between using a regex and using a hash is. I suspect that for a small number of keys, the regex wins.


        Hm... the original post doesn't give us much guidance about the cardinality of the key file. Given that, I'd probably go with a hash after all.