in reply to Re^2: Efficient regex matching with qr//; Can I do better?
in thread Efficient regex matching with qr//; Can I do better?

Thanks a lot all of you. I'm sorry I didn't specify the types of pattern clearly, but both the patterns and the texts consist of literals (a-z,0-9) and single spaces only. I have only had time to briefly look at your answers. The idea of doing as much as possible outside the loop is probably a good one which I'll try. The idea of using an array and specify indexes I didn't really understand the benefit of at first glance. Thirdly, the idea to just do one (or a few) large regex matches sounds very interesting. I use ActiveState Perl v. 5.8.8 for windows, so I realize it might be worth upgrading. I don't quite know what their latest version available is. Again, thanks a lot for your efforts!
  • Comment on Re^3: Efficient regex matching with qr//; Can I do better?

Replies are listed 'Best First'.
Re^4: Efficient regex matching with qr//; Can I do better? (Benchmark)
by moritz (Cardinal) on Jul 11, 2008 at 10:42 UTC
    To illustrate the speed difference I wrote a small benchmark that assembles 500 random strings of length 5 to 15 into a regex, and matches that against a random string with 1 million characters. Here's the result:
    # perl 5.10.0: timethis 10: 0 wallclock secs ( 0.33 usr + 0.00 sys = 0.33 CPU) @ 3 +0.30/s (n=10) (warning: too few iterations for a reliable count) # perl 5.8.8: timethis 10: 79 wallclock secs (78.92 usr + 0.04 sys = 78.96 CPU) @ +0.13/s (n=10)

    This is on Linux, but I guess the results on Windows are similar. You see that in this case perl5.10.0 takes less than a second, while perl5.8.8 takes over a minute.

    I also tried it with a shorter target (100 chars) and more iterations, and the speed differences are similar.

      Unfortunately, there's a limit to how much data can appear in alterations and still be optimized into a trie in 5.10.0. I think you are within that limit, but kruppy will not be. Try your benchmark again with 10000 strings.
        Well if it doesn't do it automatically, you can ask it to do it explicitly. The code I wrote for RE (tilly) 4: SAS log scanner should still work in Perl 5.10.
Re^4: Efficient regex matching with qr//; Can I do better?
by ysth (Canon) on Jul 11, 2008 at 20:35 UTC
    The idea of using an array and specify indexes I didn't really understand the benefit of at first glance.
    The idea is to eliminate as much code as possible specifically from the parts that run for each pattern and each text string. This means this bit:
    while (my($text_id,$text) = each(%hash2) ) { if ($text =~ $match)
    Which, for each iteration, does this:
    • each returns the key and value
    • those are copied in a list assigment
    • the match is attempted
    Both the copy and the overhead of the list assignment (which is not insignificant, when repeated a billion times) aren't necessary.

    If the text strings are unique, such that you can determine the $text_id given $text, you can leave out the array stuff and just do this:

    my @text = values %hash2; my %reverse_hash2 = reverse %hash2; while ( my($pattern,$high_lvl_id) = each(%hash1) ) { my $match = qr/\b$pattern\b/; for my $text (@text) { if ( $text =~ $match ) { my $text_id = $reverese_hash2{$text}; ... } } }
    Because for (@array) doesn't copy array but just iterates over it, and $text is aliased to each element in turn, not copied over it, this should be faster.
      That is true. Does that also mean that it would be more efficient to write
      foreach my $text_id (keys %hash2) { if ($hash2{$text_id} =~ $match) { ... } }
      because then you copy nothing? If this is the case I was fooled by someone else (not in this thread) who said that using each was more efficient than using for/foreach...
      Your second suggestion would, however, not work because there is no guarantee the strings are unique (in fact, I am certain they are not).
      Thanks.
        Try it and see?

        each() is more memory-efficient, but you aren't talking about millions of hash entries, so that's not likely to be a concern.