Re^3: Efficient regex matching with qr//; Can I do better?

Replies are listed 'Best First'.
Re^4: Efficient regex matching with qr//; Can I do better? (Benchmark) by moritz (Cardinal) on Jul 11, 2008 at 10:42 UTC
To illustrate the speed difference I wrote a small benchmark that assembles 500 random strings of length 5 to 15 into a regex, and matches that against a random string with 1 million characters. Here's the result: `# perl 5.10.0: timethis 10: 0 wallclock secs ( 0.33 usr + 0.00 sys = 0.33 CPU) @ 3 +0.30/s (n=10) (warning: too few iterations for a reliable count) # perl 5.8.8: timethis 10: 79 wallclock secs (78.92 usr + 0.04 sys = 78.96 CPU) @ +0.13/s (n=10)` [download] This is on Linux, but I guess the results on Windows are similar. You see that in this case perl5.10.0 takes less than a second, while perl5.8.8 takes over a minute. I also tried it with a shorter target (100 chars) and more iterations, and the speed differences are similar. Read more... (664 Bytes)	[reply] [d/l] [select]
Re^5: Efficient regex matching with qr//; Can I do better? (Benchmark) by ysth (Canon) on Jul 11, 2008 at 20:22 UTC
Unfortunately, there's a limit to how much data can appear in alterations and still be optimized into a trie in 5.10.0. I think you are within that limit, but kruppy will not be. Try your benchmark again with 10000 strings. -- Online Fortune Cookie Search	[reply]
Re^6: Efficient regex matching with qr//; Can I do better? (Benchmark) by tilly (Archbishop) on Jul 12, 2008 at 00:40 UTC
Well if it doesn't do it automatically, you can ask it to do it explicitly. The code I wrote for RE (tilly) 4: SAS log scanner should still work in Perl 5.10.	[reply]
Re^4: Efficient regex matching with qr//; Can I do better? by ysth (Canon) on Jul 11, 2008 at 20:35 UTC
The idea of using an array and specify indexes I didn't really understand the benefit of at first glance. The idea is to eliminate as much code as possible specifically from the parts that run for each pattern and each text string. This means this bit: `while (my($text_id,$text) = each(%hash2) ) { if ($text =~ $match)` [download] Which, for each iteration, does this: each returns the key and value those are copied in a list assigment the match is attempted Both the copy and the overhead of the list assignment (which is not insignificant, when repeated a billion times) aren't necessary. If the text strings are unique, such that you can determine the $text_id given $text, you can leave out the array stuff and just do this: `my @text = values %hash2; my %reverse_hash2 = reverse %hash2; while ( my($pattern,$high_lvl_id) = each(%hash1) ) { my $match = qr/\b$pattern\b/; for my $text (@text) { if ( $text =~ $match ) { my $text_id = $reverese_hash2{$text}; ... } } }` [download] Because for (@array) doesn't copy array but just iterates over it, and $text is aliased to each element in turn, not copied over it, this should be faster. -- Online Fortune Cookie Search	[reply] [d/l] [select]
Re^5: Efficient regex matching with qr//; Can I do better? by kruppy (Initiate) on Jul 14, 2008 at 05:46 UTC
That is true. Does that also mean that it would be more efficient to write `foreach my $text_id (keys %hash2) { if ($hash2{$text_id} =~ $match) { ... } }` [download] because then you copy nothing? If this is the case I was fooled by someone else (not in this thread) who said that using each was more efficient than using for/foreach... Your second suggestion would, however, not work because there is no guarantee the strings are unique (in fact, I am certain they are not). Thanks.	[reply] [d/l]
Re^6: Efficient regex matching with qr//; Can I do better? by ysth (Canon) on Jul 14, 2008 at 05:49 UTC
Try it and see? each() is more memory-efficient, but you aren't talking about millions of hash entries, so that's not likely to be a concern. -- Online Fortune Cookie Search	[reply]