Re: Algorithm to search and replace on data file, based on a table file?

Another way to solve your problem would be to search and replace all the patterns simultaneously in one regular expression instead of doing them sequentially. This will also give you a fairly substantial performance boost, since instead of invoking the regular expression engine 600 times for every line you just do it once per line.
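
Besides speed, a single pass also behaves differently when a replacement string is itself one of the search strings. The sketch below uses a made-up two-entry table (the `%table` contents and the sample input are assumptions for illustration, not your data) to show that sequential substitutions re-scan text that was already replaced, while a single alternation replaces each match exactly once:

```perl
use strict;
use warnings;

# Hypothetical table that swaps two words.
my %table = ( red => 'blue', blue => 'red' );
my $input = 'red fish, blue fish';

# Sequential: one s/// per key, so later passes see the output of earlier ones.
my $sequential = $input;
for my $key ( sort keys %table ) {
    $sequential =~ s/\Q$key\E/$table{$key}/g;
}

# Simultaneous: one alternation, every match is replaced exactly once.
my $alternation  = join '|', map { quotemeta($_) } sort keys %table;
my $simultaneous = $input;
$simultaneous =~ s/($alternation)/$table{$1}/g;

print "sequential:   $sequential\n";      # sequential:   blue fish, blue fish
print "simultaneous: $simultaneous\n";    # simultaneous: blue fish, red fish
```

If your table never maps onto its own keys, both approaches should give the same output and only the speed differs.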

Since it sounds like your patterns are just strings to match against, you can probably get away with joining them all together separated by |. Just note that this is pretty inefficient pre-Perl 5.10, although it is probably faster than what you have now since it moves the loop into the regular expression engine's C code. If you haven't moved to 5.10 yet, look into Regexp::Assemble to create the regular expression.

Your code would then look something like this:

# In your code you are sorting on the length of the replacement strings
# instead of the search strings.  I'm guessing that's not what you want,
# so I switched it to sort on the search patterns.
my @keys_ordered = sort { length $b <=> length $a } keys %$table_ref;

# Join your strings into one big long RE.  quotemeta escapes any regex
# metacharacters so that each key is matched literally.
my $re_string = join '|', map { quotemeta($_) } @keys_ordered;
my $re = qr/($re_string)/;

my $replacecount = 0;
while ( my $line = <INFILE> ) {
    # The inner loop is gone
    $replacecount += ( $line =~ s/$re/$table_ref->{$1}/g );
    print OUTFILE $line;
}    
print "Made $replacecount replacements.\n";
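
The length sort matters because Perl tries the alternatives left to right at each position and takes the first one that matches, not the longest. Here is a small self-contained sketch of what goes wrong otherwise; the table contents, the input string, and the helper name `replace_all` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical table in which one key is a prefix of another.
my %table = ( cat => 'dog', category => 'group' );

# Apply the table to $text, trying the keys in the order given.
sub replace_all {
    my ( $text, @keys ) = @_;
    my $alternation = join '|', map { quotemeta($_) } @keys;
    $text =~ s/($alternation)/$table{$1}/g;
    return $text;
}

my $input = 'category cat';

my $shortest_first =
    replace_all( $input, sort { length $a <=> length $b } keys %table );
my $longest_first =
    replace_all( $input, sort { length $b <=> length $a } keys %table );

print "$shortest_first\n";    # dogegory dog
print "$longest_first\n";     # group dog
```

With the shortest key first, `cat` wins at the start of `category` and the longer key never gets a chance to match.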

-- David Irving

Re^2: Algorithm to search and replace on data file, based on a table file? by dolmen (Beadle) on Sep 24, 2009 at 08:46 UTC
I had the same idea: Grinder's Regexp::Assemble. I fully agree with David's suggestions.
Re^2: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Sep 24, 2009 at 21:31 UTC
Our workplace is standardized on Perl 5.8.8. What is the new feature of 5.10 that enables your technique? Which line of your code uses it? - Chris
Re^3: Algorithm to search and replace on data file, based on a table file? by dirving (Friar) on Sep 25, 2009 at 02:19 UTC
The technique will still work on Perl 5.8; the only difference is performance. In Perl 5.8, if you have a regular expression like `foo|bar|baz`, it tests for each string one at a time. In Perl 5.10 there is an optimization that builds a data structure called a trie, which lets it match against all the strings at the same time. So in Perl 5.8 the time taken is proportional to the number of search strings in your table, while in Perl 5.10 the time taken is largely independent of the size of the table.
As I mentioned in my post, you can use Regexp::Assemble to get performance comparable to Perl 5.10's optimization in Perl 5.8: it takes care of rewriting the regexp in a more efficient way. Even without Regexp::Assemble, this technique will probably be faster on Perl 5.8 than what you have now (and almost certainly no slower). You just won't see the order-of-magnitude speedups that you could probably get by using either 5.10 or Regexp::Assemble.
-- David Irving
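To make the trie idea concrete, here is a minimal pure-Perl sketch built from nested hashes. It is only an illustration of the data structure, not how the regex engine implements it internally, and the names `build_trie` and `match_at` are made up. It also returns the longest matching word, which is a simplification and not the same as the regex engine's leftmost-alternative rule. The point is that the work done at each position depends on the length of the text being walked, not on how many words are in the table:

```perl
use strict;
use warnings;

# Build a trie from literal strings: one nested hash level per character.
sub build_trie {
    my @words = @_;
    my %root;
    for my $word (@words) {
        my $node = \%root;
        for my $char ( split //, $word ) {
            $node->{$char} ||= {};
            $node = $node->{$char};
        }
        $node->{''} = $word;    # the empty-string key marks the end of a word
    }
    return \%root;
}

# Return the longest word in the trie that starts at $pos in $text, or undef.
sub match_at {
    my ( $trie, $text, $pos ) = @_;
    my $node  = $trie;
    my $found = undef;
    while ( $pos < length $text ) {
        my $char = substr $text, $pos, 1;
        last unless exists $node->{$char};
        $node = $node->{$char};
        $found = $node->{''} if exists $node->{''};
        $pos++;
    }
    return $found;
}

my $trie = build_trie(qw(foo bar baz));
my $hit  = match_at( $trie, 'xxbazyy', 2 );
print defined $hit ? "matched $hit\n" : "no match\n";    # matched baz
```

Each step follows a single hash lookup on the next character, so `bar` and `baz` share the work for `ba` instead of being tested separately.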