koknat has asked for the wisdom of the Perl Monks concerning the following question:

Hi Experts,

I have a set of ~1GB ASCII files.
In each file, I'd like to search and replace, based on an input table file.
The table file has 2 columns, one for search_string, the other for replace_string
Example of table file entry:
foo bar

I already coded it, but would like some advice.
This code takes 70 minutes on a 1GB ASCII file, with a table file of 600 key/value pairs, running on a HP XW4100 workstation.
Here is the code:
# I have already read the 2-column table file into %table. # I've also already opened INFILE and OUTFILE my @keys_ordered = sort { length $table_ref->{$b} <=> length $table_ref->{$a} } keys %{$table_ref}; # Sorted in reverse order of length, so that longer strings # in the data file are found before shorter strings # Done here to increase speed, vs inside the while loop my $replacecount = 0; while ( my $line = <INFILE> ) { for my $key ( @keys_ordered ) { $replacecount += ( $line =~ s/$key/$table_ref->{$key}/g ); } print OUTFILE $line; } print "Made $replacecount replacements.\n";

This works just fine on a simple set of input data, but I'm worried that this algorithm will have unintended consequences, since it is iterating through every key, even if a replace has already been done on that part of the line.

Suppose that this table file was used:
foo bar
ba BAH

Suppose my input data file contains the line:
"This algorithm is foo"

I want the output data file to become:
"This algorithm is bar"

Instead, the output data file will look like this:
"This algorithm is BAHr"
Since "foo" became "bar", and the next time through the loop "ba" became "BAH"

How can I prevent this from happening?
I cannot limit each line to just one substitution. I need to make multiple substitutions on some lines of my data file.

Thanks,

- Chris Koknat
  • Comment on Algorithm to search and replace on data file, based on a table file?
  • Download Code

Replies are listed 'Best First'.
Re: Algorithm to search and replace on data file, based on a table file?
by MidLifeXis (Monsignor) on Sep 23, 2009 at 17:03 UTC

    I cannot limit each line to just one substitution. I need to make multiple substitutions on some lines of my data file.
    and
    How can I prevent this from happening?
    are at odds with each other. Can you define in what cases you want the data to be manipulated more than once, or those cases where it should be left alone?

    --MidLifeXis

    Please consider supporting my wife as she walks in the 2009 Alzheimer's Walk.

      I'll clarify:
      I want to allow multiple substitutions on each line.
      I don't want the algorithm to make substitutions on data which has already been changed.

      - Chris Koknat

        Now this is an interesting problem.

        One solution that I could think of right off the bat is to add a marker to the beginning and ending of the replacement string before applying it, and then not allowing the search string to operate in the area bounded by an open/close marker. Remove the markers before emitting the output.

        If you do follow this approach, watch out for greedy matching. By your definition, you should never have nested "this is changed" markers, so a minimal match on the search string should suffice.

        --MidLifeXis

        Please consider supporting my wife as she walks in the 2009 Alzheimer's Walk.

        Your spec still has undefined behaviour regarding line ordering. What should happen with
        foo bar foo baz
        ?
Re: Algorithm to search and replace on data file, based on a table file?
by kennethk (Abbot) on Sep 23, 2009 at 17:11 UTC
    Question: do any of your search strings contain regex control characters (., (, [ ...)? If so, you need to modify your regular expression line to

    $replacecount += ( $line =~ s/\Q$key\E/$table_ref->{$key}/g );

    to avoid unintended consequences. It's probably a good idea any, just in case. See String interpolation to see a bit more about how interpolation works in Perl.

    Regarding the main part of your question, I read your issue as "I want all substitutions to only occur if they would have occurred on the original file". Might I suggest, then caching the original line and testing if you would have a match before substituting? You could put an error statement in there as well to make sure the number of substitutions matched the number of expected substitutions. Something like:

    # I have already read the 2-column table file into %table. # I've also already opened INFILE and OUTFILE my @keys_ordered = sort { length $table_ref->{$b} <=> length $table_ref->{$a} } keys %{$table_ref}; # Sorted in reverse order of length, so that longer strings # in the data file are found before shorter strings # Done here to increase speed, vs inside the while loop my $replacecount = 0; while ( my $old_line = <INFILE> ) { my $new_line = $old_line; for my $key ( @keys_ordered ) { my $old_count = my $new_count = 0; if ($old_count = ($old_line =~ /\Q$key\E/g)) { $new_count = ( $new_line =~ s/\Q$key\E/$table_ref->{$key}/ +g ); } if ($new_count != $old_count) { die "Match count failure ($old_count/$new_count) on $line" } $replacecount += $new_count; } print OUTFILE $new_line; } print "Made $replacecount replacements.\n";

      I like your idea of an option for regular expressions. I've used that concept below.

      Here is the updated code, using your idea of testing for a match before substituting. It seems to do the trick.
      my $replacecount = 0; my %linenums; my @keys_ordered = reverse sort { length $table_ref->{$a} <=> length $ +table_ref->{$b} } keys %{$table_ref}; LINE: while ( my $old_line = <INFILE> ) { my $new_line = $old_line; my $old_count = 0; my $new_count = 0; KEY: for my $key ( @keys_ordered ) { #if ($debug) { print "DEBUG: \$key = $key\n" } my $backtrack_line = $new_line; my $regex = ( $opt_regex ? $key : quotemeta($key) ); if ( $old_count = ( $old_line =~ /$regex/g) ) { $new_count = ( $new_line =~ s/$regex/$table_ref->{$key}/g +); if ( $new_count != $old_count ) { if ($debug) { print "DEBUG: Match count failure for k +ey $key on ($old_count/$new_count) on line $.: $old_line" } $new_count = $old_count; $new_line = $backtrack_line; next KEY; } if ( $debug ) { print "DEBUG: line $. replacing $key with $table_ref- +>{$key}\n"; $linenums{$.}++; } $replacecount += $new_count; } else { } # do nothing } print OUTFILE $new_line; next LINE; } if ( $debug ) { for my $linenum ( sort keys %linenums ) { if ( $linenums{$linenum} > 1 ) { print "DEBUG: WARNING: line $linenum had $linenums{$line +num} replacements!\n"; } } } print "Made $replacecount replacements in $output_file\n"; print "Done.\n\n";

      Thanks!

      - Chris Koknat
        Here is the latest copy of the code, in its entirety.
        It can handle multiple input files, gives usage information, and has several options.

        I kept the regex searching simple, so that it would work on any machine without additional libraries.
        #!/usr/bin/env perl use warnings; use strict; use Getopt::Long; # process command line options use Dumpvalue; # used for debugging hash data structures &usage if ( @ARGV < 3 ); sub usage { print "\n search_replace_table.pl by Chris Koknat special thanks to David Irving and Kenneth K on PerlMonks Purpose: Search and replace on files using a two-column table file Options: -table file Use this file for all search/replace. The format is as follows: search1 replace1 search2 replace2 search3 replace3 Example: foo BA BA BAH -reverse Use column 2 to search, and column 1 to replace Default is to use column 1 to search, and column 2 to replace -suffix The default output filename is the name of the input file, appende +d with _ Use this option to change the suffix -regex Treat the entries in the table file as Perl regular expressions (^ +,\$,[,],etc) Usage examples: search_replace_table.pl -table tablefile file search_replace_table.pl -table tablefile *.stil search_replace_table.pl -table tablefile file -suffix _fixed "; exit; } # Process options my $debug = 0; my $opt_reverse = 0; my $opt_tablefile = ""; my $opt_suffix = "_"; my $opt_regex = 0; Getopt::Long::GetOptions( "debug" => \$debug, "reverse" => \$opt_reverse, "table=s" => \$opt_tablefile, "suffix=s" => \$opt_suffix, "regex" => \$opt_regex, ) or exit; # Banner print "\n\n"; print " ###############################\n"; print " ### search_replace_table.pl ###\n"; print " ###############################\n\n"; # Read table file my $table_ref; if ( $opt_tablefile ) { $table_ref = read_table_file($opt_tablefile,$opt_reverse) } else { print "ERROR: You need to specify a table file with the -table op +tion\n"; print "Exiting.\n\n"; exit 1; } if ( $debug ) { print "\nDEBUG: Dumping \$table_ref:\n"; Dumpvalue->new->dumpValue(\$table_ref); } # Sort the keys early to save time my @keys_ordered = reverse sort { length $a <=> length $b } keys %{$ta +ble_ref}; if ( $debug ) { print "\nDEBUG: \@keys_ordered = @keys_ordered\n" } # Loop through input files while ( @ARGV ) { # $input_file and $output_filenames my $input_file = shift @ARGV; my $output_file = newFilename($input_file . $opt_suffix); if ( $debug ) { print "\nDEBUG: \$output_file = $output_file\n" } # Open input and output files open( "INFILE", "< $input_file" ) or die "ERROR: Cannot open input + file: $input_file\n\n"; print "Reading from input file: $input_file\n"; open( "OUTFILE", "> $output_file" ) or die "ERROR: Cannot open out +put file: $output_file\n\n"; print "Writing to output file: $output_file\n"; # Loop through input file my $replacecount = 0; LINE: while ( my $line = <INFILE> ) { if ($debug>=2) { print "DEBUG: $. $line"; } my $old_line = $line; my $new_line = $line; my $old_count = 0; my $new_count = 0; KEY: for my $key ( @keys_ordered ) { if ($debug) { print "DEBUG: \$key = $key\n" } my $regex = ( $opt_regex ? $key : quotemeta($key) ); if ( $debug ) { print "DEBUG: \$regex = $regex\n" } if ( $debug ) { print "DEBUG: \$old_line = $old_line" } my $backtrack_line = $new_line; my $tmp_line = $old_line; if ( $old_count = ( $tmp_line =~ s/$regex/$table_ref->{$ke +y}/g ) ) { if ( $debug ) { print "DEBUG: \$old_count = $old_coun +t\n" } $new_count = ( $new_line =~ s/$regex/$table_ref->{$key +}/g ); if ( $debug ) { print "DEBUG: \$new_count = $new_coun +t\n" } if ( $debug ) { print "DEBUG: \$new_line = $new_line" + } if ( $new_count != $old_count ) { if ($debug) { print "DEBUG: Match count failure f +or key $key on ($old_count/$new_count) on line $.: $old_line" } $new_count = $old_count; $new_line = $backtrack_line; next KEY; } if ( $debug ) { print "DEBUG: line $. replacing $key +with $table_ref->{$key}\n" } $replacecount += $new_count; } else { } # do nothing } print OUTFILE $new_line; next LINE; } print "Made $replacecount replacements in $output_file\n"; } print "\nDone.\n\n"; ########################### ### END OF MAIN PROGRAM ### ########################### # $ = read_table_file($filename,$opt_reverse) # Reads 2-column table mapping file, returns hash reference sub read_table_file { my $reverse = 0; my $table_file = $_[0]; if ( defined $_[1] and $_[1] == 1 ) { $reverse = 1; } my %map; open( "MAPFILE", "< $table_file" ) or die "ERROR: Cannot open fil +e for reading: $table_file\n\n"; print "Reading from table file $table_file\n"; while ( my $line = <MAPFILE> ) { # blank line or # comment if ( $line =~ /^\s*$/ or $line =~ /^\s*#/ ) { # do nothing } # search_term replace_term # xtals_in XTAL_SSIN elsif ( $line =~ /^\s*(\S+)\s+(\S+)/ ) { my ($key,$value) = ($1,$2); if ( $reverse ) { if ( defined $map{$value} ) { print "ERROR: Table file $table_file contains 2 e +ntries for $value in the right column!\n"; print " 1st entry was: $map{$value} $valu +e\n"; print " 2nd entry was: $key $value on li +ne $.\n"; print "Exiting.\n\n"; exit 1; } else { $map{$value} = $key; } } else { if ( defined $map{$key} ) { print "ERROR: Table file $table_file contains 2 e +ntries for $key in the left column!\n"; print " 1st entry was: $key $map{$key}\n" +; print " 2nd entry was: $key $value on li +ne $.\n"; print "Exiting.\n\n"; exit 1; } else { $map{$key} = $value; } } } # Unknown else { print "ERROR: Did not understand line in config file: $l +ine"; exit 1; } } return \%map; } # Return a valid filename that doesn't already exist # $ = newFilename($filename) sub newFilename { my $filename = $_[0]; while ( -e $filename ) { $filename .= "_"; } return $filename; }

        - Chris Koknat
Re: Algorithm to search and replace on data file, based on a table file?
by dirving (Friar) on Sep 24, 2009 at 05:26 UTC

    Another way to solve your problem would be to search and replace all the patterns simultaneously in one regular expression instead of doing them sequentially. This will also give you a fairly substantial performance boost, since instead of invoking the regular expression engine 600 times for every line you just do it once per line.

    Since it sounds like your patterns are just strings to match against, you can probably get away with joining them all together separated by |. Just note that this is pretty inefficient pre-Perl 5.10, although it is probably faster than what you have now since it moves the loop into the regular expression engine's C code. If you haven't moved to 5.10 yet, look into Regexp::Assemble to create the regular expression.

    Your code would then look something like this:

    # In your code you are sorting on the length of the replacement string +s # instead of the search strings. I'm guessing that's not what you wan +t, # so I switched it to sort on the search patterns. my @keys_ordered = sort { length $b <=> length $a } keys %$table_ref; # Join your strings into one big long RE my $re_string = join '|', map( qr/\Q$_\E/, @keys_ordered ); my $re = qr/($re_string)/; my $replacecount = 0; while ( my $line = <INFILE> ) { # The inner loop is gone $replacecount += ( $line =~ s/$re/$table_ref->{$1}/g ); print OUTFILE $line; } print "Made $replacecount replacements.\n";
    -- David Irving

      I had the same idea: Grinder's RegExp::Assemble. I can only approve David's suggestions.

      Our workplace is standardized on Perl 5.8.8
      What is the new feature of 5.10 that enables your technique?
      Which line of your code?

      - Chris

        The technique will still work on Perl 5.8, the only difference is the performance. In Perl 5.8 if you have a regular expression like foo|bar|baz it tests for each string one a time. In Perl 5.10 there is an optimization that builds a data structure called a trie, which lets it match against all the strings at the same time. So in Perl 5.8 the time taken is proportional to the number of search strings in your table -- in Perl 5.10 the time taken doesn't depend on the size of the table.

        As I mentioned in my post, you can use Regexp::Assemble to get performance comparable to Perl 5.10's optimization in Perl 5.8 -- it takes care of re-writing the regexp in a more efficient way. Even if you don't use Regexp::Assemble, even on Perl 5.8 this technique will probably be faster than what you have (and almost definitely no slower). You just won't see the order-of-magnitude speedups that you could probably get by using either 5.10 or Regexp::Assemble.

        -- David Irving