Algorithm to search and replace on data file, based on a table file?

koknat has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Algorithm to search and replace on data file, based on a table file? by MidLifeXis (Monsignor) on Sep 23, 2009 at 17:03 UTC
I cannot limit each line to just one substitution. I need to make multiple substitutions on some lines of my data file. and How can I prevent this from happening? are at odds with each other. Can you define in what cases you want the data to be manipulated more than once, or those cases where it should be left alone? --MidLifeXis Please consider supporting my wife as she walks in the 2009 Alzheimer's Walk.	[reply]
Re^2: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Sep 23, 2009 at 17:35 UTC
I'll clarify: I want to allow multiple substitutions on each line. I don't want the algorithm to make substitutions on data which has already been changed. - Chris Koknat	[reply]
Re^3: Algorithm to search and replace on data file, based on a table file? by MidLifeXis (Monsignor) on Sep 23, 2009 at 17:44 UTC
Now this is an interesting problem. One solution that I could think of right off the bat is to add a marker to the beginning and ending of the replacement string before applying it, and then not allowing the search string to operate in the area bounded by an open/close marker. Remove the markers before emitting the output. If you do follow this approach, watch out for greedy matching. By your definition, you should never have nested "this is changed" markers, so a minimal match on the search string should suffice. --MidLifeXis Please consider supporting my wife as she walks in the 2009 Alzheimer's Walk.	[reply]
Re^3: Algorithm to search and replace on data file, based on a table file? by NiJo (Friar) on Sep 23, 2009 at 18:17 UTC
Your spec still has undefined behaviour regarding line ordering. What should happen with `foo bar foo baz` [download] ?	[reply] [d/l]
Re^4: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Sep 23, 2009 at 18:59 UTC
Re: Algorithm to search and replace on data file, based on a table file? by kennethk (Abbot) on Sep 23, 2009 at 17:11 UTC
Question: do any of your search strings contain regex control characters (`.`, `(`, `[` ...)? If so, you need to modify your regular expression line to `$replacecount += ( $line =~ s/\Q$key\E/$table_ref->{$key}/g );` to avoid unintended consequences. It's probably a good idea any, just in case. See String interpolation to see a bit more about how interpolation works in Perl. Regarding the main part of your question, I read your issue as "I want all substitutions to only occur if they would have occurred on the original file". Might I suggest, then caching the original line and testing if you would have a match before substituting? You could put an error statement in there as well to make sure the number of substitutions matched the number of expected substitutions. Something like: # I have already read the 2-column table file into %table. # I've also already opened INFILE and OUTFILE my @keys_ordered = sort { length $table_ref->{$b} <=> length $table_ref->{$a} } keys %{$table_ref}; # Sorted in reverse order of length, so that longer strings # in the data file are found before shorter strings # Done here to increase speed, vs inside the while loop my $replacecount = 0; while ( my $old_line = <INFILE> ) { my $new_line = $old_line; for my $key ( @keys_ordered ) { my $old_count = my $new_count = 0; if ($old_count = ($old_line =~ /\Q$key\E/g)) { $new_count = ( $new_line =~ s/\Q$key\E/$table_ref->{$key}/ +g ); } if ($new_count != $old_count) { die "Match count failure ($old_count/$new_count) on $line" } $replacecount += $new_count; } print OUTFILE $new_line; } print "Made $replacecount replacements.\n"; [download]	[reply] [d/l] [select]
Re^2: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Sep 23, 2009 at 18:57 UTC
I like your idea of an option for regular expressions. I've used that concept below. Here is the updated code, using your idea of testing for a match before substituting. It seems to do the trick. my $replacecount = 0; my %linenums; my @keys_ordered = reverse sort { length $table_ref->{$a} <=> length $ +table_ref->{$b} } keys %{$table_ref}; LINE: while ( my $old_line = <INFILE> ) { my $new_line = $old_line; my $old_count = 0; my $new_count = 0; KEY: for my $key ( @keys_ordered ) { #if ($debug) { print "DEBUG: \$key = $key\n" } my $backtrack_line = $new_line; my $regex = ( $opt_regex ? $key : quotemeta($key) ); if ( $old_count = ( $old_line =~ /$regex/g) ) { $new_count = ( $new_line =~ s/$regex/$table_ref->{$key}/g +); if ( $new_count != $old_count ) { if ($debug) { print "DEBUG: Match count failure for k +ey $key on ($old_count/$new_count) on line $.: $old_line" } $new_count = $old_count; $new_line = $backtrack_line; next KEY; } if ( $debug ) { print "DEBUG: line $. replacing $key with $table_ref- +>{$key}\n"; $linenums{$.}++; } $replacecount += $new_count; } else { } # do nothing } print OUTFILE $new_line; next LINE; } if ( $debug ) { for my $linenum ( sort keys %linenums ) { if ( $linenums{$linenum} > 1 ) { print "DEBUG: WARNING: line $linenum had $linenums{$line +num} replacements!\n"; } } } print "Made $replacecount replacements in $output_file\n"; print "Done.\n\n"; [download] Thanks! - Chris Koknat	[reply] [d/l]
Re^3: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Oct 02, 2009 at 23:10 UTC
Here is the latest copy of the code, in its entirety. It can handle multiple input files, gives usage information, and has several options. I kept the regex searching simple, so that it would work on any machine without additional libraries. #!/usr/bin/env perl use warnings; use strict; use Getopt::Long; # process command line options use Dumpvalue; # used for debugging hash data structures &usage if ( @ARGV < 3 ); sub usage { print "\n search_replace_table.pl by Chris Koknat special thanks to David Irving and Kenneth K on PerlMonks Purpose: Search and replace on files using a two-column table file Options: -table file Use this file for all search/replace. The format is as follows: search1 replace1 search2 replace2 search3 replace3 Example: foo BA BA BAH -reverse Use column 2 to search, and column 1 to replace Default is to use column 1 to search, and column 2 to replace -suffix The default output filename is the name of the input file, appende +d with _ Use this option to change the suffix -regex Treat the entries in the table file as Perl regular expressions (^ +,\$,[,],etc) Usage examples: search_replace_table.pl -table tablefile file search_replace_table.pl -table tablefile .stil search_replace_table.pl -table tablefile file -suffix _fixed "; exit; } # Process options my $debug = 0; my $opt_reverse = 0; my $opt_tablefile = ""; my $opt_suffix = "_"; my $opt_regex = 0; Getopt::Long::GetOptions( "debug" => \$debug, "reverse" => \$opt_reverse, "table=s" => \$opt_tablefile, "suffix=s" => \$opt_suffix, "regex" => \$opt_regex, ) or exit; # Banner print "\n\n"; print " ###############################\n"; print " ### search_replace_table.pl ###\n"; print " ###############################\n\n"; # Read table file my $table_ref; if ( $opt_tablefile ) { $table_ref = read_table_file($opt_tablefile,$opt_reverse) } else { print "ERROR: You need to specify a table file with the -table op +tion\n"; print "Exiting.\n\n"; exit 1; } if ( $debug ) { print "\nDEBUG: Dumping \$table_ref:\n"; Dumpvalue->new->dumpValue(\$table_ref); } # Sort the keys early to save time my @keys_ordered = reverse sort { length $a <=> length $b } keys %{$ta +ble_ref}; if ( $debug ) { print "\nDEBUG: \@keys_ordered = @keys_ordered\n" } # Loop through input files while ( @ARGV ) { # $input_file and $output_filenames my $input_file = shift @ARGV; my $output_file = newFilename($input_file . $opt_suffix); if ( $debug ) { print "\nDEBUG: \$output_file = $output_file\n" } # Open input and output files open( "INFILE", "< $input_file" ) or die "ERROR: Cannot open input + file: $input_file\n\n"; print "Reading from input file: $input_file\n"; open( "OUTFILE", "> $output_file" ) or die "ERROR: Cannot open out +put file: $output_file\n\n"; print "Writing to output file: $output_file\n"; # Loop through input file my $replacecount = 0; LINE: while ( my $line = <INFILE> ) { if ($debug>=2) { print "DEBUG: $. $line"; } my $old_line = $line; my $new_line = $line; my $old_count = 0; my $new_count = 0; KEY: for my $key ( @keys_ordered ) { if ($debug) { print "DEBUG: \$key = $key\n" } my $regex = ( $opt_regex ? $key : quotemeta($key) ); if ( $debug ) { print "DEBUG: \$regex = $regex\n" } if ( $debug ) { print "DEBUG: \$old_line = $old_line" } my $backtrack_line = $new_line; my $tmp_line = $old_line; if ( $old_count = ( $tmp_line =~ s/$regex/$table_ref->{$ke +y}/g ) ) { if ( $debug ) { print "DEBUG: \$old_count = $old_coun +t\n" } $new_count = ( $new_line =~ s/$regex/$table_ref->{$key +}/g ); if ( $debug ) { print "DEBUG: \$new_count = $new_coun +t\n" } if ( $debug ) { print "DEBUG: \$new_line = $new_line" + } if ( $new_count != $old_count ) { if ($debug) { print "DEBUG: Match count failure f +or key $key on ($old_count/$new_count) on line $.: $old_line" } $new_count = $old_count; $new_line = $backtrack_line; next KEY; } if ( $debug ) { print "DEBUG: line $. replacing $key +with $table_ref->{$key}\n" } $replacecount += $new_count; } else { } # do nothing } print OUTFILE $new_line; next LINE; } print "Made $replacecount replacements in $output_file\n"; } print "\nDone.\n\n"; ########################### ### END OF MAIN PROGRAM ### ########################### # $ = read_table_file($filename,$opt_reverse) # Reads 2-column table mapping file, returns hash reference sub read_table_file { my $reverse = 0; my $table_file = $_[0]; if ( defined $_[1] and $_[1] == 1 ) { $reverse = 1; } my %map; open( "MAPFILE", "< $table_file" ) or die "ERROR: Cannot open fil +e for reading: $table_file\n\n"; print "Reading from table file $table_file\n"; while ( my $line = <MAPFILE> ) { # blank line or # comment if ( $line =~ /^\s$/ or $line =~ /^\s#/ ) { # do nothing } # search_term replace_term # xtals_in XTAL_SSIN elsif ( $line =~ /^\s(\S+)\s+(\S+)/ ) { my ($key,$value) = ($1,$2); if ( $reverse ) { if ( defined $map{$value} ) { print "ERROR: Table file $table_file contains 2 e +ntries for $value in the right column!\n"; print " 1st entry was: $map{$value} $valu +e\n"; print " 2nd entry was: $key $value on li +ne $.\n"; print "Exiting.\n\n"; exit 1; } else { $map{$value} = $key; } } else { if ( defined $map{$key} ) { print "ERROR: Table file $table_file contains 2 e +ntries for $key in the left column!\n"; print " 1st entry was: $key $map{$key}\n" +; print " 2nd entry was: $key $value on li +ne $.\n"; print "Exiting.\n\n"; exit 1; } else { $map{$key} = $value; } } } # Unknown else { print "ERROR: Did not understand line in config file: $l +ine"; exit 1; } } return \%map; } # Return a valid filename that doesn't already exist # $ = newFilename($filename) sub newFilename { my $filename = $_[0]; while ( -e $filename ) { $filename .= "_"; } return $filename; } [download] - Chris Koknat	[reply] [d/l]
Re^4: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Oct 06, 2009 at 18:21 UTC
Re: Algorithm to search and replace on data file, based on a table file? by dirving (Friar) on Sep 24, 2009 at 05:26 UTC
Another way to solve your problem would be to search and replace all the patterns simultaneously in one regular expression instead of doing them sequentially. This will also give you a fairly substantial performance boost, since instead of invoking the regular expression engine 600 times for every line you just do it once per line. Since it sounds like your patterns are just strings to match against, you can probably get away with joining them all together separated by `\|`. Just note that this is pretty inefficient pre-Perl 5.10, although it is probably faster than what you have now since it moves the loop into the regular expression engine's C code. If you haven't moved to 5.10 yet, look into Regexp::Assemble to create the regular expression. Your code would then look something like this: # In your code you are sorting on the length of the replacement string +s # instead of the search strings. I'm guessing that's not what you wan +t, # so I switched it to sort on the search patterns. my @keys_ordered = sort { length $b <=> length $a } keys %$table_ref; # Join your strings into one big long RE my $re_string = join '\|', map( qr/\Q$_\E/, @keys_ordered ); my $re = qr/($re_string)/; my $replacecount = 0; while ( my $line = <INFILE> ) { # The inner loop is gone $replacecount += ( $line =~ s/$re/$table_ref->{$1}/g ); print OUTFILE $line; } print "Made $replacecount replacements.\n"; [download] -- David Irving	[reply] [d/l] [select]
Re^2: Algorithm to search and replace on data file, based on a table file? by dolmen (Beadle) on Sep 24, 2009 at 08:46 UTC
I had the same idea: Grinder's RegExp::Assemble. I can only approve David's suggestions.	[reply]
Re^2: Algorithm to search and replace on data file, based on a table file? by koknat (Sexton) on Sep 24, 2009 at 21:31 UTC
Our workplace is standardized on Perl 5.8.8 What is the new feature of 5.10 that enables your technique? Which line of your code? - Chris	[reply]
Re^3: Algorithm to search and replace on data file, based on a table file? by dirving (Friar) on Sep 25, 2009 at 02:19 UTC
The technique will still work on Perl 5.8, the only difference is the performance. In Perl 5.8 if you have a regular expression like `foo\|bar\|baz` it tests for each string one a time. In Perl 5.10 there is an optimization that builds a data structure called a trie, which lets it match against all the strings at the same time. So in Perl 5.8 the time taken is proportional to the number of search strings in your table -- in Perl 5.10 the time taken doesn't depend on the size of the table. As I mentioned in my post, you can use Regexp::Assemble to get performance comparable to Perl 5.10's optimization in Perl 5.8 -- it takes care of re-writing the regexp in a more efficient way. Even if you don't use Regexp::Assemble, even on Perl 5.8 this technique will probably be faster than what you have (and almost definitely no slower). You just won't see the order-of-magnitude speedups that you could probably get by using either 5.10 or Regexp::Assemble. -- David Irving	[reply] [d/l]