Hi Experts,
I have a set of ~1GB ASCII files.
In each file, I'd like to search and replace, based on an input table file.
The table file has two columns: one for search_string, the other for replace_string.
Example of table file entry:
foo bar
I already coded it, but would like some advice.
This code takes 70 minutes on a 1GB ASCII file, with a table file of 600 key/value pairs, running on an HP XW4100 workstation.
Here is the code:
# I have already read the 2-column table file into the hash referenced by $table_ref.
# I've also already opened INFILE and OUTFILE.
my @keys_ordered = sort { length $b <=> length $a } keys %{$table_ref};
# Sorted by key length, longest first, so that longer strings
# in the data file are found before shorter strings.
# Done here to increase speed, vs inside the while loop.
my $replacecount = 0;
while ( my $line = <INFILE> ) {
    for my $key ( @keys_ordered ) {
        $replacecount += ( $line =~ s/$key/$table_ref->{$key}/g );
    }
    print OUTFILE $line;
}
print "Made $replacecount replacements.\n";
This works just fine on a simple set of input data, but I'm worried that this algorithm will have unintended consequences, since it is iterating through every key, even if a replace has already been done on that part of the line.
Suppose that this table file was used:
foo bar
ba BAH
Suppose my input data file contains the line:
"This algorithm is foo"
I want the output data file to become:
"This algorithm is bar"
Instead, the output data file will look like this:
"This algorithm is BAHr"
This happens because "foo" became "bar", and then on the next pass through the key loop "ba" became "BAH".
How can I prevent this from happening?
I cannot limit each line to just one substitution. I need to make multiple substitutions on some lines of my data file.
Thanks,
- Chris Koknat
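One common way to avoid the re-substitution problem described above is to make a single pass over each line with one combined regex, looking the replacement up in the hash. Already-replaced text is never rescanned, so "foo" becomes "bar" and stays that way. The following is only a minimal sketch under some assumptions: the search strings are treated as literal text rather than regexes, and make_replacer is a hypothetical helper name, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): build one regex that
# matches any key and return a closure that rewrites a line in a single
# pass, returning the number of replacements made.
sub make_replacer {
    my ($table_ref) = @_;

    # Longest keys first, so "foobar" wins over "foo" at the same position.
    my @keys = sort { length $b <=> length $a or $a cmp $b } keys %{$table_ref};

    # With an empty table there is nothing to replace.
    return sub { 0 } unless @keys;

    # quotemeta makes each key match as a literal string (an assumption).
    my $alternation = join '|', map { quotemeta($_) } @keys;
    my $regex       = qr/($alternation)/;

    return sub {
        # $_[0] is an alias for the caller's variable, so the edit is in place.
        my $count = ( $_[0] =~ s/$regex/$table_ref->{$1}/g );
        return $count || 0;
    };
}

my %table   = ( foo => 'bar', ba => 'BAH' );
my $replace = make_replacer( \%table );

my $line  = "This algorithm is foo, ba humbug\n";
my $count = $replace->($line);
print $line;                            # This algorithm is bar, BAH humbug
print "Made $count replacements.\n";    # Made 2 replacements.
```

In the original loop, the closure would be called once per line in place of the inner for loop. Because each line is scanned once instead of 600 times, this may also help with the run time, but that is worth benchmarking on real data rather than taking on faith.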