Search and Replace.

Doniv has asked for the wisdom of the Perl Monks concerning the following question:

Guys,

I have gone through many other forums and have been directed to this place. I hope somebody will help me on this question.

Cheers

I have split my question into 2 for easier reading.

Part 1

I am using the piece of code below

perl -pi -e 's/[ ]*\~\t\~[ ]*/~/g' a.dat
perl -pi -e 's/^[ ]*//g' a.dat
perl -pi -e 's/\:000[A,P]M//g' a.dat
perl -pi -e 's/99991231//g' a.dat
perl -pi -e 's/Jan 1 1900 12:00:00//g' a.dat

instead of

while (<IN_FILE>)
{
s/^[ ]*//g;
s/[ ]*$//g;
s/\:000[A,P]M//g;
s/99991231//g;
s/Jan 1 1900 12:00:00//g;
s/[ ]*\~\t\~[ ]*/~/g;
}

The data file is huge.
Both of these take almost the same time (25 min). Is there a way I can reduce this time?

Part 2

Where a date field contains SUN, I want to replace it with a blank, and I am using the code below to do that.

Basically I want to optimise the piece of code below, which includes Part 1. Altogether it takes 35 min.

Appreciate a reply from you guys.

Thanks

while (<IN_FILE>)
{
s/^[ ]*//g;
s/[ ]*$//g;
s/\:000[A,P]M//g;
s/99991231//g;
s/Jan 1 1900 12:00:00//g;
s/[ ]*\~\t\~[ ]*/~/g;
@INPUT_FIELDS = split /~/, $_;
for ($i=0; $i<=$#PROC_FIELDS; $i++) {
$MY_INDEX = $PROC_FIELDS[$i] - 1;
if ( $INPUT_FIELDS[$MY_INDEX] =~/SUN/ ) {
$INPUT_FIELDS[$MY_INDEX] = "";
}
}
$MYLINE = join "~", @INPUT_FIELDS;

print OUT_FILE $MYLINE;
}
close(IN_FILE);
close(OUT_FILE);


Replies are listed 'Best First'.
Re: Search and Replace. by Roy Johnson (Monsignor) on Jun 08, 2005 at 12:54 UTC
It may help to combine the non-anchored regexes into one alternation: `s/\:000[A,P]M|99991231|Jan 1 1900 12:00:00//g;` If, before the while loop, you computed `my @input_field_indexes = map($_-1, @PROC_FIELDS);` then your for loop could be written `for my $field (@INPUT_FIELDS[ @input_field_indexes ]) { $field = '' if $field =~ /SUN/; }` and you'd have eliminated a lot of repetitive calculations, since @PROC_FIELDS doesn't change inside the while.

Caution: Contents may have been coded under pressure.
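The slice-aliasing idea in this reply can be tried in isolation. The following is only a sketch: the field positions in `@PROC_FIELDS`, the sample line, and the helper name `blank_sun_fields` are made up for illustration, and it assumes every line has at least as many fields as the largest index.

```perl
use strict;
use warnings;

# Hypothetical configuration: 1-based positions of the date fields.
my @PROC_FIELDS = (2, 4);

# Computed once, outside the per-line loop.
my @input_field_indexes = map { $_ - 1 } @PROC_FIELDS;

# Hypothetical helper: blank every selected field that contains SUN.
sub blank_sun_fields {
    my ($line) = @_;
    my @fields = split /~/, $line, -1;   # -1 keeps trailing empty fields

    # Each $field is an alias into @fields, so assigning to it edits the array.
    for my $field (@fields[@input_field_indexes]) {
        $field = '' if defined $field && $field =~ /SUN/;
    }
    return join '~', @fields;
}

print blank_sun_fields('id1~SUN JUN 5~abc~MON JUN 6'), "\n";   # prints id1~~abc~MON JUN 6
```

Because the loop variable aliases the slice elements, no index arithmetic is needed inside the per-line loop.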
Re: Search and Replace. by ikegami (Patriarch) on Jun 08, 2005 at 14:24 UTC
Change `s/^[ ]*//g;` `s/[ ]*$//g;` to `s/^[ ]+//;` `s/[ ]+$//;` to avoid useless substitutions. I also removed the /g since there's only one start of line and one end of line per line (unless /m is also used).

Are you sure you want `[A,P]M` (which matches "AM", ",M" and "PM") and not `[AP]M`?

The backslash in `\:000` is useless, but that won't affect anything (except readability). The backslashes before the `~` characters are useless too, and likewise only hurt readability.

If the file isn't too big to load into memory, you could try doing that.

    local $_;

    # Read entire file into $_.
    { local $/; $_ = <IN_FILE>; }

    # Execute substitutions once each.
    s/^[ ]+//mg;
    s/[ ]+$//mg;
    s/ :000[AP]M | 99991231 | Jan\ 1\ 1900\ 12:00:00 //xg;
    s/[ ]*~\t~[ ]*/~/g;

    # Output entire file.
    print;
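The point about `[A,P]M` is easy to check with a few made-up timestamps. This is a minimal sketch; the sample strings and the helper name `strip_millis` are assumptions for illustration, not taken from the real data.

```perl
use strict;
use warnings;

# Inside a character class a comma is just another literal character,
# so [A,P] means "A", "," or "P".
my $loose  = qr/:000[A,P]M/;   # the pattern from the question
my $strict = qr/:000[AP]M/;    # the narrower pattern suggested above

# Hypothetical helper: return a copy of $text with every match removed.
sub strip_millis {
    my ($pattern, $text) = @_;
    (my $copy = $text) =~ s/$pattern//g;
    return $copy;
}

# Made-up sample values; the third one only matches the loose pattern.
for my $sample ('12:00:00:000AM', '12:00:00:000PM', '12:00:00:000,M') {
    printf "%-16s loose=%-16s strict=%s\n",
        $sample, strip_millis($loose, $sample), strip_millis($strict, $sample);
}
```

If ",M" can never occur in the data, the two patterns behave the same; the narrower one just states the intent more clearly.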
Re: Search and Replace. by Random_Walk (Prior) on Jun 08, 2005 at 12:42 UTC
Hi Doniv, do you have a few lines of your input file, so we can see how often these substitutions are done? If you know that a line containing one of the substitutions will certainly not contain another, you may be able to skip some pattern matching. Cheers, R. Pereant, qui ante nos nostra dixerunt!
Re: Search and Replace. by radiantmatrix (Parson) on Jun 08, 2005 at 14:17 UTC
Your while loop makes more sense, so I'm going to build from there. Look at your list of substitutions: `s/^[ ]*//g; s/[ ]*$//g; s/\:000[A,P]M//g; s/99991231//g; s/Jan 1 1900 12:00:00//g; s/[ ]*\~\t\~[ ]*/~/g;`

I see that all but the last one are removing text, so why not combine them into one pass? Also, your first two appear to be designed to trim leading and trailing spaces from lines: there's a pretty standard way to do this (`s/^\s+|\s+$//g`), so I took the liberty of using that structure below. My trailing 'x' causes whitespace to be ignored, so I've replaced whitespace in matches with '\s'; you may want `\x20` instead, I don't know.

<update>Modified code with ikegami's suggestions from below. I chose '\ ' as the method to escape a space.</update>

    while ( <IN_FILE> ) {
        s{
            (?:^\ +|\ +$)
            |(?:\:000[A,P]M)
            |(?:99991231)
            |(?:Jan\ 1\ 1900\ 12:00:00)
        }{}gx;
        s/[ ]*\~\t\~[ ]*/~/g;
        print OUT_FILE $_;
    }
    close IN_FILE;
    ## and unlink() the filename for IN_FILE
    ## then rename() outfile to infile.

This should reduce your exec time a little bit, as it makes two regex passes instead of six. However, I suspect the slowest thing going is really disk IO (it usually is, with file operations). Doing the "write to another file then rename" has typically been faster than in-place editing, for me.

If you're reading your file over a network... well, don't -- make a local copy, process it, and pass it back to the network location. That will nearly always be much faster than streaming IO over a network.

Yoda would agree with Perl design: there is no `try{}`
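The "write to another file, then rename" step can be sketched as follows. The helper name `rewrite_file`, the `.tmp` suffix, and the sample data are assumptions for illustration. The sketch relies on `rename` replacing an existing file, which holds on POSIX systems; elsewhere an explicit `unlink` of the original first, as described above, may be needed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: stream $path through $edit into "$path.tmp",
# then rename the temporary file over the original.
sub rewrite_file {
    my ($path, $edit) = @_;
    my $tmp = "$path.tmp";

    open my $in,  '<', $path or die "Cannot read $path: $!";
    open my $out, '>', $tmp  or die "Cannot write $tmp: $!";
    local $_;
    while (<$in>) {
        $edit->();               # the callback modifies $_ in place
        print {$out} $_;
    }
    close $in;
    close $out or die "Cannot close $tmp: $!";

    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return;
}

# Demo on a throwaway file with one made-up record.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/a.dat";
open my $fh, '>', $file or die "Cannot create $file: $!";
print {$fh} "  A ~\t~ 99991231 ~\t~ B\n";
close $fh or die "Cannot close $file: $!";

rewrite_file($file, sub {
    s/^[ ]+//;
    s/99991231//g;
    s/[ ]*~\t~[ ]*/~/g;
});

open $fh, '<', $file or die "Cannot reopen $file: $!";
my $result = <$fh>;
close $fh;
print $result;   # prints A~~B
```

Keeping the temporary file in the same directory as the original means the rename stays on one filesystem.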
Re^2: Search and Replace. by ikegami (Patriarch) on Jun 08, 2005 at 14:31 UTC
Don't replace spaces with \s. \s would be slower since it matches spaces, tabs, carriage returns (I think) and line feeds, and even more if it's a Unicode string. "`\x20`", "`\040`", "`\ `", and "`[ ]`" work. (I wonder if the last is slower than the others. I'll Benchmark later.) You've also added useless captures. Don't use `(...)` (which incurs a speed penalty), use `(?:...)`.
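One way to run the comparison mentioned here is the core Benchmark module. This is only a sketch: the test line is made up, the iteration count is arbitrary, and the relative results will depend on the Perl version and the data, so no timings are claimed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Each variant strips leading blanks from a copy of its argument.
my %strip = (
    class_s   => sub { my $c = shift; $c =~ s/^\s+//;   return $c },
    hex_x20   => sub { my $c = shift; $c =~ s/^\x20+//; return $c },
    octal_040 => sub { my $c = shift; $c =~ s/^\040+//; return $c },
    escaped   => sub { my $c = shift; $c =~ s/^\ +//;   return $c },
    bracketed => sub { my $c = shift; $c =~ s/^[ ]+//;  return $c },
);

# Made-up input line with plenty of leading spaces.
my $line = (' ' x 40) . "Jan 1 1900 12:00:00 ~\t~ 99991231\n";

my %tests;
for my $name (keys %strip) {
    $tests{$name} = sub { $strip{$name}->($line) };
}

# Run each variant a fixed number of times and print a comparison table.
cmpthese(500_000, \%tests);
```

Note that `\s` is not just slower or faster; it also strips tabs and newlines, so it is only interchangeable with the others if the data never starts with those.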
Re: Search and Replace. by QM (Parson) on Jun 08, 2005 at 15:38 UTC
To see if it's file IO that's slowing you down, compare the runtime of this snippet with your original: `while (<IN_FILE>) { print OUT_FILE $_; }` If there's no significant difference, then you're optimizing in the wrong place :( -QM -- Quantum Mechanics: The dreams stuff is made of
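A rough way to make that comparison is to time a copy-only pass against a pass that also runs the substitutions. The sketch below generates its own throwaway sample file; the sample record, the line count, and the helper name `timed_pass` are assumptions, and in practice you would point it at the real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

# Build a throwaway sample file from one made-up record.
my ($fh, $sample) = tempfile(UNLINK => 1);
print {$fh} "  Jan 1 1900 12:00:00 ~\t~ 99991231 ~\t~ 10:15:00:000AM  \n"
    for 1 .. 100_000;
close $fh or die "Cannot close $sample: $!";

# Hypothetical helper: copy $path to a scratch file, optionally running
# $edit on each line, and return (elapsed seconds, lines copied).
sub timed_pass {
    my ($path, $edit) = @_;
    my ($out, $out_path) = tempfile(UNLINK => 1);
    open my $in, '<', $path or die "Cannot read $path: $!";

    my $start = time();
    my $lines = 0;
    local $_;
    while (<$in>) {
        $edit->() if $edit;
        print {$out} $_;
        $lines++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return (time() - $start, $lines);
}

my $substitutions = sub {
    s/^[ ]+//;
    s/:000[AP]M|99991231|Jan 1 1900 12:00:00//g;
    s/[ ]*~\t~[ ]*/~/g;
    s/[ ]+$//;
};

my ($copy_time, $copy_lines) = timed_pass($sample);
my ($edit_time, $edit_lines) = timed_pass($sample, $substitutions);
printf "copy only:      %.3fs for %d lines\n", $copy_time, $copy_lines;
printf "with s///:      %.3fs for %d lines\n", $edit_time, $edit_lines;
```

If the two figures are close on the real file, the regexes are not where the time goes.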
Re: Search and Replace. by hv (Prior) on Jun 08, 2005 at 15:48 UTC
In part 1, the only specifically slow substitution will be the one anchored to end of string, `s/[ ]*$//g`. Since the following substitutions all shorten the string you'd get a small benefit from doing this last, so combining with the other responses I'd suggest:

    while (<IN_FILE>) {
        s/^\ +//;
        s/:000[AP]M|99991231|Jan 1 1900 12:00:00//g;
        s/\ *~\t~\ */~/g;
        s/\ +$//;
    }

For the second part, you can save small amounts: substitute trailing space on the last field after splitting, instead of on the whole line; precalculate the values for `$PROC_FIELDS[$i] - 1`; avoid the `$i` loop index; avoid the block in the loop using the modifier form:

    my @PROC_INDEXES = map $_ - 1, @PROC_FIELDS;
    while (<IN_FILE>) {
        s/^\ +//;
        s/:000[AP]M|99991231|Jan 1 1900 12:00:00//g;
        s/\ *~\t~\ */~/g;
        my @INPUT_FIELDS = split /~/, $_;
        $INPUT_FIELDS[$#INPUT_FIELDS] =~ s/\ +$//;
        /SUN/ && ($_ = '') for @INPUT_FIELDS[@PROC_INDEXES];
        print OUT_FILE join '~', @INPUT_FIELDS;
    }
    close IN_FILE;
    close OUT_FILE;

In any case, though, all of these savings are quite minor: I suspect the majority of the time is being taken up with unavoidable I/O. Hugo
Re: Search and Replace. by TedPride (Priest) on Jun 08, 2005 at 16:52 UTC
You can probably speed up file I/O a bit by reading x number of bytes plus the rest of the current line, then splitting on line endings and running through the lines in array form. This allows you to scale memory use to whatever your system can handle, unlike line-by-line reading or reading the entire file.
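The chunked read described here might look like the sketch below. The helper name `process_in_chunks`, the 32-byte chunk size, and the in-memory sample data are assumptions for illustration; a real run would use a much larger chunk size on a real filehandle, and whether it beats plain buffered line-by-line reading would need to be measured.

```perl
use strict;
use warnings;

# Hypothetical helper: read roughly $chunk_size bytes at a time, extend each
# chunk to the end of the current line, and hand the complete lines to a callback.
sub process_in_chunks {
    my ($in, $chunk_size, $handle_lines) = @_;

    while (read($in, my $buffer, $chunk_size)) {
        # The fixed-size read usually stops mid-line, so finish that line.
        my $rest = <$in>;
        $buffer .= $rest if defined $rest;

        # Splitting on /^/ breaks the buffer into lines and keeps the newlines.
        my @lines = split /^/, $buffer;
        $handle_lines->(\@lines);
    }
    return;
}

# Demo on an in-memory filehandle with made-up records.
my $data = join '', map { "line $_ ~\t~ value\n" } 1 .. 10;
open my $in, '<', \$data or die "Cannot open in-memory handle: $!";

my @seen;
process_in_chunks($in, 32, sub { push @seen, @{ $_[0] } });
close $in;

print scalar(@seen), " lines read\n";   # prints 10 lines read
```

Both `read` and `<$in>` go through Perl's buffered I/O layer, which is why they can be mixed on the same handle here; `sysread` would not combine safely with `<$in>`.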
Re^2: Search and Replace. by QM (Parson) on Jun 08, 2005 at 19:49 UTC
I'm no expert, but file I/O should be buffered so this doesn't matter. I would think it's only an issue if the line lengths are on the order of the buffer size or larger. For this application, it sounds like the entire line is needed anyway, so there's still little difference. (Maybe some gracious monk will correct me if I'm wrong.) -QM -- Quantum Mechanics: The dreams stuff is made of
Re: Search and Replace. by TheStudent (Scribe) on Jun 08, 2005 at 11:59 UTC
Please use <code> and </code> tags around your code to make it readable. As it currently is I doubt that you will get much (if any) help. TheStudent