Efficient method to replace middle lines only when no match

Zu has asked for the wisdom of the Perl Monks concerning the following question:

I have a file whose contents have been slurped into a string. I'm trying to find an efficient solution that will replace content between two matching lines with different content, but only if there isn't a matching line in between.

So if the relevant portion of a proper file looked like this:

field1: valueA
field2: valueB
field3: valueC

then I don't want to make changes. But if it looked like this:

field1: valueA
some
lines
here
field3: valueC

then I want to correct it to look like the output of the first example.

The field names are fixed but the values could be anything. The regex below works, but when field2 exists it's quite slow, likely due to the combination of the greedy .* and the negative look-ahead.

# this is quick
my $no_field2 = "field1: valueA\nsome\nlines\nhere\nfield3: valueC\n";
$no_field2 .= "........................................\n" x 1000;

$no_field2 =~ s/(field1:.*?$)\n(?!^field2:)(.*$)\n(^field3:)
               /$1\nfield2: valueB\n$3
               /msx;

# this is slow
my $has_field2 = "field1: valueA\nfield2: valueB\nfield3: valueC\n";
$has_field2 .= "........................................\n" x 1000;

$has_field2 =~ s/(field1:.*?$)\n(?!^field2:)(.*$)\n(^field3:)
                /$1\nfield2: valueB\n$3
                /msx;

What is the most efficient, least code solution?


Re: Efficient method to replace middle lines only when no match by AnomalousMonk (Archbishop) on Mar 19, 2014 at 02:47 UTC
A line-by-line approach might actually be more efficient/maintainable, but if the file's already slurped, maybe something like this (needs Perl 5.10+ for `\K`, but this could be worked around):

c:\@Work\Perl\monks>perl -wMstrict -le
"use 5.010;
 ;;
 my $s = qq{yada\n}
       . qq{field1: valueA\n}
       . qq{some\n}
       . qq{lines\n}
       . qq{here\n}
       . qq{field3: valueC\n}
       . qq{blah blah\n}
       ;
 print qq{[[$s]]};
 ;;
 my $replace = qq{field2: valueB\n};
 ;;
 $s =~ s{ ^ field1: \s valueA \n \K (?! \Q$replace\E) .*? (?= field3: \s valueC \n) }
        {$replace}xmsg;
 print qq{[[$s]]};
"
[[yada
field1: valueA
some
lines
here
field3: valueC
blah blah
]]
[[yada
field1: valueA
field2: valueB
field3: valueC
blah blah
]]

Update 1: Changed `(?! $replace)` to `(?! \Q$replace\E)`.

Update 2: I don't know if I just missed it or it was added after I began my reply, but the code at the bottom of the OP looks a lot like my post, and in fact is better since it takes care of variable value fields as mine does not. ?!?
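For comparison, here is a minimal sketch of the line-by-line approach mentioned at the top of this reply, assuming the input is still a single slurped string. The helper name `fix_field2` and its arguments are hypothetical and not from the thread. It also assumes a record counts as "missing field2" only when no `field2:` line appears anywhere between `field1:` and `field3:`.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the lines once. When a field1: line is later
# followed by a field3: line with no field2: line in between, replace
# whatever sat between them with the default field2 line.
sub fix_field2 {
    my ($text, $default) = @_;
    my (@out, @pending);
    my $in_record = 0;

    for my $line (split /^/m, $text) {    # each piece keeps its trailing "\n"
        if ($in_record) {
            if ($line =~ /^field3:/) {
                # Keep the middle only if it already holds a field2 line.
                my $has_field2 = grep { /^field2:/ } @pending;
                push @out, ($has_field2 ? @pending : $default), $line;
                @pending   = ();
                $in_record = 0;
            }
            else {
                push @pending, $line;     # hold back until field3: is seen
            }
        }
        else {
            push @out, $line;
            $in_record = 1 if $line =~ /^field1:/;
        }
    }
    return join '', @out, @pending;       # an unterminated record is kept as-is
}

my $s = "yada\nfield1: valueA\nsome\nlines\nhere\nfield3: valueC\nblah blah\n";
print fix_field2($s, "field2: valueB\n");
```

Because each line is examined a constant number of times, the cost grows linearly with the input whether or not field2 is present.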
Re^2: Efficient method to replace middle lines only when no match by Zu (Initiate) on Mar 20, 2014 at 06:54 UTC
Thanks for using \K in your post. It made me think a bit more, and I came up with a solution that doesn't have a penalty when there's a negative match:

# slow no longer
my $has_field2 = "field1: valueA\nfield2: valueB\nfield3: valueC\n";
$has_field2 .= "........................................\n" x 1000;

$has_field2 =~ s/field1:[\n]*\n\K(?!field2:).*(?=\nfield3:)
                /field2: valueB
                /msx;

Also, I didn't edit my original post.
Re^3: Efficient method to replace middle lines only when no match by AnomalousMonk (Archbishop) on Mar 20, 2014 at 15:49 UTC
$has_field2 =~ s/field1:[\n]*\n\K(?!field2:).*(?=\nfield3:)
                /field2: valueB
                /msx;

Shouldn't the `[\n]*\n` after `field1:` in the search pattern be `[^\n]*\n` (negated class) instead?

The `.*` in the search pattern is greedy. Your test string has only one occurrence of the `"field1: valueA\nfield2: valueB\nfield3: valueC\n"` substring. What happens if you test against many occurrences? (See below; BTW, I have an updated version of this script that, among other things, uses hi-res timing if you're interested.) Won't the greedy `.*` just gobble all intervening occurrences?

You have a newline and a lot of blank space in your replacement string; is this what you want?

c:\@Work\Perl\monks\Zu>perl -wMstrict -le
"my $s = 'xxxyxxx';
 $s =~ s/y
        /FOO
        /xms;
 print qq{'$s'};
"
'xxxFOO
        xxx'
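The first two points can be checked with a small sketch. The test string and variable names below are made up for illustration, and the non-greedy variant with `/g` is one possible adjustment, not something proposed in the thread.

```perl
use 5.010;    # \K needs Perl 5.10+
use strict;
use warnings;

my $default = "field2: valueB";

# Two records that both lack field2, with an unrelated line between them.
my $text = "field1: a\nx\nfield3: c\nkeep me\nfield1: d\ny\nfield3: f\n";

# 1. [\n] matches only a newline, so "field1:[\n]*\n" cannot get past the
#    value on the field1 line and the substitution never fires.
my $typo       = $text;
my $typo_count = ($typo =~ s/field1:[\n]*\n\K(?!field2:).*(?=\nfield3:)/$default/ms) || 0;

# 2. Negated class with a greedy .* : the match runs on to the LAST
#    "\nfield3:", swallowing everything in between.
(my $greedy = $text) =~ s/^field1:[^\n]*\n\K(?!field2:).*(?=\nfield3:)/$default/ms;

# 3. Non-greedy .*? with /g : each record is handled on its own.
(my $lazy = $text) =~ s/^field1:[^\n]*\n\K(?!field2:).*?(?=\nfield3:)/$default/msg;

print "typo substitutions: $typo_count\n";
print "--- greedy ---\n$greedy";
print "--- non-greedy ---\n$lazy";
```

In the greedy case the "keep me" line and the whole second record disappear, which is the gobbling described above.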
Re: Efficient method to replace middle lines only when no match by AnomalousMonk (Archbishop) on Mar 19, 2014 at 03:55 UTC
Ok, with a better understanding of the requirements, here's a solution that seems sufficiently fast: I'm just timing to 1 sec resolution, but in general, `s///` with replacements is a little over 1 sec, without replacements a little under. To me, that seems pretty fast for a 21M timing test string. Read more... (2 kB)
Re: Efficient method to replace middle lines only when no match by kcott (Archbishop) on Mar 19, 2014 at 06:34 UTC
G'day Zu,

Welcome to the monastery.

I get the feeling that there's some aspect of this that you haven't told us about. For both strings, you have three capture groups but discard `$2` in each case. Also, your replacements both have "`\nfield2: valueB\n`" hard-coded.

However, you've said "The regex below works"; on that basis, this solution is not "slow" for either string.

#!/usr/bin/env perl -l

use strict;
use warnings;

use Time::HiRes qw{time};

my $no_field2 = "field1: valueA\nsome\nlines\nhere\nfield3: valueC\n"
    . "........................................\n" x 1000;
my $has_field2 = "field1: valueA\nfield2: valueB\nfield3: valueC\n"
    . "........................................\n" x 1000;

my $middle = "\nfield2: valueB\n";

my $re = qr{(^field1:.*?$).*?(^field3:)}ms;

replace_middle($_, $middle, $re) for ($no_field2, $has_field2);

sub replace_middle {
    my ($string, $middle, $re) = @_;

    print '-' x 40;
    print "Start:\n", substr $string, 0, 60;

    my $t0 = time;
    $string =~ s/$re/$1$middle$2/;
    my $t1 = time;

    print "Finish:\n", substr $string, 0, 60;
    print 'Time: ', $t1 - $t0;
}

Output:

----------------------------------------
Start:
field1: valueA
some
lines
here
field3: valueC
..............
Finish:
field1: valueA
field2: valueB
field3: valueC
...............
Time: 6.89029693603516e-05
----------------------------------------
Start:
field1: valueA
field2: valueB
field3: valueC
...............
Finish:
field1: valueA
field2: valueB
field3: valueC
...............
Time: 1.62124633789062e-05

-- Ken
Re^2: Efficient method to replace middle lines only when no match by Zu (Initiate) on Mar 20, 2014 at 07:21 UTC
Thanks for your post, Ken. I didn't see it until later; perhaps I didn't reload this page correctly.

I did have a second capture group, but it was extraneous. I had been testing simply using my machine's "time" command, and the has_field2 version would routinely take 4+ times longer than the no_field2 version on relatively small files. On multi-megabyte files it was a disaster. Your timing is obviously much more accurate.

Based on an earlier post I changed the RE (and eliminated the second capture group): `my $re = qr{field1:[^\n]*\n\K(?!^field2:).*(?=\nfield3:)}ms;`

This performs as I would expect: dramatically better. Thanks for your help!
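A rough way to compare the original pattern from the question with the revised one is to time a plain match against both test strings. This is only a sketch under stated assumptions: the padding is shortened to 200 lines (the thread uses 1000) so the slow case finishes quickly, the harness and its variable names are hypothetical, and the timings will vary by machine and Perl version.

```perl
use 5.010;    # \K needs Perl 5.10+
use strict;
use warnings;
use Time::HiRes qw(time);

# Shorter padding than in the thread, to keep the slow case brief.
my $padding    = "........................................\n" x 200;
my $no_field2  = "field1: valueA\nsome\nlines\nhere\nfield3: valueC\n" . $padding;
my $has_field2 = "field1: valueA\nfield2: valueB\nfield3: valueC\n"  . $padding;

my @patterns = (
    [ original => qr/(field1:.*?$)\n(?!^field2:)(.*$)\n(^field3:)/ms ],
    [ revised  => qr/field1:[^\n]*\n\K(?!^field2:).*(?=\nfield3:)/ms ],
);
my @cases = ( [ no_field2 => $no_field2 ], [ has_field2 => $has_field2 ] );

my %matched;    # "pattern/case" => 1 if the pattern matched, else 0
for my $p (@patterns) {
    my ($pname, $re) = @$p;
    for my $c (@cases) {
        my ($cname, $string) = @$c;
        my $t0  = time;
        my $hit = ($string =~ $re) ? 1 : 0;
        my $t1  = time;
        $matched{"$pname/$cname"} = $hit;
        printf "%-8s %-10s matched=%d  %.6f s\n", $pname, $cname, $hit, $t1 - $t0;
    }
}
```

When field2 is present, the original pattern can only fail after its lazy `.*?` has been extended across the padding, with the greedy `(.*$)` rescanning the rest of the string each time. The revised pattern fails as soon as the look-ahead sees `field2:` on the line after `field1:`.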