in reply to Re^3: Regex find and replace involving new line
in thread Regex find and replace involving new line

My data files look exactly like the sample I showed in my initial post. I have only just installed Strawberry Perl, so I don't yet know how to code in it. I want to perform find and replace in a text file using Perl regular expressions. I am a fan of PCRE and have some knowledge of it. When I asked for a command-line alternative to Notepad++, Perl was suggested as a good choice. I tried it on the following file:

1

2

3

4

dog

1

2

3

4

5

puppy

7

8

I want to replace dog with puppy. Sometimes I want to preserve the lines between them, and at other times I want to remove them.

When I use this one-liner:

perl -i.bak -pe "BEGIN{undef $/;} s/(dog)((.*[\r\n]+){6})(.*)/$1/smg" 4.txt

it gives

1

2

3

4

dog

while Notepad++'s regex search gives the desired result when I search for (dog)((.*[\r\n]+){6})(.*) and replace with \1:

1

2

3

4

7

dog

8

Likewise, when I replaced it with \2, it gave all the lines without dog and puppy, and \3 gave me puppy in place of dog in the result above. Everything behaves perfectly there.

So the main thing I want is to define the groups as follows: group 1 = dog; group 2 = the 6 lines after the found string (i.e. dog); group 3 = whatever is present on line 7. I want to be able to put the groups back as required.

Unanswered questions

How do I use grouping in Perl with a regex that matches newlines? Am I using grouping correctly in my code above?

How do I specify the substitution? Is the substitution correct as a whole in my code?

How do I refer to remembered groups? Am I making a mistake in the substitution part when specifying groups like $1, $2, $3 or sometimes $& (what is found)?

I don't have sufficient knowledge to understand what you have suggested so I tried

perl -i.bak -pe "BEGIN{undef $/;} s/(dog)(.*([^\n]*\r?\n){6})(.*)/$1/smg" 4.txt

It gave the same result as the earlier one-liner.

Re^5: Regex find and replace involving new line
by Corion (Patriarch) on Dec 14, 2015 at 11:54 UTC

    When I run your regular expression

    (dog)((.*[\r\n]+){6})(.*)

    in Notepad++ with the following text:

    1 2 3 4 dog 1 2 3 4 5 puppy 7 8

    I also get

    1 2 3 4 dog 7 8

    Which is what I would expect, because the replacement only specifies \1, not \2 or \3.

    To your questions:

    How do I use grouping in Perl with a regex that matches newlines? Am I using grouping correctly in my code above?

    Captured groups are referenced in Perl using $1 for the first opening parenthesis, $2 for the second opening parenthesis and so on. You've used $1 in your existing Perl code, and from your description I think you would also want $2 (the six lines) and, for the seventh line, $4. In your regex $3 is the inner group that {6} repeats, so it only holds the last of those lines.
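
    As a minimal sketch of that numbering, the script below runs the regex from this reply against the sample data and prints each group with its newlines made visible. It assumes the file holds one item per line with "\n" line endings; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Sample data from the thread, one item per line (assumed "\n" endings).
my $text = join "\n", qw(1 2 3 4 dog 1 2 3 4 5 puppy 7 8), '';

# Groups are numbered by the position of their opening parenthesis:
#   group 1 = (dog)
#   group 2 = (( ... ){6})  rest of the "dog" line plus the next five lines
#   group 3 = ([^\n]*\n)    inner group; keeps only its last repetition
#   group 4 = (.*)          the following line (no /s, so dot stops at "\n")
my @groups;
if ($text =~ /(dog)(([^\n]*\n){6})(.*)/) {
    @groups = ($1, $2, $3, $4);
}

for my $i (0 .. $#groups) {
    (my $shown = $groups[$i]) =~ s/\n/\\n/g;    # make newlines visible
    print 'group ', $i + 1, " = '$shown'\n";
}
```

    Note that the first of the six repetitions only consumes the newline that ends the "dog" line, which is why {6} reaches exactly as far as the line before "puppy".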

    How do I specify the substitution? Is the substitution correct as a whole in my code?

    You use the s/// operator, as you already do.

    How do I refer to remembered groups? Am I making a mistake in the substitution part when specifying groups like $1, $2, $3 or sometimes $& (what is found)?

    You use $1 and its siblings.

    Assuming that the output text you want (which you do not show us, even though I had recommended this) is the following:

    1 2 3 4 dog=puppy 7 8

    the following works for me:

    s/(dog)(([^\n]*[\n]){6})(.*)/$1=$4/g;
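
    For reference, here is a small self-contained sketch of that substitution applied to the whole text held in one string. The sample data and "\n" line endings are assumptions taken from the thread, and the second substitution is only one possible reading of "replace dog with puppy and drop the lines in between".

```perl
use strict;
use warnings;

# Whole "file" in one string, as it would be after slurping.
my $text = join "\n", qw(1 2 3 4 dog 1 2 3 4 5 puppy 7 8), '';

# No /s modifier here, so (.*) stops at the end of the "puppy" line.
(my $joined = $text) =~ s/(dog)(([^\n]*[\n]){6})(.*)/$1=$4/g;

# Variant: keep only the seventh line, dropping "dog" and the six lines.
(my $swapped = $text) =~ s/(dog)(([^\n]*[\n]){6})(.*)/$4/g;

print "--- \$1=\$4 ---\n$joined";
print "--- \$4 only ---\n$swapped";
```

    On the command line the same substitution needs the whole file in one string; perl's -0777 switch is one way to get that instead of the BEGIN block.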

    If my interpretation of what you want is wrong, maybe now is a good time to be more specific and show concrete examples of what you want, and the exact cases when you want the specific results. "Sometimes this and sometimes that" is not a specific explanation.

      Does (([^\n]*[\n]){6})(.*) always match 6 lines?

      It sometimes behaves erratically.

      I have this one-liner:

      perl -i.bak -pe "BEGIN{undef $/;} s/\\cellx10464\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f4\\fs20\\b\\cf0 Patent Information\\b0([^\r?\n]*[\r?\n]){88}.*(EP \d{5,7})(([^\r?\n]*[\r?\n]){52}).*$2[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 \\f1\\fs20\\cf0 (B\d?)[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 [a-zA-Z]{3} [0-9,]{3} [0-9]{4}[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 \\f1\\fs20\\cf0  [\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\plain/tttttt$2 $4/smg;" 1.rtf

      My RTF contains

      somestring(88 paragraphs matched as $1)(string $2)(52 paragraphs matched as $3).*$2{means found string}[\r\n]+some string(string $4)somestring

      It doesn't give the desired result. I think it matches from the first occurrence of the first string to the last occurrence of the last string, and removes all the lines in between, which are the important ones.

      I am using Strawberry Perl on Windows. Where am I making a mistake?

        Does (([^\n]*[\n]){6})(.*) always match 6 lines?

        That depends on what . (dot) matches. By default, dot does not match a newline, but the /s switch, which you appear to be using in your s///smg substitution regex, causes dot to match everything, including newlines (see Modifiers). That means that (.*) in the above quoted regex may match a great many lines!
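
        A short sketch of that difference, using the regex from the original one-liner on the sample data (assumed here to have "\n" line endings). With /s the trailing (.*) runs to the end of the input, which reproduces the truncated output reported at the top of the thread.

```perl
use strict;
use warnings;

my $text = join "\n", qw(1 2 3 4 dog 1 2 3 4 5 puppy 7 8), '';

# With /s, dot also matches "\n", so the final (.*) swallows
# everything up to the end of the string.
(my $with_s = $text) =~ s/(dog)((.*[\r\n]+){6})(.*)/$1/smg;

# Without /s, each .* stays on one line and (.*) is just "puppy".
(my $without_s = $text) =~ s/(dog)((.*[\r\n]+){6})(.*)/$1/mg;

print "with /s:\n$with_s\n\nwithout /s:\n$without_s";
```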

        (([^\r?\n]*[\r?\n]){52}).*$2

        The $2 capture variable seems to be part of this regex (it's very hard to read because the code is so dense!). I can't think of any circumstance in which this would be correct. Did you mean \2 instead?
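
        To illustrate the distinction on a tiny example (the string is made up, and group 1 stands in for the group 2 of the big regex): a backreference such as \1 must match the text the group captured during this match, whereas $1 is interpolated into the pattern before the match starts.

```perl
use strict;
use warnings;

my $s = "EP 12345 B1 EP 99999";

# \1 is a backreference: the same text that group 1 captured must
# occur again later in the string. It does not, so this fails.
my $backref = ($s =~ /(EP \d{5}).*\1/) ? 'match' : 'no match';

# $1 is interpolated when the pattern is built, before this match has
# captured anything. It is undef here, so the pattern is effectively
# /(EP \d{5}).*/ and the "must occur again" condition is lost.
my $interpolated;
{
    no warnings 'uninitialized';    # silence the warning about undef $1
    $interpolated = ($s =~ /(EP \d{5}).*$1/) ? 'match' : 'no match';
}

print "backreference: $backref, interpolated: $interpolated\n";
```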

        [^\r?\n]  [\r?\n]

        These two character classes include the ? character. Are you aware that ? has no special meaning in a character class? It simply represents the literal character '?'. These two character classes could be equivalently written as [^?\r\n] and [?\r\n]. Is this what you intend?
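
        A quick way to check this for yourself; the test strings are invented for illustration:

```perl
use strict;
use warnings;

# Inside [...] the ? is an ordinary character, so this class matches "?".
my $class_matches_qmark = ("?" =~ /^[\r?\n]$/) ? 1 : 0;

# Outside a class, \r?\n means "an optional CR followed by LF".
my @endings = map { /^\r?\n\z/ ? 1 : 0 } ("\n", "\r\n", "?\n");

# Consequently [^\r?\n]* stops at a question mark inside a line.
my ($prefix) = "what? ever\n" =~ /^([^\r?\n]*)/;

print "class matches '?': $class_matches_qmark\n";
print "line endings accepted: @endings\n";
print "prefix before '?': $prefix\n";
```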

        In general, your code is so dense as to be unreadable. What is the point of writing this as a one-liner? Do yourself a big favor and write this in a separate source file, with lots of whitespace delimiting various parts of the regex (see /x in Modifiers). If you (and the monks hereabouts) can see the regex, you (and we) may be better able to see the problems.

        Update: I also notice that in your
            s/\\cellx10464\\pard\\plain...\\plain/tttttt$2 $4/smg;
        eye-bezoggling one-liner regex, there are some big chunks of literal text, some of which repeat. Were I to re-write this as a source file, I might write something like

        my $text = ...;
        ...
        my $pard = '\pard\plain\intbl\s0\ql\fi0\li0\ri0';
        $text =~ s{
            \Q\cellx10464\E \Q$pard\E
            \Q\sl320\plain\f4\fs20\b\cf0 Patent Information\b0\E
            ([^\r\n]*[\r\n]){88} .*
            (EP [ ] \d{5,7})
            (([^\r\n]*[\r\n]){52}) .* \2 [\r\n]+
            ...
            \Q\cell\E \Q$pard\E \Q\plain\E
        }
        {tttttt$2 $4}xmsg;
        ...

        (with maybe some # comments ... in there also). Under /x a bare space in the pattern is ignored, so the space after EP is written as [ ]. See Quote and Quote-like Operators for info on the \Q \E interpolation control escape sequences. (Update: There are also a few examples of the use of \Q \E in perlretut Part 2, in the section "More on characters, strings, and character classes".)
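
        A reduced, runnable sketch of the same idea follows. The markup strings are a hypothetical, heavily shortened stand-in for the real RTF, and the literal pieces are kept in single-quoted variables so that \Q$var\E does all the escaping.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-ins for the RTF markup in the thread.
my $cell = '\cell';
my $pard = '\pard\plain\intbl\s0';
my $text = '\cell\pard\plain\intbl\s0 EP 123456 B1';

my ($ep, $kind);
if ($text =~ m{
        \Q$cell\E \Q$pard\E   # literal markup, metacharacters escaped
        [ ]                   # literal space (bare spaces are ignored under /x)
        (EP [ ] \d{5,7})      # group 1: the EP number
        [ ]
        (B\d?)                # group 2: the B code that follows
    }x) {
    ($ep, $kind) = ($1, $2);
    print "found '$ep' with '$kind'\n";
}
```

        Comments inside a /x pattern are still subject to variable interpolation, so it is safest to keep $ and @ out of them.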

        And of course, always use warnings; and use strict; if you write this as a separate file, or even as a one-liner!


        Give a man a fish:  <%-{-{-{-<

        [^\r?\n]*[\r?\n]

        What is that part supposed to match?

        What you wrote there makes little to no sense in the context of trying to match lines. Please read perlre and perlretut to find out how character classes work. Wildly adding things to character classes usually makes things worse.

      That's the answer I wanted. Currently it's doing the job for me. I will get the entire list working and then come back to you with any specific problem. My find string is very big, containing approximately 200 lines, so I guess putting it here will not be feasible. I am very new to the forum. Please suggest how I can share that string.

      It takes too much time to test Perl regexes from the command line. Is there any tool for testing Perl regexes on Windows?