in reply to Re^5: Regex find and replace involving new line
in thread Regex find and replace involving new line

Does (([^\n]*[\n]){6})(.*) always match 6 lines?

It sometimes behaves unpredictably.

I have this command:

perl -i.bak -pe "BEGIN{undef $/;} s/\\cellx10464\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f4\\fs20\\b\\cf0 Patent Information\\b0([^\r?\n]*[\r?\n]){88}.*(EP \d{5,7})(([^\r?\n]*[\r?\n]){52}).*$2[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 \\f1\\fs20\\cf0 (B\d?)[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 [a-zA-Z]{3} [0-9,]{3} [0-9]{4}[\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\sl320\\plain\\f1\\fs20\\cf0 \\f1\\fs20\\cf0  [\r\n]+\\cell\\pard\\plain\\intbl\\s0\\ql\\fi0\\li0\\ri0\\plain/tttttt$2 $4/smg;" 1.rtf

My RTF file contains:

somestring(88 paragraphs matched as $1)(string $2)(52 paragraphs matched as $3).*$2{i.e. the string found as $2}[\r\n]+some string(string $4)somestring

It doesn't give the desired result. I think it matches the first occurrence of the first pattern and the last occurrence of the last pattern, and removes all the lines in between, which are the important ones.

I am using Strawberry Perl on Windows. Where am I making a mistake?

Re^7: Regex find and replace involving new line
by AnomalousMonk (Archbishop) on Dec 15, 2015 at 13:17 UTC
    Does (([^\n]*[\n]){6})(.*) always match 6 lines?

    That depends on what  . (dot) matches. By default, dot does not match a newline, but the  /s switch, which you appear to be using in your  s///smg substitution regex, causes dot to match everything, including newlines (see Modifiers). That means that  (.*) in the above quoted regex may match a great many lines!
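The effect of /s on dot can be seen in a small experiment. This is only a sketch: the eight-line sample text and the variable names are made up for illustration, and the regex is the one quoted above.

```perl
use strict;
use warnings;

# Eight short lines of hypothetical sample text.
my $text = join '', map { "line $_\n" } 1 .. 8;

# Without /s, dot stops at the next newline.
my ($six, undef, $rest_plain) = $text =~ /(([^\n]*[\n]){6})(.*)/;

# With /s, dot matches newlines too, so (.*) runs to the end of the text.
my (undef, undef, $rest_s) = $text =~ /(([^\n]*[\n]){6})(.*)/s;

# Count the newlines captured by the first group.
my $lines_in_first = () = $six =~ /\n/g;

print "first group holds $lines_in_first lines\n";
print "without /s, (.*) holds ", length($rest_plain), " characters\n";
print "with /s, (.*) holds ", length($rest_s), " characters\n";
```

The first group holds six lines either way; what changes is how much the trailing (.*) swallows.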

    (([^\r?\n]*[\r?\n]){52}).*$2

    The  $2 capture variable seems to be part of this regex — it's very hard to read because the code is so dense! I can't think of any circumstance in which this would be correct. Did you mean  \2 instead?
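The difference between a capture variable and a backreference inside a pattern can be sketched as follows. The sample string is invented, and group 1 is used instead of group 2 to keep the example short; the same reasoning applies to $2 and \2.

```perl
use strict;
use warnings;

# Hypothetical sample text with a repeated patent number.
my $text = "EP 11111 filler EP 22222 filler EP 22222";

# An earlier, unrelated match leaves 'EP 11111' in the variable $1.
"EP 11111" =~ /(EP \d{5})/ or die "setup match failed";

# A capture variable in a pattern is interpolated before matching starts,
# so this pattern really ends in the stale text 'EP 11111'.
my $with_var = $text =~ /(EP 2\d{4}).*$1/ ? "matched" : "no match";

# A backreference refers to what group 1 captured during this very match.
my $with_backref = $text =~ /(EP 2\d{4}).*\1/ ? "matched" : "no match";

print "capture variable in pattern: $with_var\n";
print "backreference in pattern:    $with_backref\n";
```

If no earlier match had succeeded, the capture variable would be undefined and would interpolate as an empty string (with a warning under use warnings).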

    [^\r?\n]  [\r?\n]

    These two character classes include the  ? character. Are you aware that  ? has no special meaning in a character class? It simply represents the literal character '?'. These two character classes could be equivalently written as [^?\r\n] and [?\r\n]. Is this what you intend?
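A quick check of how ? behaves inside and outside a character class, as a sketch with made-up sample strings:

```perl
use strict;
use warnings;

# Inside [...] the ? is a literal question mark, not a quantifier.
my $class_hits_qmark = "?" =~ /\A[\r?\n]\z/ ? 1 : 0;

# [^\r?\n]* therefore stops at a question mark as well as at a line ending.
my ($prefix) = "Really? Yes\r\n" =~ /\A([^\r?\n]*)/;

# Outside a class, \r?\n means "an optional CR followed by LF".
my @endings = map { /\Atext\r?\n\z/ ? 1 : 0 } ("text\n", "text\r\n");

print "class matches '?': $class_hits_qmark\n";
print "negated class stopped after: '$prefix'\n";
print "\\r?\\n matched LF and CRLF endings: @endings\n";
```

If the intent was "a line ending that may or may not have a carriage return", the optional \r belongs outside the brackets.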

    In general, your code is so dense as to be unreadable. What is the point of writing this as a one-liner? Do yourself a big favor and write this in a separate source file, with lots of whitespace delimiting various parts of the regex (see  /x in Modifiers). If you (and the monks hereabouts) can see the regex, you (and we) may be better able to see the problems.

    Update: I also notice that in your
        s/\\cellx10464\\pard\\plain...\\plain/tttttt$2 $4/smg;
    eye-bezoggling one-liner regex, there are some big chunks of literal text, some of which repeat. Were I to re-write this as a source file, I might write something like

    my $text = ...;
    ...
    my $pard = '\pard\plain\intbl\s0\ql\fi0\li0\ri0';
    $text =~ s{
        \Q\cellx10464\E \Q$pard\E
        \Q\sl320\plain\f4\fs20\b\cf0 Patent Information\b0\E
        ([^\r\n]*[\r\n]){88} .* (EP\d{5,7}) (([^\r\n]*[\r\n]){52}) .* \2
        [\r\n]+
        ...
        \Q\cell\E \Q$pard\E \Q\plain\E
        }
        {tttttt$2 $4}xmsg;
    ...
    (with maybe some  # comments ... in there also). See Quote and Quote-like Operators for info on the  \Q \E interpolation control escape sequences. (Update: There are also a few examples of the use of  \Q \E in perlretut Part 2, in the section "More on characters, strings, and character classes".)
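A small, self-contained sketch of the same idea, combining \Q \E with /x. The fragment of RTF, the variable names ($cell, $pard, $kind) and the shortened control-word list are assumptions made up for this illustration, not the real document.

```perl
use strict;
use warnings;

# Hypothetical, shortened RTF control words held as literal strings.
my $cell = '\cell';
my $pard = '\pard\plain\intbl';
my $text = 'x\cell\pard\plain\intbl\f1 B1 y';

# \Q...\E quotes the interpolated text so its backslashes match literally;
# /x allows whitespace and comments between the parts of the pattern.
my ($kind) = $text =~ m{
    \Q$cell\E \Q$pard\E    # fixed RTF prefix, matched literally
    \\f1 [ ]               # an escaped backslash, then one literal space
    (B\d?)                 # the kind code
}x;

print "kind code: $kind\n";
```

Keeping the repeated literal chunks in variables means each one is written, and checked for typos, only once.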

    And of course, always use warnings; and use strict; if you write this as a separate file, or even as a one-liner!


    Give a man a fish:  <%-{-{-{-<

Re^7: Regex find and replace involving new line
by Corion (Patriarch) on Dec 15, 2015 at 09:21 UTC
    [^\r?\n]*[\r?\n]

    What is that part supposed to match?

    What you wrote there makes little to no sense in the context of trying to match lines. Please read perlre and perlretut to find out how character classes work. Wildly adding things to character classes usually makes things worse.
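For comparison, one possible way to match a fixed number of lines with either LF or CRLF endings is sketched below. It assumes that "line" means text terminated by an optional CR plus LF, and the sample rows are invented.

```perl
use strict;
use warnings;

# Eight hypothetical rows with Windows-style CRLF line endings.
my $text = join '', map { "row $_\r\n" } 1 .. 8;

# Each repetition is: anything that is not CR or LF, then one line ending.
my ($block) = $text =~ /\A((?:[^\r\n]*\r?\n){6})/;

# Split the captured block back into rows to check what was matched.
my @rows = split /\r?\n/, $block;

print scalar(@rows), " rows captured, last one is '$rows[-1]'\n";
```

The character class holds only the characters to exclude, and the optional \r sits outside it.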

      Thanks for the reply. I will learn and get back.