in reply to Re^2: regex doubt on excluding
in thread regex doubt on excluding

Ok, I understand now, and it seems I spoke too soon: the original code is removing some newlines, since it reduces a sequence of successive newlines to a single one.

I don’t understand how this is working. From perlre#Regular-Expressions:

By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator.

— but I don’t see how this explains the behaviour we are seeing?
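For reference, here is a small sketch of what that passage says about where "^" and "$" can match, with and without /m. The helper name offsets and the test string are my own choices for illustration, not anything from the original code.

```perl
use strict;
use warnings;

my $s = "a\n\n\nb";    # three embedded newlines, no trailing newline

# Hypothetical helper: collect the offsets at which a zero-width pattern matches.
sub offsets {
    my ($re, $str) = @_;
    my @at;
    push @at, $-[0] while $str =~ /$re/g;
    return @at;
}

my @caret_plain  = offsets(qr/^/,  $s);    # start of string only
my @caret_multi  = offsets(qr/^/m, $s);    # also just after each embedded newline
my @dollar_plain = offsets(qr/$/,  $s);    # end of string only
my @dollar_multi = offsets(qr/$/m, $s);    # just before each newline, and at the end

print "caret,  no /m: @caret_plain\n";
print "caret,  /m   : @caret_multi\n";
print "dollar, no /m: @dollar_plain\n";
print "dollar, /m   : @dollar_multi\n";
```

With /m, both anchors find several positions inside the string; without it, each finds exactly one.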

Update: Ignore this “solution”, it doesn’t remove the whitespace! (Insufficient testing.)

In any case, one way to get the desired behaviour is to add a negative look-ahead assertion:

19:18 >perl -wE "my $s = qq[abc\n\t \ndef\n \n\n\ngh]; $s =~ s/^\s+$(?!$)//mg; say $s;"
abc
def
gh
19:19 >
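A quick check in script form, using the same test string as the one-liner, suggests why the update above was needed. As far as I can tell, "$" is zero-width, so the (?!$) that follows it is tested at the very same position and can never succeed there. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $s = "abc\n\t \ndef\n \n\n\ngh";

# Apply the substitution to a copy so the original can be compared afterwards.
(my $t = $s) =~ s/^\s+$(?!$)//mg;

# Count how many times the pattern matches at all.
my $count = () = $s =~ /^\s+$(?!$)/mg;

print "matches: $count\n";
print $t eq $s ? "string unchanged\n" : "string changed\n";
```

So the look-ahead does not restrict the match, it prevents it altogether, and the whitespace-only lines stay as they were.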

Can someone please explain what the regex is doing here?

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^4: regex doubt on excluding
by SuicideJunkie (Vicar) on Apr 21, 2014 at 14:49 UTC

    Well, the ^ could match just after the newline following "def", and the $ could match just before the newline preceding 'gh', and all the newlines between those two are greedily accepted by the \s+ and thus eliminated.

    ^ matching *after* a newline means the first newline would not be included and eliminated. $ matching before a newline means the last newline is not eliminated either.

    Those two newlines make for one blank line between the non-blank lines, and any excess whitespace including newlines between them is removed.
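A rough way to check this is sketched below. Note that the pattern s/^\s+$//mg and the test string are assumptions on my part (the original code is not quoted in this reply), so treat it as an illustration rather than the exact code under discussion.

```perl
use strict;
use warnings;

my $s = "abc\n\t \ndef\n \n\n\ngh";

# Assumed pattern: one or more whitespace characters spanning whole "lines".
(my $t = $s) =~ s/^\s+$//mg;

# Show the newlines explicitly so the result is easy to read.
(my $shown = $t) =~ s/\n/\\n/g;
print "$shown\n";
```

The newline before the match and the newline after it both survive, which leaves exactly one blank line between "def" and "gh".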

      Hello SuicideJunkie, and thanks for the answer. Unfortunately, I’m still confused. :-(

      From your explanation, I would expect that making the whitespace match non-greedy would prevent the intermediate newline(s) from being eliminated. But it doesn’t (see below). Here is my current understanding (obviously flawed) of what should happen:

      • ^ and $ are zero-width assertions, so when they feature in a match the newline they follow/precede is not substituted. For example:

        18:14 >perl -wE "my $s = qq[\n\n\n]; my $t = $s =~ s{$}{}gmr; say $s eq $t;"
        1
        18:14 >

      • \s*? matches zero or more whitespace characters (including newline) non-greedily.

      • With the /g modifier in effect, whenever a match succeeds the regex engine begins looking for the next match one character past where the last successful match began.

      Given these assumptions, I would expect that the regex /^\s*?$/ would match the string "a\n\n\nb" as follows: First, ^ matches after the first newline. Since \s*? is non-greedy, the regex engine looks for the shortest match satisfying \s*?$, and finds it in the zero-length string between the first two newlines. This it replaces with another zero-length string. It then starts looking for the next match with ^ matching after the second newline. Again, it finds and replaces a zero-length string. Finally, ^ matches after the final newline, but no match is found. Result: the string is unchanged. However:

      What am I missing?

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        What am I missing?

        I think rxrx :) pos , @- and @+

        So it matched the zero-length string, which doesn't advance the position; then it matches one newline at the same position, thus advancing the position; then it matches the zero-length string again, and that's the end of the matches.

        "a\n\n\nb"
        s(2)e(2)pos(2)len(0) ("a\n", "", "\n\nb")
        s(2)e(3)pos(3)len(1) ("a\n", "\n", "\nb")
        s(3)e(3)pos(3)len(0) ("a\n\n", "", "\nb")

        I think that makes sense :)
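Something like the following should reproduce that trace with nothing but pos, @- and @+. It is only a sketch: the output format mimics the lines above, and the empty replacement in the last step is my assumption about the substitution being discussed.

```perl
use strict;
use warnings;

my $s = "a\n\n\nb";
my @trace;

# Walk the matches one at a time; @- and @+ hold the start and end offsets.
while ($s =~ /^\s*?$/mg) {
    push @trace, sprintf "s(%d)e(%d)pos(%d)len(%d)",
        $-[0], $+[0], pos($s), $+[0] - $-[0];
}
print "$_\n" for @trace;

# The same pattern as a substitution with an empty replacement.
(my $t = $s) =~ s/^\s*?$//mg;
(my $shown = $t) =~ s/\n/\\n/g;
print "$shown\n";
```

The second match is the interesting one: after a zero-length match, the engine will not accept another zero-length match at the same position, so \s*? is forced to take one newline, and that newline is what gets removed.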

      I just tried jellisii2's solution, s/^\s*$/\n/mg, and it worked without the non-greedy modifier. Each matching line was replaced by a single newline.

Re^4: regex doubt on excluding
by Anonymous Monk on Apr 21, 2014 at 09:47 UTC

    Thank you Athanasius.

    I got it working with $string =~ s/\s*?\n/\n/mg;
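For anyone reading later, here is a small sketch of what that substitution appears to do. The test string is borrowed from earlier in the thread and is not necessarily the data the poster used. As far as I can see, it strips whitespace that sits directly before each newline and keeps every newline, so runs of blank lines are not collapsed. The /m should make no difference here, since the pattern contains neither ^ nor $.

```perl
use strict;
use warnings;

my $string = "abc\n\t \ndef\n \n\n\ngh";
my $before = () = $string =~ /\n/g;    # number of newlines going in

$string =~ s/\s*?\n/\n/mg;

my $after = () = $string =~ /\n/g;     # number of newlines coming out

# Show the newlines explicitly so the result is easy to read.
(my $shown = $string) =~ s/\n/\\n/g;
print "$shown\n";
print "newlines before: $before, after: $after\n";
```

The tab and spaces on the whitespace-only lines disappear, while the newline count stays the same.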