
Yes, [^\S \n] and \s(?<![ \n]) are equivalent. Well, should be.

Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}

Sometimes it won't because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]). See Re: Can I change \s?.

5.12 seems to have another problem on top of that.

5.12:

$ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
0 # Expected
$ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
0 # Feature unicode_strings doesn't fix regexes yet.
$ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
0 # Surprised!
$ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
0 # Surprised!

(Last two are really the same.)

Now with what should be an equivalent pattern.

$ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
0 # Expected
$ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
0 # Feature unicode_strings doesn't fix regexes yet.
$ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
1 # \N always returns an upgraded string.
$ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
1 # Forces the use of an upgraded string.

5.14:

$ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
0 # Bug kept for backwards compatibility
$ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
1
$ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
1
$ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
1
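
The split between the "Expected" results and the upgraded ones comes down to how the string is stored internally. Here is a minimal sketch of that, assuming Perl 5.14 or later with no unicode_strings feature in scope. The helper name horizontal_ws_hits is made up for illustration, and utf8::upgrade stands in for the upgrade that \N{U+00A0} triggers in the one-liners.

```perl
use strict;
use warnings;

# Hypothetical helper: test one string against both patterns from the thread.
# Returns a two-element list of 0/1 flags.
sub horizontal_ws_hits {
    my ($str) = @_;
    return (
        ($str =~ /[^\S \n]/     ? 1 : 0),   # negated character class
        ($str =~ /\s(?<![ \n])/ ? 1 : 0),   # \s plus negative lookbehind
    );
}

my $native = "\x{00A0}";      # code point below 0x100: stored as a single byte
my $upgraded = "\x{00A0}";
utf8::upgrade($upgraded);     # same character, UTF-8 internal storage

printf "native:   %d %d\n", horizontal_ws_hits($native);
printf "upgraded: %d %d\n", horizontal_ws_hits($upgraded);
```

If the two patterns really are equivalent on your perl, the two numbers on each line should agree. A mismatch on the upgraded line would point at the extra 5.12 problem.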

Re^4: regexp: removing extra whitespace
by Eliya (Vicar) on Nov 04, 2011 at 21:28 UTC
    ...because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]).

    What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class — from which it would follow that \S and \s aren't complementary...

    However, as the issue appears to be fixed in 5.14, I think we can leave it at that.

     

    (Update) But what about the vertical tab U+000B?

    $ /usr/local/bin/perl5.14.1 -E'say "\x0b" =~ /\s/ ?1:0;'
    0
    $ /usr/local/bin/perl5.14.1 -E'say "\N{U+000B}" =~ /\s/ ?1:0;'
    0

    Shouldn't it be considered white space?  (Not that I've ever encountered it in the wild... just curious.)

      What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class

      There's a bug that affects both patterns (see the first test in 5.14 and the first two in 5.12), and there's a second one (the extra 5.12 problem) that affects only [^\S \n].

      You're quoting a comment about one when commenting on the other.

      But what about the vertical tab U+000B?

      It's in the Unicode property, but not \s.

      $ uniprops 0x0B
      U+000B ‹U+000B› \N{LINE TABULATION}
          \v \R \pC \p{Cc}
          All Any ASCII Assigned Basic_Latin C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS POSIX_Cntrl POSIX_Space Space VertSpace White_Space WSpace X_POSIX_Cntrl X_POSIX_Space
      $ perl -E'say "\x0B" =~ /\p{Space}/ ?1:0;'
      1
      $ perl -E'say "\x0B" =~ /\s/ ?1:0;'
      1

      But I remember some characters not being in \s for historical reasons.

      $ diff -u0 <( unichars '\s' ) <( unichars '\p{Space}' )
      --- /dev/fd/63 2011-11-04 21:18:53.160681893 -0400
      +++ /dev/fd/62 2011-11-04 21:18:53.160681893 -0400
      @@ -2,0 +3 @@
      + ---- U+000B LINE TABULATION
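
If you don't have uniprops or unichars handy, a rough check of the same thing can be done in plain Perl. This is only a sketch: vt_report is a hypothetical name, and the \s result depends on your perl (5.14 leaves VT out of \s, while 5.18 and later include it).

```perl
use strict;
use warnings;

# Hypothetical helper: report which whitespace classes accept
# the vertical tab (U+000B) on the running perl.
sub vt_report {
    my $vt = "\x0B";
    return {
        '\s'          => ($vt =~ /\s/          ? 1 : 0),   # version-dependent
        '\p{Space}'   => ($vt =~ /\p{Space}/   ? 1 : 0),   # Unicode property
        '\v'          => ($vt =~ /\v/          ? 1 : 0),   # vertical whitespace
        '[[:space:]]' => ($vt =~ /[[:space:]]/ ? 1 : 0),   # POSIX class
    };
}

my $report = vt_report();
for my $class (sort keys %$report) {
    printf "%-12s %d\n", $class, $report->{$class};
}
```

On a perl where \s and \p{Space} disagree, \p{Space} or [[:space:]] is the safer choice when VT matters.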
Re^4: regexp: removing extra whitespace
by perlmax (Initiate) on Nov 04, 2011 at 21:37 UTC
    There is no \n{3,} in my code.

    That was an abbreviation. It was quicker to type \n{3,} than "it doesn't catch three or more consecutive newlines."

    Here's what I need to accomplish with the regexp.
    /\s(?<![ \n])/
    /(?<=\s|^) | (?=\s|$)/
    /\n\n\K\n+/
    Unfortunately, the second pattern above doesn't work. I'm using perl v5.10.1, and it gives the following error when that pattern is used.

    "Variable length lookbehind not implemented in regex /(?<=\s|^) | (?=\s|$)/ at test.pl line 10."

      In (?<=\s|^), \s is one character wide and ^ (an anchor) is zero characters wide, hence the "variable length" error.

      Update: (?<=(?<=\s)|^) may work for you though.
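
      The nested form gets past the compiler because both alternatives inside the outer lookbehind, (?<=\s) and ^, are zero-width, so its length is fixed at 0. Here is a small sketch comparing it with a negative-lookaround spelling of the same test. The sub names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Nested form: the outer lookbehind only contains zero-width assertions.
sub strip_nested {
    my ($s) = @_;
    $s =~ s/(?<=(?<=\s)|^) | (?=\s|$)//g;
    return $s;
}

# Negative form: "not preceded by non-whitespace" also covers the start
# of the string, and "not followed by non-whitespace" covers the end.
sub strip_negative {
    my ($s) = @_;
    $s =~ s/(?<!\S) | (?!\S)//g;
    return $s;
}

my $sample = " a b  c \n d ";
print strip_nested($sample) eq strip_negative($sample)
    ? "same result\n"
    : "different result\n";
```

      Both spell out the same condition: a space is removed when the character on either side of it is whitespace or a string boundary.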

      True laziness is hard work

        I'd go with

        (?<!\S)

        so

        s/[^\S \n]//g;
        s/(?<!\S) +| +(?!\S)//g;
        s/\n\n\K\n+//g;
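
        Wrapped up as a sub, those three steps might look like the sketch below. tidy_whitespace and the sample input are hypothetical, \K needs 5.10 or later, and the first substitution is still subject to the native vs. upgraded string issue discussed earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the three substitutions.
sub tidy_whitespace {
    my ($text) = @_;
    $text =~ s/[^\S \n]//g;             # drop whitespace other than space and newline
    $text =~ s/(?<!\S) +| +(?!\S)//g;   # drop spaces next to whitespace or a string edge
    $text =~ s/\n\n\K\n+//g;            # keep two newlines, drop any that follow
    return $text;
}

my $out = tidy_whitespace("  foo \t\n\n\n\nbar  ");
(my $shown = $out) =~ s/\n/\\n/g;       # make newlines visible when printing
print "$shown\n";
```

        One thing to watch: with the second substitution, a run of two or more spaces between words is removed entirely rather than reduced to a single space, because every space in the run touches another space.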
        (?<=(?<=\s)|^) may work for you though.
        That fixed it! Thanks.