in reply to Re: regexp: removing extra whitespace
in thread regexp: removing extra whitespace

Shouldn't - in theory - [^\S \n] match the same set as your \s(?<![ \n])  (\S being complementary to \s)?

Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}, while \s(?<![ \n]) does.  Now I'm wondering why...

BTW, \v (\x{000B}) isn't matched in either case, here.

Re^3: regexp: removing extra whitespace
by ikegami (Patriarch) on Nov 04, 2011 at 20:57 UTC

    Yes, [^\S \n] and \s(?<![ \n]) are equivalent. Well, should be.

    Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}

    Sometimes it won't because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]). See Re: Can I change \s?.
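The claimed equivalence can be checked empirically. The following is a minimal sketch, not anything posted in this thread: the helper name `disagreements` and the choice of code point ranges are assumptions made for illustration. It upgrades each one-character string first, so the comparison is made on the internal representation that the `\N{U+00A0}` tests use.

```perl
use strict;
use warnings;

# Two patterns that should both mean "whitespace other than space or newline".
my $negated_class = qr/[^\S \n]/;
my $lookbehind    = qr/\s(?<![ \n])/;

# Return the code points from the given list on which the two patterns disagree.
sub disagreements {
    my @code_points = @_;
    my @differ;
    for my $cp (@code_points) {
        my $ch = chr $cp;
        utf8::upgrade($ch);    # force the upgraded (UTF-8) internal representation
        my $in_class = $ch =~ $negated_class ? 1 : 0;
        my $in_look  = $ch =~ $lookbehind    ? 1 : 0;
        push @differ, $cp if $in_class != $in_look;
    }
    return @differ;
}

# Latin-1 plus a few Unicode spaces; the ranges are an arbitrary sample.
printf "U+%04X\n", $_ for disagreements(0x00 .. 0xFF, 0x2000 .. 0x200A);
```

If the two patterns really are equivalent on a given perl, nothing is printed. Going by the results quoted in this thread, a 5.12 build would presumably report U+0085 and U+00A0.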

    5.12 seems to have another problem on top of that.

    5.12:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0   # Expected
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0   # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    0   # Surprised!
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0   # Surprised!

    (Last two are really the same.)

    Now with what should be an equivalent pattern.

    $ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0   # Expected
    $ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0   # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1   # \N always returns an upgraded string.
    $ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1   # Forces the use of an upgraded string.

    5.14:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0   # Bug kept for backwards compatibility
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
      ...because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]).

      What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class — from which it would follow that \S and \s aren't complementary...

      However, as the issue appears to be fixed in 5.14, I think we can leave it at that.

       

      (Update) But what about the vertical tab U+000B ?

      $ /usr/local/bin/perl5.14.1 -E'say "\x0b" =~ /\s/ ?1:0;'
      0
      $ /usr/local/bin/perl5.14.1 -E'say "\N{U+000B}" =~ /\s/ ?1:0;'
      0

      Shouldn't it be considered white space?  (Not that I've ever encountered it in the wild... just curious.)

        What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class

        There's a bug that affects both -- the first test in 5.14 and the first two in 5.12 -- and there's a bug that doesn't.

        You're quoting a comment about one when commenting on the other.

        But what about the vertical tab U+000B?

        It's in the Unicode property, but not \s.

        $ uniprops 0x0B
        U+000B ‹U+000B› \N{LINE TABULATION}
            \v \R \pC \p{Cc}
            All Any ASCII Assigned Basic_Latin C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS POSIX_Cntrl POSIX_Space Space VertSpace White_Space WSpace X_POSIX_Cntrl X_POSIX_Space
        $ perl -E'say "\x0B" =~ /\p{Space}/ ?1:0;'
        1
        $ perl -E'say "\x0B" =~ /\s/ ?1:0;'
        0

        But I remember some characters not being in \s for historical reasons.

        $ diff -u0 <( unichars '\s' ) <( unichars '\p{Space}' )
        --- /dev/fd/63 2011-11-04 21:18:53.160681893 -0400
        +++ /dev/fd/62 2011-11-04 21:18:53.160681893 -0400
        @@ -2,0 +3 @@
        + ---- U+000B LINE TABULATION
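The same comparison can be approximated without the `unichars` tool. This is a rough sketch under assumptions: the helper name `space_differences` and the scanned range are made up for illustration. It lists the code points on which `\s` and `\p{Space}` give different answers for the perl that runs it.

```perl
use strict;
use warnings;

# List the code points in [$first, $last] where \s and \p{Space} disagree.
sub space_differences {
    my ($first, $last) = @_;
    my @differ;
    for my $cp ($first .. $last) {
        my $ch = chr $cp;
        utf8::upgrade($ch);    # compare on the upgraded representation
        my $is_s     = $ch =~ /\s/        ? 1 : 0;
        my $is_space = $ch =~ /\p{Space}/ ? 1 : 0;
        push @differ, $cp if $is_s != $is_space;
    }
    return @differ;
}

# The range is an arbitrary sample that stays below the surrogates.
printf "U+%04X\n", $_ for space_differences(0x00, 0x3000);
```

The output depends on the perl version. Going by the diff above, a 5.14 build should list U+000B. perlrecharclass documents that `\s` matches the vertical tab from 5.18 onwards, so newer perls would be expected to print nothing.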
      There is no \n{3,} in my code.

      That was an abbreviation. It was quicker to type \n{3,} than "it doesn't catch three or more consecutive newlines."
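For the "three or more consecutive newlines" case, a minimal sketch using `\K` (available since perl 5.10) might look like the following. It assumes the goal is to collapse such runs down to exactly two newlines, which is a guess at the intent. The helper name `squeeze_blank_lines` is hypothetical.

```perl
use strict;
use warnings;

# Collapse any run of three or more newlines down to two.
# Assumption: the extra newlines are meant to be deleted, not replaced.
sub squeeze_blank_lines {
    my ($text) = @_;
    # \K keeps the first two newlines out of the match, so only the
    # newlines that follow them are removed.
    $text =~ s/\n\n\K\n+//g;
    return $text;
}

print squeeze_blank_lines("a\n\n\n\nb\n\nc\n");
```

Runs of one or two newlines are left alone, because `\n+` after the `\K` needs at least a third newline to match.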

      Here's what I need to accomplish with the regexp.
      /\s(?<![ \n])/
      /(?<=\s|^) | (?=\s|$)/
      /\n\n\K\n+/
      Unfortunately the second statement above doesn't work. I'm using perl v5.10.1 and it gives the following error when that statement is used.

      "Variable length lookbehind not implemented in regex /(?<=\s|^) | (?=\s|$)/ at test.pl line 10."

        In (?<=\s|^), \s is one character wide and ^ (an anchor) is zero characters wide, so the two alternatives have different lengths, hence the "variable length" error.

        Update: (?<=(?<=\s)|^) may work for you though.
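The suggested rewrite works because both branches of the outer lookbehind are themselves zero-width, so its length is fixed at zero. Here is a small sketch that finds the spaces following whitespace or the start of the string. The helper name `leading_space_offsets` and the sample string are made up for illustration, and only the first half of the original alternation is shown.

```perl
use strict;
use warnings;

# Both alternatives are zero-width assertions, so the outer lookbehind
# has a single fixed length (zero) and compiles on older perls.
my $after_ws_or_start = qr/(?<=(?<=\s)|^)/;

# Return the offset of every space that follows whitespace
# or sits at the very start of the string.
sub leading_space_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /$after_ws_or_start /g) {
        push @offsets, $-[0];    # $-[0] is the start offset of the match
    }
    return @offsets;
}

print join(',', leading_space_offsets(" a  b")), "\n";
```

An arguably simpler spelling with the same effect is `(?:(?<=\s)|^)`, which drops the outer lookbehind altogether.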

        True laziness is hard work