
...because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]).

What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class — from which it would follow that \S and \s aren't complementary...

However, as the issue appears to be fixed in 5.14, I think we can leave it at that.

(Update) But what about the vertical tab U+000B?

$ /usr/local/bin/perl5.14.1 -E'say "\x0b" =~ /\s/ ?1:0;'
0
$ /usr/local/bin/perl5.14.1 -E'say "\N{U+000B}" =~ /\s/ ?1:0;'
0

Shouldn't it be considered white space?  (Not that I've ever encountered it in the wild... just curious.)

Re^5: regexp: removing extra whitespace
by ikegami (Patriarch) on Nov 05, 2011 at 01:11 UTC

    What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class

    There's a bug that affects both -- the first test in 5.14 and the first two in 5.12 -- and there's a bug that doesn't.

    You're quoting a comment about one when commenting on the other.

    But what about the vertical tab U+000B?

    It's in the Unicode property \p{Space}, but not in \s.

    $ uniprops 0x0B
    U+000B ‹U+000B› \N{LINE TABULATION}
        \v \R \pC \p{Cc}
        All Any ASCII Assigned Basic_Latin C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS POSIX_Cntrl POSIX_Space Space VertSpace White_Space WSpace X_POSIX_Cntrl X_POSIX_Space
    $ perl -E'say "\x0B" =~ /\p{Space}/ ?1:0;'
    1
    $ perl -E'say "\x0B" =~ /\s/ ?1:0;'
    0

    But I remember some characters not being in \s for historical reasons.

    $ diff -u0 <( unichars '\s' ) <( unichars '\p{Space}' )
    --- /dev/fd/63 2011-11-04 21:18:53.160681893 -0400
    +++ /dev/fd/62 2011-11-04 21:18:53.160681893 -0400
    @@ -2,0 +3 @@
    + ---- U+000B LINE TABULATION