in reply to regexp: removing extra whitespace

s/\s(?<![ \n])//g; s/ \K +//g; s/\n\n\K\n+//g;

The order of the first two matters (e.g. foo{space}{tab}{space}bar). I gave them in the same order you requested them.
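
To see why the order matters, here is a minimal sketch. The two helper names are made up for this illustration; the substitutions themselves are the first two from above.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

# Hypothetical helper: the order given above (strip tabs etc., then collapse spaces).
sub strip_then_collapse {
    my ($s) = @_;
    $s =~ s/\s(?<![ \n])//g;
    $s =~ s/ \K +//g;
    return $s;
}

# Hypothetical helper: the same two substitutions in the opposite order.
sub collapse_then_strip {
    my ($s) = @_;
    $s =~ s/ \K +//g;
    $s =~ s/\s(?<![ \n])//g;
    return $s;
}

my $in = "foo \t bar";
printf "[%s]\n", strip_then_collapse($in);   # [foo bar]
printf "[%s]\n", collapse_then_strip($in);   # [foo  bar] (two spaces remain)
```

In the second order the two spaces are not yet adjacent when the collapse runs, so removing the tab afterwards leaves both behind.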


I find it odd that foo{tab}bar should become foobar. One usually wants foo{space}bar. To get the latter,

s/(?:\s(?<![ \n]))+/ /g; s/\n\n\K\n+//g;
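
A small sketch of that variant in use, assuming Perl 5.10 or later for \K. The helper name squash_ws is hypothetical and only wraps the two substitutions above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper wrapping the two substitutions above.
sub squash_ws {
    my ($s) = @_;
    $s =~ s/(?:\s(?<![ \n]))+/ /g;   # each run of non-space, non-newline whitespace becomes one space
    $s =~ s/\n\n\K\n+//g;            # newlines beyond two in a row are dropped
    return $s;
}

printf "[%s]\n", squash_ws("foo\t\tbar");   # [foo bar]

my $out = squash_ws("a\n\n\n\nb");
print length($out), "\n";                   # 4: "a", two newlines, "b"
```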

\s(?<![ \n])

is currently equivalent to

[\x{0009}\x{000B}-\x{000D}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

or sometimes the buggy

[\x{0009}\x{000B}-\x{000D}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

Update: While U+000B is considered a space by Unicode and \p{Space}, it's not considered a space by \s for historical reasons.
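
Since the exact set depends on the Perl version, it may be easiest to ask your own perl. The following is a rough sketch, not a definitive check: the helper name and the upper bound of U+3000 are choices made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points (up to $max) whose
# one-character string matches the given pattern.
sub matching_code_points {
    my ($re, $max) = @_;
    my @hits;
    for my $cp (0 .. $max) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
        my $ch = chr $cp;
        utf8::upgrade($ch);                       # use the upgraded string format throughout
        push @hits, sprintf('U+%04X', $cp) if $ch =~ $re;
    }
    return @hits;
}

print join(' ', matching_code_points(qr/\s(?<![ \n])/, 0x3000)), "\n";
```

The strings are upgraded on purpose so that the result is not affected by the downgraded-string behaviour that produces the second, shorter class.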

Re^2: regexp: removing extra whitespace
by Eliya (Vicar) on Nov 04, 2011 at 19:52 UTC

    Shouldn't - in theory - [^\S \n] match the same set as your \s(?<![ \n])  (\S being complementary to \s)?

    Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}, while \s(?<![ \n]) does.  Now I'm wondering why...

    BTW, \v (\x{000B}) isn't matched in either case, here.

      Yes, [^\S \n] and \s(?<![ \n]) are equivalent. Well, should be.

      Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}

      Sometimes it won't because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]). See Re: Can I change \s?.

      5.12 seems to have another problem on top of that.

      5.12:

      $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      0   # Expected
      $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      0   # Feature unicode_strings doesn't fix regexes yet.
      $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
      0   # Surprised!
      $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      0   # Surprised!

      (Last two are really the same.)

      Now with what should be an equivalent pattern.

      $ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
      0   # Expected
      $ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
      0   # Feature unicode_strings doesn't fix regexes yet.
      $ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
      1   # \N always returns an upgraded string.
      $ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
      1   # Forces the use of an upgraded string.
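
The part the upgraded string plays can be made explicit with utf8::upgrade and utf8::downgrade instead of relying on \N or a wide character. This is only a sketch: the helper name is invented, and the output is deliberately not shown because it depends on the Perl version, as the runs above illustrate.

```perl
use strict;
use warnings;

# Hypothetical helper: does a lone U+00A0 match the pattern when the
# string is stored in the requested internal format?
sub nbsp_matches {
    my ($re, $upgraded) = @_;
    my $s = "\x{00A0}";
    if ($upgraded) { utf8::upgrade($s) } else { utf8::downgrade($s) }
    return $s =~ $re ? 1 : 0;
}

for my $re (qr/[^\S \n]/, qr/\s(?<![ \n])/) {
    printf "%-20s downgraded=%d upgraded=%d\n",
        $re, nbsp_matches($re, 0), nbsp_matches($re, 1);
}
```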

      5.14:

      $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      0   # Bug kept for backwards compatibility
      $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      1
      $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
      1
      $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
      1
        ...because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]).

        What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class — from which it would follow that \S and \s aren't complementary...

        However, as the issue appears to be fixed in 5.14, I think we can leave it at that.

         

        (Update) But what about the vertical tab U+000B ?

        $ /usr/local/bin/perl5.14.1 -E'say "\x0b" =~ /\s/ ?1:0;'
        0
        $ /usr/local/bin/perl5.14.1 -E'say "\N{U+000B}" =~ /\s/ ?1:0;'
        0

        Shouldn't it be considered white space?  (Not that I've ever encountered it in the wild... just curious.)

        There is no \n{3,} in my code.

        That was an abbreviation. It was quicker to type \n{3,} than "it doesn't catch the 3 or more consecutive new lines."

        Here's what I need to accomplish with the regexp.
        /\s(?<![ \n])/
        /(?<=\s|^) | (?=\s|$)/
        /\n\n\K\n+/
        Unfortunately the second statement above doesn't work. I'm using perl v5.10.1 and it gives the following error when that statement is used.

        "Variable length lookbehind not implemented in regex /(?<=\s|^) | (?=\s|$)/ at test.pl line 10."
Re^2: regexp: removing extra whitespace
by perlmax (Initiate) on Nov 04, 2011 at 19:43 UTC
    Thanks for your response. I tried the regexp you posted and I'm still encountering a few problems. There are many lines that only contain a single space. Since there's still a space on the line it prevents the regexp from catching the \n{3,} occurrences. I should have specified in my original post that I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s. How can I revise the regexp you posted to include that functionality? Document example after the regexp:
    \n{space} \n{space} \n{space}
    All of the tabs in the document appear after a new line and since I'm preserving the new line characters I'm not worried about replacing the tabs with a space. Any tabs found in the middle of a line would be accidental but I'd still like to check for them and remove them if found.

      Since there's still a space on the line it prevents the regexp from catching the \n{3,} occurrences.

      There is no \n{3,} in my code. As for non-empty lines not getting deleted, that's consistent with what you asked. Are you now asking to consider lines with just whitespace to be empty?

      I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s

      That makes no sense. That says that {space}{space} should be collapsed to a space (which happens) and that {space}{newline} should be collapsed to {space} (which contradicts what you did say and makes no sense).