Re: regexp: removing extra whitespace

s/\s(?<![ \n])//g;
s/ \K +//g;
s/\n\n\K\n+//g;

The order of the first two matters (e.g. foo{space}{tab}{space}bar). I gave them in the same order you requested them.
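
To see why the order matters, here is a minimal sketch that runs the three substitutions on the foo{space}{tab}{space}bar example, once in the order given and once with the first two swapped. The helper names `strip_extra_ws` and `strip_extra_ws_swapped` are made up for illustration, and `\K` assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: the three substitutions in the order given above.
sub strip_extra_ws {
    my ($text) = @_;
    $text =~ s/\s(?<![ \n])//g;   # delete whitespace other than space and newline
    $text =~ s/ \K +//g;          # collapse a run of spaces to a single space
    $text =~ s/\n\n\K\n+//g;      # keep at most two consecutive newlines
    return $text;
}

# Same steps with the first two swapped, for comparison only.
sub strip_extra_ws_swapped {
    my ($text) = @_;
    $text =~ s/ \K +//g;          # nothing to collapse yet: the tab separates the spaces
    $text =~ s/\s(?<![ \n])//g;   # removing the tab now leaves two adjacent spaces
    $text =~ s/\n\n\K\n+//g;
    return $text;
}

print strip_extra_ws("foo \t bar"), "\n";          # foo bar
print strip_extra_ws_swapped("foo \t bar"), "\n";  # foo  bar (two spaces survive)
```

In the swapped order the tab is still sitting between the two spaces when the space-collapsing step runs, so that step finds nothing to do.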

I find it odd that foo{tab}bar should become foobar. One usually wants foo{space}bar. To get the latter,

s/(?:\s(?<![ \n]))+/ /g;
s/\n\n\K\n+//g;
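
As a rough illustration of this second variant, the sketch below wraps the two substitutions in a helper (the name `collapse_to_space` is hypothetical) and shows what happens to a tab between words. Note that, as written, this variant leaves runs of plain spaces alone.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions above.
sub collapse_to_space {
    my ($text) = @_;
    $text =~ s/(?:\s(?<![ \n]))+/ /g;  # each run of non-space, non-newline whitespace becomes one space
    $text =~ s/\n\n\K\n+//g;           # keep at most two consecutive newlines
    return $text;
}

print collapse_to_space("foo\tbar"), "\n";    # foo bar
print collapse_to_space("foo\t\tbar"), "\n";  # foo bar
```

An input such as foo{space}{tab}{space}bar comes out with three spaces, because only the tab is rewritten; add the `s/ \K +//g` step afterwards if those should be collapsed too.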

\s(?<![ \n])

is currently equivalent to

[\x{0009}\x{000B}-\x{000D}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

or sometimes the buggy

[\x{0009}\x{000B}-\x{000D}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

Update: While U+000B is considered a space by Unicode and \p{Space}, it's not considered a space by \s for historical reasons.


Replies are listed 'Best First'.
Re^2: regexp: removing extra whitespace by Eliya (Vicar) on Nov 04, 2011 at 19:52 UTC
Shouldn't, in theory, `[^\S \n]` match the same set as your `\s(?<![ \n])` (`\S` being complementary to `\s`)? Just tried it with my perl (v5.12.2), and `[^\S \n]` doesn't match `\x{0085}` and `\x{00A0}`, while `\s(?<![ \n])` does. Now I'm wondering why...

BTW, `\v` (`\x{000B}`) isn't matched in either case, here.
Re^3: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 04, 2011 at 20:57 UTC
Yes, `[^\S \n]` and `\s(?<![ \n])` are equivalent. Well, they should be.

> Just tried it with my perl (v5.12.2), and `[^\S \n]` doesn't match `\x{0085}` and `\x{00A0}`

Sometimes it won't because of a bug, but that applies to both `[^\S \n]` and `\s(?<![ \n])`. See Re: Can I change \s?. 5.12 seems to have another problem on top of that.

5.12:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0    # Expected
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0    # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    0    # Surprised!
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0    # Surprised!

(Last two are really the same.) Now with what should be an equivalent pattern.

    $ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0    # Expected
    $ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0    # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1    # \N always returns an upgraded string.
    $ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1    # Forces the use of an upgraded string.

5.14:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0    # Bug kept for backwards compatibility
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
Re^4: regexp: removing extra whitespace by Eliya (Vicar) on Nov 04, 2011 at 21:28 UTC
> ...because of a bug, but that applies to both `[^\S \n]` and `\s(?<![ \n])`.

What surprises me is that the bug doesn't seem to apply to both (with my perl), i.e. it only shows up with the negated char class, from which it would follow that `\S` and `\s` aren't complementary... However, as the issue appears to be fixed in 5.14, I think we can leave it at that.

(Update) But what about the vertical tab `U+000B`?

    $ /usr/local/bin/perl5.14.1 -E'say "\x0b" =~ /\s/ ?1:0;'
    0
    $ /usr/local/bin/perl5.14.1 -E'say "\N{U+000B}" =~ /\s/ ?1:0;'
    0

Shouldn't it be considered white space? (Not that I've ever encountered it in the wild... just curious.)
Re^5: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 05, 2011 at 01:11 UTC
Re^4: regexp: removing extra whitespace by perlmax (Initiate) on Nov 04, 2011 at 21:37 UTC
> There is no \n{3,} in my code.

That was an abbreviation. It was quicker to type \n{3,} than "it doesn't catch the 3 or more consecutive new lines." Here's what I need to accomplish with the regexp.

    /\s(?<![ \n])/
    /(?<=\s|^) | (?=\s|$)/
    /\n\n\K\n+/

Unfortunately the second statement above doesn't work. I'm using perl v5.10.1 and it gives the following error when that statement is used.

"Variable length lookbehind not implemented in regex /(?<=\s|^) | (?=\s|$)/ at test.pl line 10."
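
One common way around the "Variable length lookbehind" error is to move the alternation out of the lookbehind, so that each lookbehind has a fixed length. The sketch below only rewrites the failing pattern in that way; it does not settle whether the pattern expresses what is really wanted, and the helper name `strip_adjacent_spaces` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: removes every space that is preceded or followed by
# whitespace (or sits at the start or end of the string).
sub strip_adjacent_spaces {
    my ($text) = @_;
    # (?<=\s|^) is rewritten as (?:(?<=\s)|^): the lookbehind is now always one character long.
    $text =~ s/(?:(?<=\s)|^) | (?=\s|$)//g;
    return $text;
}

my $doc = strip_adjacent_spaces("a\n \n \nb");  # lines holding a single space become empty
$doc =~ s/\n\n\K\n+//g;                         # now the newline collapse can see them
print $doc eq "a\n\nb" ? "collapsed\n" : "not collapsed\n";  # collapsed
```

Be aware that this pattern removes both spaces in foo{space}{space}bar, since each one is adjacent to whitespace, so the result is foobar rather than foo{space}bar.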
Re^5: regexp: removing extra whitespace by GrandFather (Saint) on Nov 04, 2011 at 21:41 UTC
Re^6: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 05, 2011 at 01:23 UTC
Re^6: regexp: removing extra whitespace by perlmax (Initiate) on Nov 04, 2011 at 23:24 UTC
Re^2: regexp: removing extra whitespace by perlmax (Initiate) on Nov 04, 2011 at 19:43 UTC
Thanks for your response. I tried the regexp you posted and I'm still encountering a few problems. There are many lines that only contain a single space. Since there's still a space on the line it prevents the regexp from catching the `\n{3,}` occurrences.

I should have specified in my original post that I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s. How can I revise the regexp you posted to include that functionality?

Document example after the regexp:

    \n{space} \n{space} \n{space}

All of the tabs in the document appear after a new line and since I'm preserving the new line characters I'm not worried about replacing the tabs with a space. Any tabs found in the middle of a line would be accidental but I'd still like to check for them and remove them if found.
Re^3: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 04, 2011 at 20:52 UTC
> Since there's still a space on the line it prevents the regexp from catching the `\n{3,}` occurrences.

There is no `\n{3,}` in my code. As for non-empty lines not getting deleted, that's consistent with what you asked. Are you now asking to consider lines with just whitespace to be empty?

> I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s

That makes no sense. That says that {space}{space} should be collapsed to a space (which happens) and that {space}{newline} should be collapsed to {space} (which contradicts what you did say and makes no sense).
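
If lines holding only whitespace should count as empty, one possible approach is to blank those lines first and then collapse the newlines. This is a sketch under that assumption, not something proposed in the thread; the helper name `collapse_blank_lines` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: treats whitespace-only lines as empty, then keeps at
# most two consecutive newlines (that is, at most one blank line).
sub collapse_blank_lines {
    my ($text) = @_;
    $text =~ s/^[^\S\n]+$//mg;   # empty out lines that hold only non-newline whitespace
    $text =~ s/\n\n\K\n+//g;     # keep at most two consecutive newlines
    return $text;
}

my $out = collapse_blank_lines("a\n \n \n \nb");
print $out eq "a\n\nb" ? "ok\n" : "unexpected\n";  # ok
```

The character class `[^\S\n]` means "whitespace other than newline", so with `/m` the first substitution can never join two lines together.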