perlmax has asked for the wisdom of the Perl Monks concerning the following question:

I need to create a regular expression that will remove all extra whitespace from a text document with the exception of spaces and new lines. The document is read into a variable and I then perform a search and replace using regexp. Unfortunately I haven't been able to remove all of the extra whitespace with a single regexp command. Is it possible to use a single regexp statement to accomplish all of the following tasks?

* remove all whitespace other than spaces and new lines
* remove any repeating spaces that appear one after another
* limit number of new lines that appear one after another to a total of 2 (double spaced)

EXAMPLE:
"this is an {space}{space}example\n\n\t\nthis is line two"

In the example above I would need to remove the two extra spaces, the tab, and one of the new lines. The simplest way to remove extra whitespace would be to use something like /[\t\v\f\r]|\s(?=\s)/ and then replace each occurrence with an empty string. However, I need to preserve some spaces and new lines, so I can't use \s on its own, and the explicit character class above doesn't catch all Unicode whitespace characters.

Replies are listed 'Best First'.
Re: regexp: removing extra whitespace
by ikegami (Patriarch) on Nov 04, 2011 at 19:08 UTC
    s/\s(?<![ \n])//g; s/ \K +//g; s/\n\n\K\n+//g;

    The order of the first two matters (e.g. foo{space}{tab}{space}bar). I gave them in the same order you requested them.
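To see the three substitutions and the ordering remark in action, here is a minimal sketch. The wrapper name squeeze_ws and the variable names are made up for illustration, and \K assumes perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three substitutions quoted above.
sub squeeze_ws {
    my ($text) = @_;
    $text =~ s/\s(?<![ \n])//g;   # 1. drop whitespace other than space and newline
    $text =~ s/ \K +//g;          # 2. collapse runs of spaces to one space
    $text =~ s/\n\n\K\n+//g;      # 3. cap runs of newlines at two
    return $text;
}

# The OP's example: two extra spaces, a tab, and one newline too many.
my $example = squeeze_ws("this is an  example\n\n\t\nthis is line two");

# The ordering case: the spaces only become adjacent once the tab is gone.
my $ordered = squeeze_ws("foo \t bar");

print "$example\n";
print "$ordered\n";
```

Swapping steps 1 and 2 would leave two spaces in the second string, because the tab still separates them when the space-collapsing step runs.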


    I find it odd that foo{tab}bar should become foobar. One usually wants foo{space}bar. To get the latter,

    s/(?:\s(?<![ \n]))+/ /g; s/\n\n\K\n+//g;
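A small sketch of that variant, using a made-up input string, shows a run of non-space, non-newline whitespace becoming a single space instead of vanishing:

```perl
use strict;
use warnings;

# Illustrative input only: a tab between words and four newlines in a row.
my $text = "foo\tbar\n\n\n\nbaz";

$text =~ s/(?:\s(?<![ \n]))+/ /g;   # each run of "other" whitespace becomes one space
$text =~ s/\n\n\K\n+//g;            # cap runs of newlines at two

print "$text\n";
```

Note that this pair of substitutions has no step that collapses runs of plain spaces, so spaces already in the input are left alone.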

    \s(?<![ \n])

    is currently equivalent to

    [\x{0009}\x{000B}-\x{000D}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

    or sometimes the buggy

    [\x{0009}\x{000B}-\x{000D}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]

    Update: While U+000B is considered a space by Unicode and \p{Space}, it's not considered a space by \s for historical reasons.
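Since the exact set depends on the perl version, one way to check it locally is to enumerate the code points the pattern matches. This is only a sketch: the upper bound of U+3000 is chosen to cover the list above, and utf8::upgrade is used so that Unicode rules apply to the low code points as well.

```perl
use strict;
use warnings;

# Collect every code point up to U+3000 that the pattern matches.
# Characters above 0xFF are always stored upgraded; utf8::upgrade makes
# the low ones behave the same way.
my @matched;
for my $cp (0 .. 0x3000) {
    my $ch = chr $cp;
    utf8::upgrade($ch);
    push @matched, $cp if $ch =~ /\s(?<![ \n])/;
}

printf "U+%04X\n", $_ for @matched;
```

Whether U+000B or U+180E shows up in the output will vary with the perl you run this on.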

      Shouldn't - in theory - [^\S \n] match the same set as your \s(?<![ \n])  (\S being complementary to \s)?

      Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}, while \s(?<![ \n]) does.  Now I'm wondering why...

      BTW, \v (\x{000B}) isn't matched in either case, here.

        Yes, [^\S \n] and \s(?<![ \n]) are equivalent. Well, should be.

        Just tried it with my perl (v5.12.2), and [^\S \n] doesn't match \x{0085} and \x{00A0}

        Sometimes it won't because of a bug, but that applies to both [^\S \n] and \s(?<![ \n]). See Re: Can I change \s?.

        5.12 seems to have another problem on top of that.

        5.12:

        $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        0   # Expected

        $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        0   # Feature unicode_strings doesn't fix regexes yet.

        $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
        0   # Surprised!

        $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        0   # Surprised!

        (Last two are really the same.)

        Now with what should be an equivalent pattern.

        $ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
        0   # Expected

        $ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
        0   # Feature unicode_strings doesn't fix regexes yet.

        $ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
        1   # \N always returns an upgraded string.

        $ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
        1   # Forces the use of an upgraded string.

        5.14:

        $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        0   # Bug kept for backwards compatibility

        $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        1

        $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
        1

        $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
        1

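The one-liners above differ only in how the string is stored internally. A minimal sketch of forcing the upgraded form by hand, assuming a script that does not enable the unicode_strings feature, might look like this (the variable names are illustrative):

```perl
use strict;
use warnings;

my $native   = "\x{00A0}";   # typically stored as a single byte
my $upgraded = "\x{00A0}";
utf8::upgrade($upgraded);    # same character, UTF-8 internal storage

my $native_hit   = $native   =~ /\s(?<![ \n])/ ? 1 : 0;
my $upgraded_hit = $upgraded =~ /\s(?<![ \n])/ ? 1 : 0;

print "native:   $native_hit\n";
print "upgraded: $upgraded_hit\n";
```

The two strings still compare equal with eq; only the matching rules applied to them differ.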
      Thanks for your response. I tried the regexp you posted and I'm still encountering a few problems. There are many lines that only contain a single space. Since there's still a space on the line it prevents the regexp from catching the \n{3,} occurrences. I should have specified in my original post that I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s. How can I revise the regexp you posted to include that functionality? Document example after the regexp:
      \n{space} \n{space} \n{space}
      All of the tabs in the document appear after a new line and since I'm preserving the new line characters I'm not worried about replacing the tabs with a space. Any tabs found in the middle of a line would be accidental but I'd still like to check for them and remove them if found.

        Since there's still a space on the line it prevents the regexp from catching the \n{3,} occurrences.

        There is no \n{3,} in my code. As for non-empty lines not getting deleted, that's consistent with what you asked for. Are you now asking to treat lines containing only whitespace as empty?

        I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s

        That makes no sense. That says that {space}{space} should be collapsed to a space (which happens) and that {space}{newline} should be collapsed to {space} (which contradicts what you did say and makes no sense).
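The thread never settles what the follow-up actually wants. If the answer to "consider lines with just whitespace to be empty?" were yes, one possible reading could be sketched as below. The extra substitution and the name squeeze_blank_lines are assumptions for illustration, not something posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical variant: assumes lines holding only whitespace count as empty.
sub squeeze_blank_lines {
    my ($text) = @_;
    $text =~ s/\s(?<![ \n])//g;   # drop whitespace other than space and newline
    $text =~ s/ +(?=\n)//g;       # assumption: also drop spaces at the end of a line
    $text =~ s/ \K +//g;          # collapse remaining runs of spaces
    $text =~ s/\n\n\K\n+//g;      # cap runs of newlines at two
    return $text;
}

# Three lines that contain nothing but a single space.
my $result = squeeze_blank_lines("one\n \n \n \ntwo");
print "$result\n";
```

This only strips spaces that sit directly before a newline; spaces at the start of a line are kept, which may or may not be what was intended.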

Re: regexp: removing extra whitespace
by JavaFan (Canon) on Nov 05, 2011 at 00:52 UTC
    Do you want all rules to be performed simultaneously? Or sequentially? That is, if I have "foo{space}{tab}{space}bar" should that result in "foo{space}{space}bar", or in "foo{space}bar"? Rule 2 says repeating spaces should be collapsed, but the original string doesn't have repeated spaces - they only repeat after rule 1 has been applied.

    Assuming rules should be applied in order:

    no warnings "uninitialized";
    s/([ \n])|\s/$1/g;
    s/(\s)\K\1+//g;
    If they apply all at once:
    no warnings "uninitialized";
    s/([ \n])\1+|\s/$1/g;
    (None of the snippets above were tested).
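Since the snippets were posted untested, here is a rough harness for trying the "applied in order" version on two small inputs. The wrapper name in_order and the inputs are made up for illustration.

```perl
use strict;
use warnings;
no warnings "uninitialized";     # $1 is undef whenever the bare \s branch matches

# Hypothetical wrapper around the "applied in order" snippet.
sub in_order {
    my ($text) = @_;
    $text =~ s/([ \n])|\s/$1/g;  # keep spaces and newlines, drop other whitespace
    $text =~ s/(\s)\K\1+//g;     # collapse a repeated whitespace character to one
    return $text;
}

my $spaces   = in_order("foo \t bar");
my $newlines = in_order("a\n\n\nb");

print "$spaces\n";
print "$newlines\n";
```

In this sketch the second substitution treats newlines like spaces, so three newlines in a row come out as one rather than two.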
Re: regexp: removing extra whitespace
by Khen1950fx (Canon) on Nov 05, 2011 at 03:33 UTC
    I would do it sequentially. Starting with your first question, I'd remove extra whitespace.
    #!/usr/bin/perl -l
    use strict;
    use warnings;

    my $str = "this is an example";
    $str =~ s/\s+/ /g;
    print $str;

    I tried a bunch of different methods, but this was the most consistent, easiest way that I could find.

    Update: I modified the code from String::Trim so that it only removes extra whitespace from within the string:
    #!/usr/bin/perl -l
    use strict;

    my $str = "This is a start, but not a finished product however. ";
    my @str = ('This is a start, ', 'but not a finished product however. ');
    trim($str);
    trim(@str);
    print $str;
    print @str;

    sub trim {
        my $t =~ s/\s+/ /g;
        if (defined wantarray) {
            @_ = (@_ ? @_ : $_);
            if (ref $_[0] eq 'ARRAY') {
                @_ = @{ $_[0];};
                foreach $_ (@_) { s/\s+/ /g if defined $_ }
                return \@_;
            }
            elsif (ref $_[0] eq 'HASH') {
                foreach my $k (keys %{$_[0];}) {
                    (my $nk = $k) =~ s/\s+/ /g;
                    if (defined $_[0]->{$k}) {
                        ($_[0]->{$nk} = $_[0]->{$k}) =~ s/\s+/ /g;
                    }
                    else {
                        $_[0]->{$nk} = undef;
                    }
                    delete $_[0]->{$k} unless $k eq $nk;
                }
            }
            else {
                for (@_ ? @_ : $_) { s/\s+/ /g if defined $_ }
            }
            return wantarray ? @_ : $_[0];
        }
        else {
            if (ref $_[0] eq 'ARRAY') {
                for (@{ $_[0] }) { s/\s+/ /g if defined $_ }
            }
            elsif (ref $_[0] eq 'HASH') {
                foreach my $k (keys %{ $_[0] }) {
                    (my $nk = $k) =~ s/\s+/ /g;
                    if (defined $_[0]->{$k}) {
                        ($_[0]->{$nk} = $_[0]->{$k}) =~ s/\s+/ /g;
                    }
                    else {
                        $_[0]->{$nk} = undef;
                    }
                    delete $_[0]->{$k} unless $k eq $nk
                }
            }
            else {
                for (@_ ? @_ : $_) { s/\s+/ /g if defined $_ }
            }
        }
    }

      I tried a bunch of different methods, but this was the most consistent, easiest way that I could find.

      The most consistent at doing what? Not at doing what the OP wants, that's for sure.
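To make the objection concrete, here is a short sketch that runs s/\s+/ /g over the OP's example string (written out with two literal spaces and a tab):

```perl
use strict;
use warnings;

# The OP's example: the blank line between the two lines should survive.
my $text = "this is an  example\n\n\t\nthis is line two";

# Copy, then squeeze every run of whitespace to a single space.
(my $flattened = $text) =~ s/\s+/ /g;

print "$flattened\n";
```

Because \s also matches newlines, the line breaks the OP wanted to keep are replaced by spaces along with everything else.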