regexp: removing extra whitespace

perlmax has asked for the wisdom of the Perl Monks concerning the following question:

Re: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 04, 2011 at 19:08 UTC
    s/\s(?<![ \n])//g;
    s/ \K +//g;
    s/\n\n\K\n+//g;

The order of the first two matters (e.g. `foo{space}{tab}{space}bar`). I gave them in the same order you requested them.

I find it odd that `foo{tab}bar` should become `foobar`. One usually wants `foo{space}bar`. To get the latter,

    s/(?:\s(?<![ \n]))+/ /g;
    s/\n\n\K\n+//g;

`\s(?<![ \n])` is currently equivalent to

`[\x{0009}\x{000B}-\x{000D}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]`

or sometimes the buggy

`[\x{0009}\x{000B}-\x{000D}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]`

Update: While U+000B is considered a space by Unicode and `\p{Space}`, it's not considered a space by `\s` for historical reasons.
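To make the two variants and the ordering remark concrete, here is a minimal sketch that wraps the substitutions above in subs and runs them on the `foo{space}{tab}{space}bar` example. The sub names (`squeeze_delete`, `squeeze_replace`, `squeeze_wrong_order`) are made up for illustration, and `\K` assumes perl 5.10 or later.

```perl
use strict;
use warnings;

# Variant 1: delete "other" whitespace, then collapse spaces and newlines.
sub squeeze_delete {
    my ($s) = @_;
    $s =~ s/\s(?<![ \n])//g;   # remove whitespace that is neither space nor newline
    $s =~ s/ \K +//g;          # keep the first space of a run, drop the rest
    $s =~ s/\n\n\K\n+//g;      # keep two newlines of a run, drop the rest
    return $s;
}

# Variant 2: turn each run of "other" whitespace into one space instead.
sub squeeze_replace {
    my ($s) = @_;
    $s =~ s/(?:\s(?<![ \n]))+/ /g;
    $s =~ s/\n\n\K\n+//g;
    return $s;
}

# Same as variant 1 with the first two steps swapped, to show why order matters.
sub squeeze_wrong_order {
    my ($s) = @_;
    $s =~ s/ \K +//g;
    $s =~ s/\s(?<![ \n])//g;
    $s =~ s/\n\n\K\n+//g;
    return $s;
}

my $sample = "foo \t bar";
printf "%-20s [%s]\n", 'squeeze_delete',      squeeze_delete($sample);
printf "%-20s [%s]\n", 'squeeze_wrong_order', squeeze_wrong_order($sample);
printf "%-20s [%s]\n", 'squeeze_replace',     squeeze_replace("foo\tbar");
```

With the steps swapped, the two spaces only become adjacent after the tab is gone, so nothing collapses them and `squeeze_wrong_order` leaves two spaces between `foo` and `bar`.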
Re^2: regexp: removing extra whitespace by Eliya (Vicar) on Nov 04, 2011 at 19:52 UTC
Shouldn't - in theory - `[^\S \n]` match the same set as your `\s(?<![ \n])` (`\S` being complementary to `\s`)? Just tried it with my perl (v5.12.2), and `[^\S \n]` doesn't match `\x{0085}` and `\x{00A0}`, while `\s(?<![ \n])` does. Now I'm wondering why... BTW, `\v` (`\x{000B}`) isn't matched in either case, here.
Re^3: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 04, 2011 at 20:57 UTC
Yes, `[^\S \n]` and `\s(?<![ \n])` are equivalent. Well, should be.

"Just tried it with my perl (v5.12.2), and `[^\S \n]` doesn't match `\x{0085}` and `\x{00A0}`"

Sometimes it won't because of a bug, but that applies to both `[^\S \n]` and `\s(?<![ \n])`. See "Re: Can I change \s?". 5.12 seems to have another problem on top of that.

5.12:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0  # Expected
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0  # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    0  # Surprised!
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0  # Surprised!

(Last two are really the same.) Now with what should be an equivalent pattern.

    $ perl -le'print "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0  # Expected
    $ perl -E'say "\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    0  # Feature unicode_strings doesn't fix regexes yet.
    $ perl -le'print "\N{U+00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1  # \N always returns an upgraded string.
    $ perl -le'print "\x{2660}\x{00A0}" =~ /\s(?<![ \n])/ ?1:0;'
    1  # Forces the use of an upgraded string.

5.14:

    $ perl -le'print "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    0  # Bug kept for backwards compatibility
    $ perl -E'say "\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\N{U+00A0}" =~ /[^\S \n]/ ?1:0;'
    1
    $ perl -le'print "\x{2660}\x{00A0}" =~ /[^\S \n]/ ?1:0;'
    1
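The downgraded/upgraded distinction behind these results can also be reproduced in one script instead of separate one-liners. The sketch below is not from the thread: it uses `utf8::upgrade` to switch the internal storage of the same one-character string, and adds the `/u` modifier (perl 5.14 or later) as a third pattern for comparison. `match_flags` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Deliberately no "use v5.12" or "use feature 'unicode_strings'" here:
# either would change how the downgraded string is matched.

my $class_re      = qr/[^\S \n]/;      # negated-class form
my $lookbehind_re = qr/\s(?<![ \n])/;  # lookbehind form
my $class_re_u    = qr/[^\S \n]/u;     # /u forces Unicode rules (perl 5.14+)

# Returns a string of three 0/1 flags, one per pattern above.
sub match_flags {
    my ($string) = @_;
    return join '', map { $string =~ $_ ? 1 : 0 }
        $class_re, $lookbehind_re, $class_re_u;
}

my $nbsp = "\x{00A0}";                 # U+00A0 in the one-byte (downgraded) form
my $downgraded = match_flags($nbsp);

utf8::upgrade($nbsp);                  # same character, UTF-8 internal form
my $upgraded = match_flags($nbsp);

print "downgraded: $downgraded\n";
print "upgraded:   $upgraded\n";
```

On perl 5.14 or later this should print `001` for the downgraded string and `111` for the upgraded one: the two plain patterns agree with each other in both cases, and only the internal representation of the string changes the answer.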
Re^4: regexp: removing extra whitespace by Eliya (Vicar) on Nov 04, 2011 at 21:28 UTC
Re^5: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 05, 2011 at 01:11 UTC
Re^4: regexp: removing extra whitespace by perlmax (Initiate) on Nov 04, 2011 at 21:37 UTC
Re^5: regexp: removing extra whitespace by GrandFather (Saint) on Nov 04, 2011 at 21:41 UTC
Re^2: regexp: removing extra whitespace by perlmax (Initiate) on Nov 04, 2011 at 19:43 UTC
Thanks for your response. I tried the regexp you posted and I'm still encountering a few problems. Many lines contain only a single space. Because there's still a space on the line, the regexp doesn't catch the `\n{3,}` occurrences.

I should have specified in my original post that I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s. How can I revise the regexp you posted to include that functionality?

Document example after the regexp: `\n{space} \n{space} \n{space}`

All of the tabs in the document appear after a newline, and since I'm preserving the newline characters I'm not worried about replacing the tabs with a space. Any tabs found in the middle of a line would be accidental, but I'd still like to check for them and remove them if found.
Re^3: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 04, 2011 at 20:52 UTC
"Since there's still a space on the line it prevents the regexp from catching the `\n{3,}` occurrences."

There is no `\n{3,}` in my code. As for non-empty lines not getting deleted, that's consistent with what you asked. Are you now asking to consider lines with just whitespace to be empty?

"I need to catch all spaces that are preceded or followed by additional whitespace. Instead of just {space}{space} it should also check for {space}\s"

That makes no sense. That says that {space}{space} should be collapsed to a space (which happens) and that {space}{newline} should be collapsed to {space} (which contradicts what you did say and makes no sense).
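If the answer to that question is yes, i.e. lines holding only spaces or tabs should count as empty, one possible reading is to strip trailing spaces before collapsing newlines. The sketch below is an assumption about what perlmax wants, not something confirmed in the thread; `normalize_blank_lines` is a made-up name and the extra trailing-space step is the only addition to the substitutions posted earlier.

```perl
use strict;
use warnings;

sub normalize_blank_lines {
    my ($text) = @_;
    $text =~ s/\s(?<![ \n])//g;  # remove tabs and other non-space, non-newline whitespace
    $text =~ s/ +$//mg;          # assumption: trailing spaces go, so space-only lines become empty
    $text =~ s/ \K +//g;         # collapse remaining runs of spaces
    $text =~ s/\n\n\K\n+//g;     # at most two consecutive newlines
    return $text;
}

# Lines that contain a single space, as in the example document.
my $doc = "first\n \n \n \nsecond  line\n";
print normalize_blank_lines($doc);
```

Note that this also removes trailing spaces on lines that do have content, which may or may not be wanted.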
Re: regexp: removing extra whitespace by JavaFan (Canon) on Nov 05, 2011 at 00:52 UTC
Do you want all rules to be performed simultaneously? Or sequentially? That is, if I have `"foo{space}{tab}{space}bar"` should that result in `"foo{space}{space}bar"`, or in `"foo{space}bar"`? Rule 2 says repeating spaces should be collapsed, but the original string doesn't have repeated spaces - they only repeat after rule 1 has been applied.

Assuming rules should be applied in order:

    no warnings "uninitialized";
    s/([ \n])|\s/$1/g;
    s/(\s)\K\1+//g;

If they apply all at once:

    no warnings "uninitialized";
    s/([ \n])\1+|\s/$1/g;

(None of the snippets above were tested).
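Since the snippets are marked untested, here is a small harness that runs both readings on the `foo{space}{tab}{space}bar` string, taking the patterns as plain alternation with `|`. The sub names are invented for this sketch. It also includes a third variant with `\1*` instead of `\1+`; that change is my own guess at the intent, not JavaFan's code.

```perl
use strict;
use warnings;

# Rules applied one after the other.
sub in_order {
    my ($s) = @_;
    no warnings 'uninitialized';   # $1 is undef whenever the \s branch matches
    $s =~ s/([ \n])|\s/$1/g;       # keep spaces/newlines, delete other whitespace
    $s =~ s/(\s)\K\1+//g;          # collapse runs of the same whitespace character
    return $s;
}

# Single pass, exactly as posted.
sub one_pass_as_posted {
    my ($s) = @_;
    no warnings 'uninitialized';
    $s =~ s/([ \n])\1+|\s/$1/g;
    return $s;
}

# Single pass with \1* so that a lone space or newline takes the first branch.
# This adjustment is an assumption about the intent.
sub one_pass_adjusted {
    my ($s) = @_;
    no warnings 'uninitialized';
    $s =~ s/([ \n])\1*|\s/$1/g;
    return $s;
}

my $sample = "foo \t bar";
printf "%-20s [%s]\n", $_->[0], $_->[1]->($sample)
    for [ in_order           => \&in_order ],
        [ one_pass_as_posted => \&one_pass_as_posted ],
        [ one_pass_adjusted  => \&one_pass_adjusted ];
```

As posted, the one-pass version lets a lone space fall through to the `\s` branch, where `$1` is undefined, so single spaces are deleted and the sample becomes `foobar`. With `\1*` the sample keeps both spaces, which matches the `foo{space}{space}bar` outcome described for the simultaneous reading.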
Re: regexp: removing extra whitespace by Khen1950fx (Canon) on Nov 05, 2011 at 03:33 UTC
I would do it sequentially. Starting with your first question, I'd remove extra whitespace.

    #!/usr/bin/perl -l
    use strict;
    use warnings;

    my $str = "this is an example";
    $str =~ s/\s+/ /g;
    print $str;

I tried a bunch of different methods, but this was the most consistent, easiest way that I could find.

Update: I modified the code from String::Trim so that it only removes extra whitespace from within the string:

    #!/usr/bin/perl -l
    use strict;

    my $str = "This is a start, but not a finished product however. ";
    my @str = ('This is a start, ', 'but not a finished product however. ');
    trim($str);
    trim(@str);
    print $str;
    print @str;

    sub trim {
        # No-op: $t is a freshly declared, undefined variable.
        my $t =~ s/\s+/ /g;
        if (defined wantarray) {
            # Called in scalar or list context: return the modified values.
            @_ = (@_ ? @_ : $_);
            if (ref $_[0] eq 'ARRAY') {
                @_ = @{ $_[0];};
                foreach $_ (@_) { s/\s+/ /g if defined $_ }
                return \@_;
            }
            elsif (ref $_[0] eq 'HASH') {
                foreach my $k (keys %{$_[0];}) {
                    (my $nk = $k) =~ s/\s+/ /g;
                    if (defined $_[0]->{$k}) {
                        ($_[0]->{$nk} = $_[0]->{$k}) =~ s/\s+/ /g;
                    }
                    else {
                        $_[0]->{$nk} = undef;
                    }
                    delete $_[0]->{$k} unless $k eq $nk;
                }
            }
            else {
                for (@_ ? @_ : $_) { s/\s+/ /g if defined $_ }
            }
            return wantarray ? @_ : $_[0];
        }
        else {
            # Called in void context: modify the arguments in place.
            if (ref $_[0] eq 'ARRAY') {
                for (@{ $_[0] }) { s/\s+/ /g if defined $_ }
            }
            elsif (ref $_[0] eq 'HASH') {
                foreach my $k (keys %{ $_[0] }) {
                    (my $nk = $k) =~ s/\s+/ /g;
                    if (defined $_[0]->{$k}) {
                        ($_[0]->{$nk} = $_[0]->{$k}) =~ s/\s+/ /g;
                    }
                    else {
                        $_[0]->{$nk} = undef;
                    }
                    delete $_[0]->{$k} unless $k eq $nk
                }
            }
            else {
                for (@_ ? @_ : $_) { s/\s+/ /g if defined $_ }
            }
        }
    }
Re^2: regexp: removing extra whitespace by ikegami (Patriarch) on Nov 05, 2011 at 20:14 UTC
"I tried a bunch of different methods, but this was the most consistent, easiest way that I could find."

The most consistent at doing what? Not at doing what the OP wants, that's for sure.
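A quick side-by-side shows the mismatch. perlmax said above that newlines are to be preserved, while `s/\s+/ /g` turns every newline into a space. The sketch below compares it with the three substitutions from the first reply; the sample text and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# The blanket approach: every run of whitespace becomes one space.
sub flatten_all {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return $s;
}

# The three substitutions from the first reply, for comparison.
sub keep_newlines {
    my ($s) = @_;
    $s =~ s/\s(?<![ \n])//g;
    $s =~ s/ \K +//g;
    $s =~ s/\n\n\K\n+//g;
    return $s;
}

my $text = "line one\n\n\n\nline  two\twith tab\n";

print "flatten_all:   [", flatten_all($text),   "]\n";
print "keep_newlines: [", keep_newlines($text), "]\n";
```

`flatten_all` returns a single line, whereas `keep_newlines` keeps the paragraph break (reduced to two newlines) and the final newline.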