in reply to Deleting intermediate whitespaces, but leaving one behind each word

You have received some excellent suggestions already but, for the sake of completeness, here's a solution that also removes leading spaces if necessary. You can pre-compile a regular expression using qr{ ... } ( see Quote and Quote like Operators ) for use in a later match or substitution and you can use extended syntax to comment the expression.

I prefer, where possible, to match only what we want to remove and replace it with nothing, rather than matching what we want to remove together with what we want to keep, capturing the latter and using the capture ( $1 etc. ) in the replacement part. We want to match spaces preceded by the beginning of the string OR spaces preceded by a single space OR spaces followed by the end of the string. To do this we can use zero-width look-behind and look-ahead assertions ( see "Lookaround Assertions" in Extended Patterns ) to match multiple spaces just where we want. This code:-

    use strict;
    use warnings;
    use feature qw{ say };

    my $str = q{ Intel(R) Xeon(R) CPU X5660 2.80GHz };

    my $rxSpaces = qr{(?x)    # Use regex extended syntax to allow comments
        (?:                   # Open non-capturing group for alternation
            (?<= \A ) \s+     # Spaces preceded by beginning of string
            |                 # or
            (?<= \s ) \s+     # Spaces preceded by a single space
            |                 # or
            \s+ (?= \z )      # Spaces followed by end of string
        )                     # Close group
    };

    # Replace matching spaces by nothing globally.
    #
    $str =~ s{$rxSpaces}{}g;

    say qq{-->$str<--};

produces this output:-

-->Intel(R) Xeon(R) CPU X5660 2.80GHz<--
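
To make the contrast between the two styles concrete, here is a minimal sketch covering only the "spaces preceded by a single space" case. The names $rxRemove and $rxCapture and the sample string are made up for illustration and are not part of the code above.

```perl
use strict;
use warnings;

# Style 1: match only the surplus spaces and replace them with nothing.
my $rxRemove = qr{ (?<= \s ) \s+ }x;

# Style 2: match the space to keep as well, capture it and put it back.
my $rxCapture = qr{ ( \s ) \s+ }x;

# Hypothetical sample with runs of spaces between the words.
my $str = 'CPU    X5660  2.80GHz';

( my $removed  = $str ) =~ s{$rxRemove}{}g;
( my $captured = $str ) =~ s{$rxCapture}{$1}g;

print "-->$removed<--\n";     # -->CPU X5660 2.80GHz<--
print "-->$captured<--\n";    # -->CPU X5660 2.80GHz<--
```

Both give the same result here; the first style simply leaves nothing to write in the replacement part.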

I hope this is of interest.

Cheers,

JohnGG

Re^2: Deleting intermediate whitespaces, but leaving one behind each word
by AnomalousMonk (Archbishop) on Dec 06, 2017 at 03:51 UTC

    Please forgive the nit-picky nature of this reply, but your post raised a number of interesting points.

        my $rxSpaces = qr{(?x)    # Use regex extended syntax to allow comments
            (?:                   # Open non-capturing group for alternation
                (?<= \A ) \s+     # Spaces preceded by beginning of string
                |                 # or
                (?<= \s ) \s+     # Spaces preceded by a single space
                |                 # or
                \s+ (?= \z )      # Spaces followed by end of string
            )                     # Close group
        };

    Many of the details of this regex no doubt have an expository purpose. However, more or less in descending order of importance:

    • In the  (?<= \A ) \s+ and  \s+ (?= \z ) sub-patterns, the zero-width look-around assertions are overkill because  \A and  \z are already zero-width assertions, so the simpler  \A \s+ and  \s+ \z (respectively) are exactly equivalent and IMHO preferable;
    • The  (?: ... ) non-capturing group surrounding the alternation is redundant because the whole  qr// is effectively wrapped in a non-capturing group;
    • Lastly, the  (?x) at the start of the regex is IMHO to be avoided in favor of a standard  /xms tail for this (and every!) regex. (This is my personal regex best practice.)
    Then what you have is a regex like
        qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms
    which IMHO is very easy to understand.
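
As a rough check of the equivalence claimed above, the following sketch runs the original and the simplified regex over a few inputs and counts any disagreement. The names $rxVerbose, $rxSimple and @samples, and the sample strings themselves, are assumptions made for illustration.

```perl
use strict;
use warnings;

# The original form, with look-arounds wrapped around \A and \z.
my $rxVerbose = qr{(?x) (?: (?<= \A ) \s+ | (?<= \s ) \s+ | \s+ (?= \z ) ) };

# The simplified form suggested above.
my $rxSimple = qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms;

# Hypothetical inputs: mixed runs, nothing to do, empty, only spaces,
# and other kinds of whitespace.
my @samples = (
    '  Intel(R)   Xeon(R)  ',
    'no-extra',
    '',
    '   ',
    "a\t\tb\n",
);

my @mismatches;
for my $in (@samples) {
    ( my $verbose = $in ) =~ s{$rxVerbose}{}g;
    ( my $simple  = $in ) =~ s{$rxSimple}{}g;
    push @mismatches, $in if $verbose ne $simple;
}

print 'mismatches: ', scalar @mismatches, "\n";    # mismatches: 0
```

A handful of samples is no proof, of course, but it is a cheap way to catch a slip when simplifying.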

    The use of Perl's ordered regex alternation raises the question of the proper order of the sub-patterns. My experience has been that only testing can answer this question reliably:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Test::More 'no_plan';
         use Test::NoWarnings;
         ;;
         note 'perl version: ', $];
         ;;
         use constant S => ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
         use constant T => 'Intel(R) Xeon(R) CPU X5660 2.80GHz';
         ;;
         for my $rxSpaces (
           qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms,
           qr{ \A \s+ | (?<= \s) \s+ | \s+ \z }xms,
           qr{ \A \s+ | \s+ \z | (?<= \s) \s+ }xms,
           ) {
           (my $t = S) =~ s{$rxSpaces}{}g;
           ok $t eq T, qq{$rxSpaces -> \n >$t<};
           }
         ;;
         note qq{still with spaces? >${ \S }<};
         done_testing;
         "
        # perl version: 5.008009
        ok 1 - (?msx-i: (?<= \s) \s+ | \A \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        ok 2 - (?msx-i: \A \s+ | (?<= \s) \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        ok 3 - (?msx-i: \A \s+ | \s+ \z | (?<= \s) \s+ ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        # still with spaces? > Intel(R) Xeon(R) CPU X5660 2.80GHz <
        1..3
        ok 4 - no warnings
        1..4
    Ok, no ordering dependency is seen.

    Now you think, "Gee, with Perl 5.10 there's that neat  \K variable-width look-behind emulation operator I can use to simplify the regex even more!" Unfortunately, after testing (and you always test this stuff, right?) you find a problem:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Test::More 'no_plan';
         use Test::NoWarnings;
         ;;
         note 'perl version: ', $];
         ;;
         use constant S => ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
         use constant T => 'Intel(R) Xeon(R) CPU X5660 2.80GHz';
         ;;
         for my $rxSpaces (
           qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms,
           qr{ \A \s+ | (?<= \s) \s+ | \s+ \z }xms,
           qr{ \A \s+ | \s+ \z | (?<= \s) \s+ }xms,
           qr{ \s \K \s+ | \A \s+ | \s+ \z }xms,
           qr{ \A \s+ | \s \K \s+ | \s+ \z }xms,
           qr{ \A \s+ | \s+ \z | \s \K \s+ }xms,
           ) {
           (my $t = S) =~ s{$rxSpaces}{}g;
           ok $t eq T, qq{$rxSpaces -> \n >$t<};
           }
         ;;
         note qq{still with spaces? >${ \S }<};
         done_testing;
         "
        # perl version: 5.010001
        ok 1 - (?msx-i: (?<= \s) \s+ | \A \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        ok 2 - (?msx-i: \A \s+ | (?<= \s) \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        ok 3 - (?msx-i: \A \s+ | \s+ \z | (?<= \s) \s+ ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        not ok 4 - (?msx-i: \s \K \s+ | \A \s+ | \s+ \z ) ->
        # > Intel(R) Xeon(R) CPU X5660 2.80GHz <
        # Failed test '(?msx-i: \s \K \s+ | \A \s+ | \s+ \z ) ->
        # > Intel(R) Xeon(R) CPU X5660 2.80GHz <'
        # at -e line 1.
        not ok 5 - (?msx-i: \A \s+ | \s \K \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz <
        # Failed test '(?msx-i: \A \s+ | \s \K \s+ | \s+ \z ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz <'
        # at -e line 1.
        ok 6 - (?msx-i: \A \s+ | \s+ \z | \s \K \s+ ) ->
        # >Intel(R) Xeon(R) CPU X5660 2.80GHz<
        # still with spaces? > Intel(R) Xeon(R) CPU X5660 2.80GHz <
        1..6
        ok 7 - no warnings
        1..7
        # Looks like you failed 2 tests of 7.
    Hmmm... The  (?<= \s) \s+ sub-pattern continues to work just fine everywhere, but the seemingly equivalent  \s \K \s+ sub-pattern only works in the last position in the ordered alternation. Why? (Food for thought, this.)
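
One way to start exploring this question is a smaller case. The sketch below assumes Perl 5.10 or later and uses a made-up sample string and variable names. It relies on one fact about \K: the \s in front of it is still consumed by the match, even though it is left out of the text that gets replaced. A look-behind consumes nothing.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical sample: three spaces in front, in the middle and at the end.
my $s = '   a   b   ';

# Look-behind first: at the very start of the string it cannot match
# (nothing precedes position 0), so \A \s+ takes the whole leading run.
( my $look = $s ) =~ s{ (?<= \s) \s+ | \A \s+ | \s+ \z }{}xmsg;

# \K first: this alternative wins at the start of every run of two or
# more spaces, including the leading and trailing runs, and it always
# keeps the first space of the run.
( my $keep = $s ) =~ s{ \s \K \s+ | \A \s+ | \s+ \z }{}xmsg;

print ">$look<\n";    # >a b<
print ">$keep<\n";    # > a b <
```

Once the \K alternative has matched a leading or trailing run, the search resumes after that run, so \A \s+ and \s+ \z never get a chance at the space that was kept. With the \K alternative in last position, the anchored alternatives are tried first at each position and win where they apply.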

    A lot of these points echo those made by Laurent_R here: regexes are really neat and I love them, but they're not always the ideal tool for the job.


    Give a man a fish:  <%-{-{-{-<

Re^2: Deleting intermediate whitespaces, but leaving one behind each word
by Laurent_R (Canon) on Dec 06, 2017 at 00:32 UTC
    Yes, johngg,

    TIMTOWTDI, and, yes, I think this is definitely of interest++. Having said that, I feel that using zero-width look-around assertions for such a simple case might be a bit of overkill, at least for a beginner who obviously doesn't know very much about regexes at this point.

    Personally, I'm using look-around assertions only from time to time (sometimes, it is really the best solution), but not often enough to always remember the exact syntax by heart, so that when I feel this is the right solution, I usually have to look it up in Johan Vromans's Perl Pocket Reference (or on the net or somewhere else). For such a simple case, I would rather do most of the job with a simple s/\s+/ /g regex, and add one or two simple regexes to handle leading and trailing spaces if needed. My colleagues having to maintain my code will probably thank me for that and I will be even more delighted when the person having to maintain this code a year from now will be... me.
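
The simpler approach described above could look something like the following minimal sketch. The sample string and the variable names are assumptions made for illustration, not code from this thread.

```perl
use strict;
use warnings;

# Hypothetical sample with leading, intermediate and trailing runs of spaces.
my $str = '   Intel(R)   Xeon(R) CPU    X5660  2.80GHz   ';

( my $clean = $str ) =~ s/\s+/ /g;    # squeeze every run of whitespace to one space
$clean =~ s/^\s+//;                   # drop whatever is left at the front
$clean =~ s/\s+$//;                   # drop whatever is left at the end

print "-->$clean<--\n";               # -->Intel(R) Xeon(R) CPU X5660 2.80GHz<--
```

Three substitutions instead of one, but each can be read at a glance without reaching for the reference.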

    BTW, Perl 6's regexes have a much cleaner syntax for look-around assertions, so that I would not have the same second thoughts in P6. But that's getting slightly OT, sorry for that.