regex doubt on excluding

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


Re: regex doubt on excluding
by hdb (Monsignor) on Apr 21, 2014 at 08:40 UTC

A question on whitespace is difficult to address because one cannot see the objects of interest. Your regex suggests that, in a multiline string, you want to replace lines consisting only of whitespace with truly empty lines. Whitespace in non-empty lines will be preserved. I am replacing the whitespace with 'x' to show where the matches occur:

use strict;
use warnings;

my $string = "next line is spaces
      
next line is tabs
        
and now some newlines



end";

$string =~ s/^\s*$/x/mg;

print "$string\n";

next line is spaces
x
next line is tabs
x
and now some newlines
xx
end

Re^2: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 08:44 UTC

Yes, that's it: preserve multiple empty lines (but without whitespace).


Re^3: regex doubt on excluding

by hdb (Monsignor) on Apr 21, 2014 at 08:46 UTC

How about this?

$string =~ s/^\s*?\n$/\n/mg;

UPDATE: the regex above does not work properly, but this one should; the \n and the $ were somewhat redundant:

$string =~ s/^\s*?\n/\n/mg;
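
A minimal sketch of how the updated substitution behaves, using a made-up sample string (the variable names and data are illustrative and not from the original post). It suggests that the line count is preserved, and also shows one caveat: a whitespace-only last line with no trailing newline is left alone.

```perl
use strict;
use warnings;

# Hypothetical sample: a spaces-only line, an empty line, a tab-only line.
my $text = "a\n  \n\n\t\nb\n";

# Lazy \s*? stops at the first newline, so each line is handled on its own.
(my $cleaned = $text) =~ s/^\s*?\n/\n/mg;

# Count newlines before and after: the number of lines is unchanged.
my $before = ($text    =~ tr/\n//);
my $after  = ($cleaned =~ tr/\n//);
print "newlines before: $before, after: $after\n";

# Caveat: a whitespace-only final line without a trailing newline is not touched.
(my $tail = "a\n  ") =~ s/^\s*?\n/\n/mg;
print $tail eq "a\n  " ? "last line unchanged\n" : "last line cleaned\n";
```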

Re^4: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 09:43 UTC

Re: regex doubt on excluding
by Laurent_R (Canon) on Apr 21, 2014 at 08:31 UTC

[ \t]

Re^2: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 08:40 UTC

Thank You Laurent.

So it should be something like this,

$string =~ s/^[ \t]$//mg;

but that code is not working for me!


Re^3: regex doubt on excluding

by Laurent_R (Canon) on Apr 21, 2014 at 08:53 UTC

$string =~ s/^[ \t]+$//mg;

^

$

m
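
A small sketch of the difference the + quantifier makes, assuming a made-up sample string (names and data are illustrative only). Without a quantifier, [ \t] matches exactly one character, so only lines holding a single space or tab are cleaned.

```perl
use strict;
use warnings;

# Hypothetical sample: a three-space line, an empty line, a single-tab line.
my $sample = "a\n   \n\n\t\nb";

# No quantifier: only a line made of exactly one space or tab matches.
(my $one = $sample) =~ s/^[ \t]$//mg;

# With +: a line made of one or more spaces or tabs matches.
(my $many = $sample) =~ s/^[ \t]+$//mg;

# Show the results with newlines made visible.
for my $result ($one, $many) {
    (my $visible = $result) =~ s/\n/\\n/g;
    print "$visible\n";
}
```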


Re^4: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 09:48 UTC

Re: regex doubt on excluding
by Athanasius (Cardinal) on Apr 21, 2014 at 08:41 UTC

Sorry, but the code shown is not removing newlines, as is easily demonstrated:

18:23 >perl -wE "my $s = qq[abc\n\t  \ndef\n  \ngh]; $s =~ s/^\s*$/X/mg; say $s;"
abc
X
def
X
gh

18:24 >

That’s because the /m modifier lets ^ and $ match at the beginning and end of each line within the string (see perlre#Modifiers). What the code does is to remove any whitespace from an otherwise empty line, i.e. whitespace is removed if and only if the whitespace is the only thing between two newlines (or between the beginning of the string and the first newline, or between the last newline and the end of the string). Is this what was intended? Or were you wanting to remove all whitespace except newlines themselves? If the latter, Laurent_R’s approach is what you want:

18:24 >perl -wE "my $s = qq[abc\n\t  \nd\tef\n  \ngh]; $s =~ s/[ \t]+/X/mg; say $s;"
abc
X
dXef
X
gh

18:39 >

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 08:47 UTC

Preserve multiple empty lines (but without whitespace).


Re^3: regex doubt on excluding

by Athanasius (Cardinal) on Apr 21, 2014 at 09:28 UTC

Ok, I understand now, and it seems I spoke too soon: the original code is removing some newlines, since it reduces a sequence of successive newlines to a single one.

I don’t understand how this is working. From perlre#Regular-Expressions:

By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator.

— but I don’t see how this explains the behaviour we are seeing?


Can someone please explain what the regex is doing here?

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
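
One way to look at the behaviour asked about above: \s also matches "\n", so a greedy \s* that starts on an empty line can run across the following newlines and only backtrack far enough for $ to match before the last one. The sketch below uses a made-up string (names and data are illustrative only) and compares \s* with a class that cannot cross a newline.

```perl
use strict;
use warnings;

# Hypothetical sample: three empty lines between "a" and "b".
my $original = "a\n\n\n\nb";

# \s includes "\n", so \s* can swallow newlines between ^ and $.
(my $greedy = $original) =~ s/^\s*$//mg;

# [ \t]* can never consume a newline, so every empty line survives.
(my $narrow = $original) =~ s/^[ \t]*$//mg;

printf "original: %d newlines\n", ($original =~ tr/\n//);
printf "with \\s*:    %d newlines\n", ($greedy =~ tr/\n//);
printf "with [ \\t]*: %d newlines\n", ($narrow =~ tr/\n//);
```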


Re^4: regex doubt on excluding

by SuicideJunkie (Vicar) on Apr 21, 2014 at 14:49 UTC

Re^5: regex doubt on excluding

by Athanasius (Cardinal) on Apr 22, 2014 at 09:33 UTC


Re^5: regex doubt on excluding

by Lotus1 (Vicar) on Apr 21, 2014 at 17:44 UTC

Re^4: regex doubt on excluding

by Anonymous Monk on Apr 21, 2014 at 09:47 UTC

Re: regex doubt on excluding
by jellisii2 (Hermit) on Apr 21, 2014 at 11:27 UTC

The "big hammer" approach says replace all of those characters with a newline.

s/^\s*$/\n/mg;

I'm a fan of the big hammer approach, but as you'd expect subtlety isn't exactly my strong point.
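
A quick sketch of one side effect worth checking before using the big hammer, on a made-up string (names and data are illustrative only): because $ matches before the existing newline, that newline stays in place and the replacement adds another one, so the line count grows.

```perl
use strict;
use warnings;

# Hypothetical sample: one spaces-only line between "a" and "b".
my $input = "a\n  \nb";

# The matched spaces are replaced by "\n"; the line's own newline remains.
(my $hammered = $input) =~ s/^\s*$/\n/mg;

printf "before: %d newlines, after: %d newlines\n",
    ($input =~ tr/\n//), ($hammered =~ tr/\n//);
```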


Re: regex doubt on excluding
by kcott (Archbishop) on Apr 22, 2014 at 02:31 UTC

If you look in "perlre: Metacharacters", you'll see that '$' matches "the end of the line (or before newline at the end)" (emphasis added). So, what you really want is to match any whitespace character except if it follows '$'. That makes your substitution "s/(?<!$)\s//gm", shown here:

#!/usr/bin/env perl -l

use strict;
use warnings;

my $string = "\t is a TAB\n is a SPACE\nTAB\t and SPACE";

print '*** BEFORE ***';
print $string;

$string =~ s/(?<!$)\s//gm;

print '*** AFTER ***';
print $string;

Output:

*** BEFORE ***
     is a TAB
 is a SPACE
TAB     and SPACE
*** AFTER ***
isaTAB
isaSPACE
TABandSPACE

However, if you know that you only want to match the whitespace characters space (" ") and tab ("\t"), then the transliteration "y/\t //d" will be faster. (See "Search and replace or tr" in "Perl Performance and Optimization Techniques" for a Benchmark example.) As you can see, the code is virtually identical (which makes replacing the s/// with y/// easy):

#!/usr/bin/env perl -l

use strict;
use warnings;

my $string = "\t is a TAB\n is a SPACE\nTAB\t and SPACE";

print '*** BEFORE ***';
print $string;

$string =~ y/\t //d;

print '*** AFTER ***';
print $string;

Output:

*** BEFORE ***
     is a TAB
 is a SPACE
TAB     and SPACE
*** AFTER ***
isaTAB
isaSPACE
TABandSPACE

[In case you didn't know, y/// and tr/// are synonymous. You'll find both forms used in different sections of the perlop documentation.]

-- Ken


Re: regex doubt on excluding
by AnomalousMonk (Archbishop) on Apr 21, 2014 at 19:47 UTC

The 'standard' way to construct a character class that excludes characters is [^...].

One way to define the class of all whitespace except a newline would be [^\S\n]. (Note that this is the big-S \S, which is the inverse of \s, i.e., \S matches any character that is not whitespace.) So this is a kind of double negative. You might read it as "the class of all characters that are not non-whitespace and also not a newline."
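
A minimal sketch of that class applied to the question in this thread, assuming the goal is to empty whitespace-only lines while keeping every line break (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample: interior whitespace, a mixed space/tab line, empty lines.
my $doc = "x y\n \t \n\n\nz";

# [^\S\n]+ matches one or more whitespace characters that are not newlines.
(my $stripped = $doc) =~ s/^[^\S\n]+$//mg;

# Show the result with newlines made visible.
(my $visible = $stripped) =~ s/\n/\\n/g;
print "$visible\n";
```

Because the class excludes "\n", the match cannot cross a line boundary, so runs of empty lines keep their length and whitespace inside non-empty lines is left alone.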
