SavannahLion has asked for the wisdom of the Perl Monks concerning the following question:
I'm hoping someone can clarify some behavior for me.
I have the following two code blocks I'm fiddling with. First I tried the following block.
my $phrase = "This is a test, \"using quotes of 'two different' types.\"";
$phrase =~ s/[\S*\W*]//g;
print $phrase;
For some reason, the regex destroys the entire line. Therefore, print just prints a blank line. So after some fiddling I came up with the following block.
my $phrase = "This is a test, \"using quotes of 'two different' types.\"";
$phrase =~ s/[^\s*\w*]//g;
print $phrase;
Which does exactly what I was aiming for in the first place. It produces the following line:
This is a test using quotes of two different types
All quotes, periods, and everything else have been stripped.
In my llama book it states that [^\s] is the same as \S and [^\w] is the same as \W. Now, from what I understand so far, the first block of code should have worked, but it didn't.
Why is that?
Is it fair to stick a link to my site here?
Thanks for your patience.
Re: ^\s not equal \S?
by davido (Cardinal) on Dec 04, 2003 at 08:46 UTC
First, quantifiers inside character classes are not seen as quantifiers, but rather, as literal characters to become a part of the character class. Put your * quantifier outside of the character class if that's what you intend.
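A minimal sketch of that point, using a made-up sample string (not from the original post): inside the brackets the * is simply one more member of the class, so it is matched and deleted literally, while outside the brackets it acts as a quantifier.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only to make the literal '*' visible.
my $sample = "a*b**c";

# Inside the brackets, '*' is an ordinary character, so this class
# means "the letter a, or a literal asterisk".
(my $inside = $sample) =~ s/[a*]//g;

# Outside the brackets, '*' is a quantifier: "zero or more a's".
(my $outside = $sample) =~ s/[a]*//g;

print "inside:  [$inside]\n";   # the asterisks are removed along with 'a'
print "outside: [$outside]\n";  # only 'a' is removed; the asterisks survive
```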
Next, [\S\W] means anything that's either a non-space character or a non-word character (word characters are usually A-Za-z0-9_). Well, just about everything is either a non-space or a non-word. In fact, since there is no overlap between \s and \w, every character fits either the "not space" or the "not word" category, so with the /g modifier every character is matched and the entire line is wiped out.
The second expression is a negated character class. You still need to get rid of those * quantifiers inside of the square brackets. The negated character class says: any character that is neither a space nor a word character. That's different. The only characters that are neither space nor word are things like the comma, the quotes, and other punctuation.
So where your first regex matches everything, and substitutes it with nothing (thus wiping out the string), the second regex matches just characters that are neither space nor word, and substitutes those characters with nothing, leaving you with spaces and words.
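To see the two cases side by side, here is a small sketch that runs both classes, with the stray * characters removed, against the phrase from the question. The variable names $union and $negated are only illustrative.

```perl
use strict;
use warnings;

my $phrase = "This is a test, \"using quotes of 'two different' types.\"";

# Union of "non-space" and "non-word": every character is in at least one.
(my $union = $phrase) =~ s/[\S\W]//g;

# Negated class: only characters that are neither space nor word.
(my $negated = $phrase) =~ s/[^\s\w]//g;

print "union:   [$union]\n";
print "negated: [$negated]\n";
```

The first substitution leaves an empty string, and the second leaves only the words and the spaces between them.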
Re: ^\s not equal \S?
by Abigail-II (Bishop) on Dec 04, 2003 at 09:54 UTC
$ perl -Dr -ce '/[\S*\W*]/'
Compiling REx `[\S*\W*]'
size 13 Got 108 bytes for offset annotations.
first at 1
1: ANYOF[\0-\377!utf8::IsSpacePerl !utf8::IsWord](13)
13: END(0)
stclass `ANYOF[\0-\377!utf8::IsSpacePerl !utf8::IsWord]' minlen 1
Offsets: [13]
1[8] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 9[0]
Omitting $` $& $' support.
EXECUTING...
-e syntax OK
Freeing REx: `"[\\S*\\W*]"'
Now, pay attention to the 'ANYOF' part. It includes all the ASCII and LATIN-1 characters (it also includes lots of Unicode characters, but that's not important right now).
Abigail
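The -Dr switch only works on a perl built with debugging support, so as a rough alternative, here is a sketch that checks the same claim at run time by testing every code point from 0 to 255 against the class. The @missed array is just an illustrative name.

```perl
use strict;
use warnings;

# Collect any code point in 0..255 that the class fails to match.
my @missed = grep { chr($_) !~ /[\S*\W*]/ } 0 .. 255;

printf "characters not matched: %d\n", scalar @missed;
```

Since no character is both whitespace and a word character, the list of misses should come back empty.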
Re: ^\s not equal \S?
by allolex (Curate) on Dec 04, 2003 at 09:18 UTC
Here's some data to illustrate what is happening. I think davido's explanation really hits the nail on the head, and Anonymous Monk's suggestion of YAPE::Regex::Explain will help you isolate what is happening in your experiments. BTW, including an asterisk in a character class is sometimes redundant; at other times it will make every asterisk disappear from your input, which is not what you intended, I think.
ladoix% cat 312141.pl
#!/usr/bin/perl
use strict;
use warnings;
my $phrase1 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase1 =~ s/[\S\W]*//g;
print "P1: [$phrase1]\n";
my $phrase2 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase2 =~ s/[\S\w]*//g;
print "P2: [$phrase2]\n";
my $phrase3 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase3 =~ s/[\s\W]*//g;
print "P3: [$phrase3]\n";
my $phrase4 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase4 =~ s/[\s\w]*//g;
print "P4: [$phrase4]\n";
my $phrase5 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase5 =~ s/[^\s\w]*//g;
print "P5: [$phrase5]\n";
my $phrase6 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase6 =~ s/[^\S\w]*//g;
print "P6: [$phrase6]\n";
my $phrase7 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase7 =~ s/[^\s\W]*//g;
print "P7: [$phrase7]\n";
my $phrase8 = "This is a test, \"using quotes of 'two different' types.\"";
$phrase8 =~ s/[^\S\W]*//g;
print "P8: [$phrase8]\n";
ladoix% perl 312141.pl
P1: []
P2: [         ]
P3: [Thisisatestusingquotesoftwodifferenttypes]
P4: [,"''."]
P5: [This is a test using quotes of two different types]
P6: [Thisisatest,"usingquotesof'twodifferent'types."]
P7: [   , "   ' ' ."]
P8: [This is a test, "using quotes of 'two different' types."]
Re: ^\s not equal \S?
by Anonymous Monk on Dec 04, 2003 at 08:50 UTC
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/[\S*\W*]/)->explain;
__END__
The regular expression:
(?-imsx:[\S*\W*])
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  [\S*\W*]                 any character of: non-whitespace (all but
                           \n, \r, \t, \f, and " "), '*', non-word
                           characters (all but a-z, A-Z, 0-9, _), '*'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
20031204 Edit by Corion: Changed PRE tags to CODE tags
Re: ^\s not equal \S?
by Anonymous Monk on Dec 04, 2003 at 15:04 UTC
Welcome to Boolean Algebra!
[^\s\w] = NOT (space OR word)
[\S\W] = (NOT space) OR (NOT word)
(NOT space) OR (NOT word) = NOT (space AND word)
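The identities above can be checked mechanically. A minimal sketch, restricted to the ASCII range as a simplifying assumption, that compares each character class against the boolean expression it is supposed to equal ($mismatches is an illustrative name):

```perl
use strict;
use warnings;

my $mismatches = 0;
for my $code (0 .. 127) {
    my $c = chr $code;
    my $is_space = $c =~ /\s/ ? 1 : 0;
    my $is_word  = $c =~ /\w/ ? 1 : 0;

    # [^\s\w] should agree with NOT (space OR word).
    my $negated = $c =~ /[^\s\w]/ ? 1 : 0;
    $mismatches++ if $negated != (!($is_space || $is_word) ? 1 : 0);

    # [\S\W] should agree with NOT (space AND word).
    my $union = $c =~ /[\S\W]/ ? 1 : 0;
    $mismatches++ if $union != (!($is_space && $is_word) ? 1 : 0);
}
print "mismatches: $mismatches\n";
```

Because no character is both a space and a word character, NOT (space AND word) is true everywhere, which is why [\S\W] matches everything.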
|
Well, I finally did it. I ++'d an anonymous post. Good answer. I was going to say something like:
[^\s\w] means neither spaces nor words, or, everything but spaces and words, where
[\S\W] means non-spaces AND non-words, and since a space is a non-word, the meaning is similar to "non-spaces and spaces".
However, ^\s and \S should be the same, so...
$phrase =~ s/[^\s*\w*]//g;
would do the same thing as...
$phrase =~ s/[\S]|[\W]|[^*]//g;
Update: No, it wouldn't. :: smacks self in head ::
|
Wait a minute.... I just tried your example and it still wipes out the string. It's like a nice big OR isn't it? So isn't s/[\S]|[\W]//g really the same as s/[\S\W]//g?
Edit: A negative demark for pointing this out? Bleh, now that isn't exactly fair. Oh well, I still haven't quite worked out what benefits voting would have.
Anyhow, I've fiddled with the regexes and the response from delirium still doesn't quite fit in with what everything else tells me. Plugging the two different regexes into the script yields completely different results. And given the examples by anon, it would naturally make sense.
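On the question of whether s/[\S]|[\W]//g is really the same as s/[\S\W]//g, here is a quick sketch that applies both to the original phrase and compares the results. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $phrase = "This is a test, \"using quotes of 'two different' types.\"";

# Alternation of two single-member classes...
(my $alternation = $phrase) =~ s/[\S]|[\W]//g;

# ...versus one class containing both members.
(my $one_class = $phrase) =~ s/[\S\W]//g;

print $alternation eq $one_class ? "same result\n" : "different result\n";
print "both empty\n" if $alternation eq '' && $one_class eq '';
```

Both forms act as one big OR over single characters, so both wipe out the string.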
Re: ^\s not equal \S?
by SavannahLion (Pilgrim) on Dec 04, 2003 at 18:08 UTC
Oooohh, I get it now.