a match question

jinqiyi has asked for the wisdom of the Perl Monks concerning the following question:

This is an example from the 'Perl Cookbook':

sub dequote;
$poem = dequote <<EVER_ON_AND_ON;
    Now far ahead the Road has gone,
      And I must follow, if I can,
    Pursuing it with eager feet,
      Until it joins some larger way
    Where many paths and errands meet.
      And whither then? I cannot say.
              --Bilbo 
EVER_ON_AND_ON
print "Here's your poem: \n\n$poem\n";
sub dequote {
    local $_ = shift;
    
    my ($white,$leader);
    if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/){
        ($white,$leader) = ($2,quotemeta($1)); 
    }else {
        ($white,$leader) = (/^(\s+)/,'');
    }
    
    s/^\s*?$leader(?:$white)?//gm;
    return $_;
}

My question is: what does the [^\w\s\]+ match in the string?


Replies are listed 'Best First'.

Re: a match question
by davido (Cardinal) on Mar 12, 2011 at 06:51 UTC

Break it down into its components. [] builds a character class. Inside you've got \w and \s, which on their own match a word character and a whitespace character, respectively. You've got an extra backslash in your question (just before the closing bracket) which isn't present in the Cookbook regular expression. So get rid of that; it's not really part of your intended question.

Now, we've just about got it figured out, except there's that pesky ^ character, which, as the first character inside the brackets, negates the character class. So instead of matching word and whitespace characters, the class now matches any single character that is NOT a word character and NOT a whitespace character.

The final step is the +, which tells perl to match one or more times. So it must match a run of one or more consecutive characters, none of which is a word or whitespace character.

Word characters are usually A-Za-z_0-9 (A to Z caps and lowers, plus underscore and digits zero through nine). Space characters are what we generally think of as 'whitespace'.
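To see those categories in action, here's a small sketch. The `classify` sub and the sample characters are my own, made up for illustration; they're not from the Cookbook. Only characters that come out as 'other' can be matched by [^\w\s].

```perl
use strict;
use warnings;

# Hypothetical helper: report how a single character is classified.
sub classify {
    my ($ch) = @_;
    return 'word'  if $ch =~ /\A\w\z/;   # letters, digits, underscore
    return 'space' if $ch =~ /\A\s\z/;   # whitespace
    return 'other';                      # what [^\w\s] can match
}

# Label/character pairs, so whitespace prints readably.
my @samples = (
    ['a', 'a'], ['Z', 'Z'], ['_', '_'], ['7', '7'],
    ['space', ' '], ['tab', "\t"],
    ['-', '-'], ['#', '#'], ['!', '!'],
);

for my $pair (@samples) {
    my ($label, $ch) = @$pair;
    printf "%-6s %s\n", $label, classify($ch);
}
```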

So to summarize, match any character that is neither a word character nor a whitespace character, and match as many of them in a row as possible, with a minimum of one.
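If you want to try that on some strings, something like the following should do. It's a minimal sketch: the `first_symbol_run` sub and the test strings are invented for this post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run of characters that are
# neither word characters nor whitespace, or undef if there is none.
sub first_symbol_run {
    my ($str) = @_;
    return $str =~ /([^\w\s]+)/ ? $1 : undef;
}

for my $str ('hello, world!', '  ## comment', '--Bilbo', 'a != b', 'plain words') {
    my $run = first_symbol_run($str);
    printf "%-15s => %s\n", $str, defined $run ? $run : '(no match)';
}
```

Note that the match is greedy, so '## comment' yields both hash marks rather than just the first, and a string made only of word and whitespace characters doesn't match at all.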

Update: Are you sure that's exactly the example from the Perl cookbook? I don't have my old copy handy, but that sub dequote; right at the beginning looks odd to me, though it's not actually a problem. It just seems odd given the context.
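For what it's worth, one plausible reason for a forward declaration like that is that it lets the sub be called without parentheses before its definition has been seen, which is exactly what the here-document call does. A minimal sketch, using a made-up `shout` sub rather than the Cookbook's code:

```perl
use strict;
use warnings;

sub shout;    # forward declaration: tells perl that shout is a sub

# Because shout is already declared, this parses as shout(<<"END").
my $text = shout <<"END";
hello
END

print $text;

# The actual definition comes after the call.
sub shout {
    my ($s) = @_;
    return uc $s;
}
```

Without the declaration, perl wouldn't yet know at that point that `shout` names a subroutine, so you would normally have to write the call with parentheses instead.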

Update 2: You could also use YAPE::Regex::Explain to get an explanation yourself, or to verify the accuracy of my description. ;)

use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('[^\w\s]+')->explain();

And the relevant output:

[^\w\s]+                 any character except: word characters (a-
                         z, A-Z, 0-9, _), whitespace (\n, \r, \t,
                         \f, and " ") (1 or more times (matching
                         the most amount possible))

Ta-da!

Dave
