Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
Could you please explain the following regexp to me? I don't think I am using it right. Is it only checking for words? What about compound words such as "tear-down"? I need to check for those too.
while ($string =~ /((\w|')+)/g) {
Thanks

Re: Regexp explanation
by Aragorn (Curate) on Apr 02, 2004 at 20:33 UTC
    The regex you're using matches a "word character" (any alphanumeric character or underscore) or a single quote, as many of them in a row as possible. If you also want it to match a dash, you have to say so:
    $string =~ /((\w|'|-)+)/g
    But that's a bit un-regex-like. You should use a character class in this situation:
    $string =~ /([-\w']+)/g
    But this also matches dashes and apostrophes at the beginnings or ends of words, which may or may not be what you want. If not, you can require that a dash or apostrophe sit between word characters:
    $string =~ /(\w+[-']?\w+)/g
    But this has the unwanted effect that a word must be at least 2 characters long. So we can add an alternation that also allows a single-character word (or number) when we can't match a word made of word characters with an optional dash or apostrophe between them:
    $string =~ /(\w+[-']?\w+|\w)/g
    A small complete test-case:
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    $/ = undef;
    my $string = <DATA>;
    while ($string =~ /(\w+[-']?\w+|\w)/g) {
        print "Word: <$1>\n";
    }
    __DATA__
    This is a sentence with words that're different from other words. They have apostrophes in them (') and dashes, or dash-like characters (-).
    Try running this code and compare the output with the sentence in the __DATA__ section.
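    To make the trade-offs between these patterns concrete, here is a minimal sketch that runs three of them over the same string. The helper name words_matching, the sample text, and the third pattern (which allows any number of interior dashes or apostrophes) are my own assumptions and are not part of the test case above.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re in $text.
# In list context, //g returns the captured text of all matches at once.
sub words_matching {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

# Made-up sample: a contraction, compound words, and a stray "-x-".
my $text = "don't tear-down by-the-house -x- a";

my @patterns = (
    [ 'character class'    => qr/([-\w']+)/         ],
    [ 'one interior mark'  => qr/(\w+[-']?\w+|\w)/  ],
    [ 'any interior marks' => qr/(\w+(?:[-']\w+)*)/ ],  # my own variant
);

for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    printf "%-19s %s\n", $label . ':', join ' | ', words_matching($re, $text);
}
```

    On this sample the character class keeps "-x-" with its dashes, the second pattern splits "by-the-house" into "by-the" and "house" because it allows only one interior mark, and the third variant keeps "by-the-house" whole while still reducing "-x-" to "x".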

    The ultimate guide (in my opinion) on regular expressions is Jeffrey Friedl's Mastering Regular Expressions, 2nd Edition.

    Arjen

      Thanks, I'll run the code now. There are so many possible ways to combine words into a single compound word that I need to test every case.
Re: Regexp explanation
by QM (Parson) on Apr 02, 2004 at 20:20 UTC
    while ($string =~ /((\w|')+)/g) {
    (\w|') matches either a word char or a single quote. Because the left paren here is the 2nd left paren in the regex, this captures into $2.

    ((\w|')+) matches one or more occurrences of (\w|'). Because of the first left paren, this captures into $1. $2 will be the same as the last char of $1.

    //g in scalar context finds the next match each time it is evaluated, so the while loop continues until all matches are found.
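    A small sketch of the points above, assuming a made-up sample string: it prints $1, $2 and pos() for each pass of the loop, so you can see that $2 only keeps the last repetition of the inner group and that each //g match resumes where the previous one stopped. The @seen array is only there for illustration.

```perl
use strict;
use warnings;

my $string = "it's a test";   # made-up sample text
my @seen;

while ($string =~ /((\w|')+)/g) {
    # $1 is the whole run; $2 is only the last repetition of the inner group.
    push @seen, "$1/$2";
    print "whole=<$1> last=<$2> pos=", pos($string), "\n";
}
# Prints:
#   whole=<it's> last=<s> pos=4
#   whole=<a> last=<a> pos=6
#   whole=<test> last=<t> pos=11
```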

    If you just want to add hyphenated words, try this:

    while ($string =~ /(([\w'-])+)/g) {...do something...}

    Update: To get everything surrounded by whitespace, use

    while ($string =~ /((\S)+)/g) {...do something...}
    but this captures more than \w was doing. Or use split. Or see perlre.
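    For comparison, a minimal sketch of the \S+ and split approaches side by side. The sample string and variable names are made up for illustration. split with a single-space string as its first argument is the special case that skips leading whitespace and splits on runs of whitespace.

```perl
use strict;
use warnings;

my $string = "  tear-down   by-the-house (it's) ";   # made-up sample text

# List-context //g returns every captured run of non-whitespace.
my @by_regex = $string =~ /(\S+)/g;

# split ' ' ignores leading whitespace and splits on whitespace runs.
my @by_split = split ' ', $string;

print "regex: ", join('|', @by_regex), "\n";
print "split: ", join('|', @by_split), "\n";
```

    Both lines print tear-down|by-the-house|(it's), which also shows the caveat: the parentheses stay attached to the word because \S matches any non-whitespace character.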

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Regexp explanation
by duff (Parson) on Apr 02, 2004 at 20:25 UTC

    You might as well use a character class since you aren't being too discriminatory about the words you match (i.e., your RE will also match a "word" like foo'bar'baz)

    while ($string =~ /([\w'-]+)/g) { ... }

    Note that you have to put the dash character either first or last in the character class so that it isn't interpreted as a range.
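    A quick sketch of why the position of the dash matters, using three small classes I made up for illustration. The third one escapes the dash with a backslash, which is another way to keep it literal and is my addition, not something from the note above.

```perl
use strict;
use warnings;

my $range   = qr/^[a-z]$/;    # dash between two characters: the range a..z
my $literal = qr/^[az-]$/;    # dash last: just the characters a, z and -
my $escaped = qr/^[a\-z]$/;   # escaped dash: also just a, z and -

for my $char ('m', '-') {
    printf "%-3s range=%d literal=%d escaped=%d\n",
        $char,
        ($char =~ $range)   ? 1 : 0,
        ($char =~ $literal) ? 1 : 0,
        ($char =~ $escaped) ? 1 : 0;
}
```

    Here "m" matches only the range version, while "-" matches only the two classes where the dash is literal.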

Re: Regexp explanation
by matija (Priest) on Apr 02, 2004 at 20:11 UTC
    Your regexp would accept any combination of letters (A-Za-z in ASCII, but it depends on your locale), digits, underscores, and single quotes. If you want it to allow hyphenated words, possibly surrounded by single quotes, you could change it to
    /('?[\w-]+'?)/g
      I want my string to accept compound words that have a "-" character between their parts, such as "tear-down" or "by-the-house". They don't need to be surrounded by quotes. Maybe I should just get everything separated by blanks in a sentence?
        So then why did you put the quotes in your regex?

        Never mind. It is important to remember that Perl regexp quantifiers are greedy: they match as much as possible. Therefore, to match a word, you only need to specify what a word is; there is no need to define what its delimiters might be.

        That simplifies the regexp to:

        /([\w-]+)/g
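        To see the greediness at work, here is a minimal sketch with a made-up sample string. The non-greedy variant with +? is only there for contrast and is my own addition.

```perl
use strict;
use warnings;

my $string = "tear-down by-the-house";   # made-up sample text

# Greedy +: each match runs until the first character outside [\w-].
my @greedy = $string =~ /([\w-]+)/g;

# Non-greedy +?: each match stops after a single character.
my @lazy = $string =~ /([\w-]+?)/g;

print "greedy: ", join(' | ', @greedy), "\n";   # tear-down | by-the-house
print "lazy matches: ", scalar(@lazy), "\n";    # 21, one per character
```

        The space is never part of a match because it is not in the class, so the greedy version needs no explicit delimiter.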
Re: Regexp explanation
by Anonymous Monk on Apr 02, 2004 at 20:32 UTC
    Thanks, it is working
    About my RE: I did not mention that I had already removed all the punctuation marks such as '"?/!, etc.
    Thanks monks
Re: Regexp explanation
by eXile (Priest) on Apr 03, 2004 at 17:54 UTC
    In addition to all the answers you already have: when understanding/learning regex-en, YAPE::Regex::Explain can be very useful:
    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new(qr/((\w|')+)/)->explain();
    __DATA__
    The regular expression:

    (?-imsx:((\w|')+))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        (                        group and capture to \2 (1 or more times
                                 (matching the most amount possible)):
    ----------------------------------------------------------------------
          \w                       word characters (a-z, A-Z, 0-9, _)
    ----------------------------------------------------------------------
         |                        OR
    ----------------------------------------------------------------------
          '                        '\''
    ----------------------------------------------------------------------
        )+                       end of \2 (NOTE: because you're using a
                                 quantifier on this capture, only the
                                 LAST repetition of the captured pattern
                                 will be stored in \2)
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------