kotoko has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to extract "grey heron" from the following string:
"ERROR: Sequence grey heron consists entirely of undetermined values which will be treated as missing data"

The problem is that the name is totally unpredictable, so the regex needs to work for something like "humming bird" and all other multi-word names of living beings.

I think what I want is "get all characters between Sequence and consists", but I have no idea how to express that.

Replies are listed 'Best First'.
Re: Regxp: signaling when to stop
by Zaxo (Archbishop) on Jun 20, 2007 at 15:09 UTC

    If your idea is sufficient, here's how to do it:

    my ($name) = m/sequence (.*?) consists/i;
    assuming that $_ contains the line you're parsing. If whitespace is inconsistent, you can canonicalize it with,
    $_ = join ' ', split;
    before applying the match.

    After Compline,
    Zaxo
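
To make the two steps above concrete, here is a minimal, self-contained sketch that first normalizes whitespace and then applies the non-greedy match. The variable names ($line, $clean, $name) and the irregularly spaced sample line are assumptions for illustration; the reply itself works on $_.

```perl
use strict;
use warnings;

# Sample line from the question, with deliberately irregular whitespace (assumed input).
my $line = "ERROR:  Sequence   grey  heron consists entirely of undetermined values";

# Collapse runs of whitespace to single spaces and drop leading whitespace.
# split ' ', $line is the explicit form of a bare split on $_.
my $clean = join ' ', split ' ', $line;

# Capture everything between "sequence " and the first " consists" (case-insensitive).
my ($name) = $clean =~ /sequence (.*?) consists/i;

print defined $name ? "$name\n" : "no match\n";
```

Checking defined $name before use is a small safeguard: when the pattern does not match, the list assignment leaves $name undefined.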

Re: Regxp: signaling when to stop
by Sidhekin (Priest) on Jun 20, 2007 at 15:20 UTC

    Not just to be contrary, but this is no place for the non-greedy quantifiers. What if the "unpredictable sequence" contains the string "consists"?

    On the assumption that the rest of the string is less "unpredictable", you want to match everything on this line between the first "sequence" and the last "consists". Hence my suggestion:

    my ($name) = m/Sequence (.*) consists/;

    (If the rest of the string may be as unpredictable, I'd suggest you write a full parser instead. Looks like there's a fair chance you won't have to though.)

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!
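
The difference between the two quantifiers only shows up when the name itself contains "consists". The sketch below uses a made-up name for that purpose; the input line and the variable names $lazy and $greedy are assumptions, not data from the original question.

```perl
use strict;
use warnings;

# Hypothetical name that happens to contain the word "consists".
my $line = "ERROR: Sequence bird that consists of feathers consists entirely of undetermined values";

# Non-greedy: stops at the first " consists" after "Sequence ".
my ($lazy) = $line =~ /Sequence (.*?) consists/;

# Greedy: runs to the last " consists" on the line.
my ($greedy) = $line =~ /Sequence (.*) consists/;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
```

Here the non-greedy version captures "bird that", while the greedy version captures "bird that consists of feathers". Which one is right depends on which side of the name is more predictable.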

Re: Regxp: signaling when to stop
by Fletch (Bishop) on Jun 20, 2007 at 15:06 UTC

    Consult the docs for "non-greedy" quantifiers.

    /Sequence \s+ (.*?) \s+ consists/x

Re: Regxp: signaling when to stop
by Trizor (Pilgrim) on Jun 20, 2007 at 15:14 UTC

    Simply use capturing groups like this (you also want a non-greedy quantifier, as other posts have pointed out):

    # The regex, commented for your convenience.
    $Error =~ /Sequence\s   # opening anchor, don't capture
               (            # open capturing group to get target pattern
                 .+?        # target pattern, non-greedy catch-all
               )            # close group
               \sconsists   # closing anchor, don't capture
              /x;           # end regex

    After that your target name will be inside $1. If you want to place it all on one line, you can use =~ in list context and get the results as a list like this:

    ($Missing_Living_Being) = $Error =~ /Sequence\s(.+?)\sconsists/;

    Just be sure that the list you're assigning to has the same number of elements as there are capturing groups (the parts of the pattern inside parens), or else you'll run into trouble with values going to the wrong place. If you're paranoid about this, you could assign to an array and shift it to get the first group, but that adds a lot of unnecessary overhead.

    Edit: Forgot a semi-colon.
    Edit 2: Made the assumption that all living beings have two word names.
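
As a small illustration of the list-context point above, the sketch below contrasts list and scalar assignment and shows how captures line up with variables when there are two groups. The variable names ($error, $name, $matched, $first, $rest) and the two-group pattern are assumptions added for the example.

```perl
use strict;
use warnings;

my $error = "ERROR: Sequence humming bird consists entirely of undetermined values";

# List context (note the parentheses): the match returns the captured groups.
my ($name) = $error =~ /Sequence\s(.+?)\sconsists/;

# Scalar context: the match returns only a true/false value, not the capture.
my $matched = $error =~ /Sequence\s(.+?)\sconsists/;

# Two groups, two variables: captures are assigned left to right.
my ($first, $rest) = $error =~ /Sequence\s(\S+)\s(.+?)\sconsists/;

print "name:    $name\n";
print "matched: $matched\n";
print "first:   $first, rest: $rest\n";
```

Forgetting the parentheses around $name is the usual way this goes wrong: the variable then holds the match's truth value instead of the captured text.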