eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Bonjour Monks...

I have a very newbie question. I am trying to write a regex that will extract the text between
QBlastInfoEnd -->
and
</form>
I tried doing this:
$re = qr[QBlastInfoEnd \s+ --> ([.\n]*) \s* </form>]x; $html =~ $re; print $1;
but it doesn't work; $1 is empty. Can someone show me an obvious way to do this? Or do I have some other issue causing the problem?

Thanks...
~evan

Re: regex
by tall_man (Parson) on Jun 19, 2003 at 01:49 UTC
    The "s" option might help you, because it makes "." match "\n" as well.
    $re = qr[QBlastInfoEnd \s+ --> (.*) </form>]xs;
    The "\s*" was redundant because ".*" greedily matches any whitespace, too. If there's more than one "</form>" in your file, watch out: ".*" will match up to the last one. See Death to Dot Star!
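    As a rough sketch of the greedy pitfall described above, the snippet below runs a greedy and a non-greedy capture against the same text. The sample string in $html is made up for illustration and is not the poster's real page.

```perl
use strict;
use warnings;

# Hypothetical sample input containing two "</form>" tags.
my $html = join "\n",
    '<!-- QBlastInfoEnd',
    '-->',
    'first block',
    '</form>',
    'second block',
    '</form>',
    '';

# Greedy: with /s, ".*" crosses newlines and runs to the LAST "</form>".
my ($greedy) = $html =~ m{QBlastInfoEnd \s+ --> (.*) </form>}xs;

# Non-greedy: ".*?" stops at the FIRST "</form>" after the marker.
my ($lazy) = $html =~ m{QBlastInfoEnd \s+ --> (.*?) </form>}xs;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

    With this input the greedy capture swallows the first "</form>" and the second block, while the non-greedy one returns only the first block.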
Re: regex
by artist (Parson) on Jun 19, 2003 at 02:02 UTC
    The 's' option should help.
    Assuming you want a non-greedy search,
    $re = qr[$start(.*?)$end]s;
    should do the job, where $start and $end hold the two delimiter strings.
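    A minimal sketch of that idea, assuming $start and $end hold the two delimiter strings from the question. The \Q...\E quoting is an addition here, so that any metacharacters in the delimiters are taken literally; the sample text is hypothetical.

```perl
use strict;
use warnings;

# Assumed delimiters, copied from the question.
my $start = 'QBlastInfoEnd -->';
my $end   = '</form>';

# \Q...\E quotes the interpolated strings; /s lets "." match newlines.
my $re = qr/\Q$start\E(.*?)\Q$end\E/s;

# Hypothetical input with a second "</form>" further along.
my $html = "x QBlastInfoEnd -->\nline one\nline two\n</form> y </form>";

if ( $html =~ $re ) {
    print "captured: [$1]\n";
}
```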

    Also read perldoc perlre. Here is an extract:
    m : Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

    s : Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

    The "/s" and "/m" modifiers both override the "$*" setting. That is, no matter what "$*" contains, "/s" without "/m" will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string.
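    The difference between the two modifiers can be checked with a small test like the following. The two-line string is made up for illustration.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\n";    # hypothetical two-line string

# Without /s, "." will not match the newline between the words.
my $dot_plain = $text =~ /alpha.beta/  ? 1 : 0;
# With /s, "." matches the newline too.
my $dot_s     = $text =~ /alpha.beta/s ? 1 : 0;

# Without /m, "^" matches only at the very start of the string.
my $caret_plain = $text =~ /^beta/  ? 1 : 0;
# With /m, "^" also matches just after an embedded newline.
my $caret_m     = $text =~ /^beta/m ? 1 : 0;

print "dot: $dot_plain $dot_s, caret: $caret_plain $caret_m\n";
```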

    artist

Re: regex
by TomDLux (Vicar) on Jun 19, 2003 at 03:12 UTC

    Actually, my first thought on reading this was to make sure that you read in the whole document, or at least all of the relevant section. If you're reading line by line, you can match on a range of lines instead:

    if ( m|QBlastInfoEnd|...0 ) {    # continue till end of file
        if ( m|-->|...m|</form>| ) {
            print;
        }
    }
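    To try the range approach without a real file, here is a self-contained sketch that reads a made-up page from an in-memory filehandle. The sample text and the @kept array are assumptions for illustration; the two range tests are the ones shown above.

```perl
use strict;
use warnings;

# Hypothetical input standing in for the saved page.
my $page = <<'END_HTML';
junk -->
</form>
<!-- QBlastInfoEnd
-->
<p>hit 1</p>
<p>hit 2</p>
</form>
trailer
END_HTML

open my $fh, '<', \$page or die "cannot open in-memory file: $!";

my @kept;
while (<$fh>) {
    # Outer range: opens at the marker line. The constant 0 is compared
    # with $. (the line number), which is never 0 here, so it stays open.
    if ( m|QBlastInfoEnd| ... 0 ) {
        # Inner range: from the line with "-->" through the line with "</form>".
        if ( m|-->| ... m|</form>| ) {
            push @kept, $_;
        }
    }
}
close $fh;

print @kept;
```

    Note that the range is inclusive, so the "-->" and "</form>" lines themselves are kept; they would need trimming if only the text in between is wanted.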

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: regex
by Nkuvu (Priest) on Jun 19, 2003 at 03:19 UTC

    Huh. My first thought is that your qr operator is using [ and ] as delimiters. And you have [.\n] in your regex. Seem like a problem?

    I only briefly read the other replies, to see if someone else pointed this out. So this might not be the problem at all -- or the only problem. But...

      No, perl seems to be able to parse the inner "[]" correctly even with the same characters as outer delimiters. The real problem with the original pattern is that "." is not a metacharacter inside character classes. The given class matched only real "."s and newlines. That's why I suggested the "s" option.

      According to Mastering Regular Expressions, p. 10:

      ...to be clear, the only special characters within the class in [0-9A-Z_!.?] are the two dashes.
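      A quick way to see this is to run the class on its own. The strings below are hypothetical; the second test uses the pattern from the question (written with m{} delimiters) against a small made-up input.

```perl
use strict;
use warnings;

my $sample = "a.b\n..\n";    # hypothetical string

# Inside [...] the dot is literal, so this class matches only "." and "\n".
my @class_runs = $sample =~ /([.\n]+)/g;

# The capture from the question can only cover dots and newlines,
# so ordinary text between the markers makes the whole match fail.
my $html    = "QBlastInfoEnd\n-->\ntext\n</form>";
my $matched = $html =~ m{QBlastInfoEnd \s+ --> ([.\n]*) \s* </form>}x ? 1 : 0;

print scalar(@class_runs), " runs; original pattern matched: $matched\n";
```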

        Wow, learn something new every day. I looked it up in perldoc perlop and found this relevant bit:

        Non-bracketing delimiters use the same character fore and aft, but the four sorts of brackets (round, angle, square, curly) will all nest, which means that q{foo{bar}baz} is the same as 'foo{bar}baz'
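        The nesting rule is easy to confirm with a couple of one-off checks. The qr[] pattern here is a made-up example, not the one from the question.

```perl
use strict;
use warnings;

# Bracketing delimiters nest, so the inner braces are part of the string.
my $quoted = q{foo{bar}baz};
print "same string\n" if $quoted eq 'foo{bar}baz';

# Likewise, qr[...] can contain a balanced [...] character class.
my $re = qr[a [.\n]+ b]x;
print "class parsed\n" if "a..\n.b" =~ $re;
```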

        And if I'd taken more than a second to glance at it, I would have noticed the non-metacharacter '.' inside the square brackets. I'm familiar with that particular pitfall, having done it in my own code many times. But thanks for pointing it out.