Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How would I get *everything* between two HTML tags with a simple regex if *everything* includes newlines? I've tried this, but I get an unmatched() error:

if($stuff =~ /<head>([.\n]+)</\head>){ $what_i_want = $1; }

Thanks for the help.

Replies are listed 'Best First'.
Re: a regex to parse html tags
by wog (Curate) on Jul 05, 2001 at 06:54 UTC
    First, use the HTML::Parser module or HTML::TokeParser, or in your case probably the HTML::HeadParser module, to parse the HTML. Regular expressions won't work reliably. What if you have an HTML document like this:
    <!-- I changed this, it was just <head><title></title></head> - djb (03 Jul 2001) -->
    <head>
    <title>Blah</title>
    <meta name="DESCRIPTION" value="About </head> tags.">
    </head>
    That's a whole lot harder to parse with a regular expression.
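    A minimal sketch of what goes wrong, assuming the document above is held in a hypothetical variable $html and a naive non-greedy pattern is applied to it. The pattern latches onto the <head> inside the comment rather than the real one:

```perl
use strict;
use warnings;

# The example document from above, reproduced in a heredoc.
# $html and $captured are illustrative names, not from the original post.
my $html = <<'END';
<!-- I changed this, it was just <head><title></title></head> - djb (03 Jul 2001) -->
<head>
<title>Blah</title>
<meta name="DESCRIPTION" value="About </head> tags.">
</head>
END

# Naive pattern: first <head>, then as little as possible up to the next </head>.
my ($captured) = $html =~ m{<head>(.*?)</head>}s;

# The first <head> in the text sits inside the HTML comment,
# so the capture is the commented-out content, not the real head.
print "captured: [$captured]\n";
```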

    That said, [.\n] creates a character class matching a period or a newline. Inside []s, a . is not special. You could use the /s modifier (see perlre) and just use .+ instead. The /s modifier makes . match newlines as well. Another way is to use (?:.|\n), which is the same as (.|\n) except that it doesn't capture anything (into the $<digit> variables).
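    The three variants can be compared side by side. This is only a sketch: the sample string and variable names are made up for illustration, and it uses .*? rather than .+ so that matching stops at the first </head>.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $stuff = "<html><head>\n<title>Blah</title>\n</head><body></body></html>";

# [.\n] is a class of a literal period and a newline, so the match
# fails as soon as any other character appears between the tags.
my $class_matches = $stuff =~ m{<head>([.\n]+)</head>} ? 1 : 0;

# With /s the dot also matches newlines.
my ($with_s) = $stuff =~ m{<head>(.*?)</head>}s;

# Without /s, (?:.|\n) covers newlines and adds no extra capture group.
my ($with_alt) = $stuff =~ m{<head>((?:.|\n)*?)</head>};

print "class matched: $class_matches\n";
print "with /s:       [$with_s]\n";
print "with (?:.|\\n): [$with_alt]\n";
```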

    Also, you need to escape / in regexes if you are using / as the delimiter, like this: \/. Or you can avoid that ugliness by using an alternate delimiter (like m!regex goes here! or m(regex)). I assume the lack of a / at the end of your regex is an error made in posting your code here.
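    As a small illustration of the delimiter choices (the input string and variable names are assumptions for the example), all three forms below capture the same text:

```perl
use strict;
use warnings;

my $stuff = "<head>\n<title>Blah</title>\n</head>";   # hypothetical input

# Slash delimiters: the / in </head> has to be escaped.
my ($escaped) = $stuff =~ /<head>(.+)<\/head>/s;

# Alternate delimiters: the / needs no escaping.
my ($bang)  = $stuff =~ m!<head>(.+)</head>!s;

# Bracketing delimiters nest, so the capture parentheses still work.
my ($paren) = $stuff =~ m(<head>(.+)</head>)s;

print "same result\n" if $escaped eq $bang and $bang eq $paren;
```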

    update: To give another example of why not to use a regex: <head something="someattribute">...</head> won't be handled by a simple regex either.
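    A sketch of the attribute case, with a made-up input string. The loosened pattern at the end is an assumption added for illustration, not something from the original post, and it is still not a parser: an attribute value containing > would break it.

```perl
use strict;
use warnings;

# Hypothetical document whose <head> tag carries an attribute.
my $html = qq{<head something="someattribute">\n<title>Blah</title>\n</head>};

# The simple pattern requires the literal text "<head>", so it finds nothing.
my $simple_matches = $html =~ m{<head>(.*?)</head>}s ? 1 : 0;

# Allowing anything up to the closing > of the opening tag gets further.
my ($loose) = $html =~ m{<head\b[^>]*>(.*?)</head>}is;

print "simple pattern matched: $simple_matches\n";
print "loosened capture: [$loose]\n";
```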

    update 2: fixed typo of </head> where </title> was meant and other minor typos.

      Just one small addition to the problem of matching any character. The /s modifier is definitely the way to go, so that . matches everything including newlines.

      If for some reason you don't want to use the /s modifier - maybe you have other dots in your regex which should not match newlines - you should use a character class. The advantage over (?:.|\n) is that no backtracking has to be done.

      # character class matching any one character (\000 through \377 only)
      /[\000-\377]/
      # or, matching any character at all
      /[\d\D]/
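      A sketch of the mixed case described above, where one dot must stay on its line while another part of the pattern spans newlines. The sample text and variable names are assumptions for the example:

```perl
use strict;
use warnings;

# Hypothetical input: a first line followed by a multi-line body.
my $text = "key: value\nbody line 1\nbody line 2\n";

# No /s here, so (.*) cannot cross the first newline,
# while [\d\D]* matches every remaining character, newlines included.
my ($first_line, $rest) = $text =~ /^(.*)\n([\d\D]*)\z/;

print "first line: [$first_line]\n";
print "rest:       [$rest]\n";
```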

      -- Hofmator

Re: a regex to parse html tags
by Anonymous Monk on Jul 05, 2001 at 06:56 UTC
    Use the s modifier. Here are two that usually work:
    if($stuff =~ /<head>(.*?)<\/head>/s){ $what_i_want = $1; }
    if($stuff =~ /<head>([^<]+)<\/head>/s){ $what_i_want = $1; }
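    To see where "usually" ends, here is a sketch that runs both patterns against two made-up inputs. The helper names lazy_dot and no_angle are hypothetical. The negated class cannot cross a <, so it only matches when the head contains no tags of its own:

```perl
use strict;
use warnings;

# Hypothetical inputs: one head with plain text, one with a nested tag.
my $plain  = "<head>\njust text\n</head>";
my $nested = "<head>\n<title>Blah</title>\n</head>";

# Lazy dot with /s: crosses newlines and stops at the first </head>.
sub lazy_dot {
    my ($s) = @_;
    my ($m) = $s =~ /<head>(.*?)<\/head>/s;
    return $m;    # undef when nothing matched
}

# Negated class: matches newlines but never a "<".
sub no_angle {
    my ($s) = @_;
    my ($m) = $s =~ /<head>([^<]+)<\/head>/s;
    return $m;    # undef when nothing matched
}

for my $input ($plain, $nested) {
    for my $try ([lazy => \&lazy_dot], [negated => \&no_angle]) {
        my ($name, $code) = @$try;
        my $got = $code->($input);
        print "$name: ", (defined $got ? "matched" : "no match"), "\n";
    }
}
```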
Re: a regex to parse html tags
by elusion (Curate) on Jul 05, 2001 at 06:36 UTC
    It would seem you're missing the closing slash on your regex.

    - p u n k k i d
    "Reality is merely an illusion, albeit a very persistent one." -Albert Einstein

Re: a regex to parse html tags
by dsb (Chaplain) on Jul 05, 2001 at 23:32 UTC
    You could perhaps try something with a negated character class:
    if ($var =~ m%<HEAD>([^<]+)</HEAD>%i) { print $1, "\n"; }
    The '^' at the beginning of the character class creates a negated character class, which essentially means:
    "Match anything that is not <"

    You could also use HTML::Parser and the like; however, if you are only trying to match the <HEAD> tags, then using the module might be unnecessary.
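    A short sketch of how the negated class suggested above behaves, using a made-up input. Two things it illustrates: a negated class matches newlines without needing /s, and /i lets <HEAD> match any capitalisation. It only works here because the text between the tags contains no < at all.

```perl
use strict;
use warnings;

# Hypothetical input with mixed-case tags and no tags inside the head.
my $var = "<Head>\nplain text only\n</Head>";

# [^<] means "any character except <", and that includes newlines.
my ($text) = $var =~ m%<HEAD>([^<]+)</HEAD>%i;

print defined $text ? "captured: [$text]\n" : "no match\n";
```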

    Amel - f.k.a. - kel

Re: a regex to parse html tags
by sierrathedog04 (Hermit) on Jul 05, 2001 at 14:48 UTC
    Add /gs to the end of your regex, e.g., s/hello/goodbye/gs

    g will perform multiple matches instead of just one (you said you wanted everything). s lets . match newlines, so the match can span lines.
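    A sketch of what each modifier contributes, with made-up sample strings and variable names. Note that /g matters only when there are several matches to collect; a single capture between one pair of tags needs only /s.

```perl
use strict;
use warnings;

# Hypothetical input: two bold spans, the second containing a newline.
my $text = "<b>one</b> and <b>two\nlines</b>";

# /g in list context returns every capture; /s lets . cross the newline.
my @with_gs = $text =~ /<b>(.*?)<\/b>/gs;

# Without /s the second span cannot match, because . stops at the newline.
my @with_g = $text =~ /<b>(.*?)<\/b>/g;

# The substitution form from above: /g replaces every occurrence.
my $greeting = "hello hello";
$greeting =~ s/hello/goodbye/gs;

print scalar(@with_gs), " matches with /gs, ", scalar(@with_g), " with /g alone\n";
print "$greeting\n";
```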

Re: a regex to parse html tags
by Anonymous Monk on Jul 05, 2001 at 06:53 UTC
    whoops! assume the closing slash is there