Regular Expression

by Anonymous Monk
on Jun 28, 2005 at 18:56 UTC ( [id://470757]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Using regular expressions, how can I match the body tag, whether it is simple like <body> or complex like <body onLoad="..." leftmargin=0 ...>? How can I match all of those with a regex?

Replies are listed 'Best First'.
Re: Regular Expression
by waswas-fng (Curate) on Jun 28, 2005 at 19:06 UTC
Re: Regular Expression
by Transient (Hermit) on Jun 28, 2005 at 19:01 UTC
        good
Re: Regular Expression
by davidrw (Prior) on Jun 28, 2005 at 19:23 UTC
    See this node: Regular Expressions for almost the exact same question.
    Why can't you use modules? The most robust way will be something like HTML::Parser -- look specifically at the examples section for extracting the <title> tag.
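
    A minimal sketch of the HTML::Parser approach, assuming the module is installed from CPAN. The sample document and the variable names ($body_tag, $body_attr) are made up for illustration; the handler simply records the first <body> start tag it sees.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input: a commented-out <body>, and an onLoad
# value that deliberately contains a '>'.
my $html = <<'HTML';
<html><head><title>t</title></head>
<!-- <body class="fake"> -->
<body onLoad="if (a > b) { go() }" leftmargin=0>
<p>Hello</p>
</body></html>
HTML

my ($body_tag, $body_attr);

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; only the first <body> is kept.
    start_h => [
        sub {
            my ($tagname, $attr, $text) = @_;
            return if defined $body_tag or $tagname ne 'body';
            $body_tag  = $text;   # original source text of the tag
            $body_attr = $attr;   # hash ref: attribute name => value
        },
        'tagname, attr, text',
    ],
);
$parser->parse($html);
$parser->eof;

print "tag: $body_tag\n";
print "$_ = $body_attr->{$_}\n" for sort keys %$body_attr;
```

    The parser reports comments as separate events, so the commented-out tag never reaches the start handler. Attribute names arrive lowercased by default, so the handler sees onload rather than onLoad.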

    For a one-time quick-and-dirty job, use a regex (this assumes, of course, that there isn't a > in the onLoad JavaScript):
    if( $html =~ /<body (.*?)>/si ){ my $body_attributes = $1; }
    Maybe something like this will help guard against javascript screwing up the match, but assumes proper quoting of the attributes:
    /<body((?:\s+(?:\w+=".*?"))*)>/si
    Update: added strike and bold after reading/noting ikegami's response
      I was directed to the second regexp in this post as a solution that fixes problems in another post, but it's no better.
      but assumes proper quoting of the attributes:

      The HTML spec allows for single quotes, and even allows for the quotes to be omitted in some circumstances, so no, it doesn't assume proper quoting.

      Also, it doesn't handle > inside of quotes (where it doesn't need to be escaped).

      Finally, it could locate <body> inside of a comment or inside of another attribute.
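
      For illustration only, here is one way a regex could cover the first two objections (single-quoted or unquoted values, and > inside quotes). The helper name find_body_attrs and the sub-patterns are made up for this sketch, and it does nothing about the third objection: it will still match a <body> inside a comment or inside another attribute.

```perl
use strict;
use warnings;

# Attribute value: double-quoted, single-quoted, or unquoted.
my $value = qr{ "[^"]*" | '[^']*' | [^\s"'>]+ }x;
# Attribute: a name, optionally followed by = value.
my $attr  = qr{ [^\s=>/"']+ (?: \s* = \s* $value )? }x;
# Opening body tag; the attribute text is captured in $1.
my $body_re = qr{ <body \b ( (?: \s+ $attr )* ) \s* > }xi;

# Returns the raw attribute text of the first <body ...> tag,
# or undef when no such tag is found.
sub find_body_attrs {
    my ($html) = @_;
    return $html =~ $body_re ? $1 : undef;
}

for my $html (
    '<body>',
    '<BODY onLoad="if (a > b) { x() }" leftmargin=0>',
    "<body class='a>b' bgcolor=white>",
) {
    my $attrs = find_body_attrs($html);
    print defined $attrs ? "attributes: [$attrs]\n" : "no match\n";
}
```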

Re: Regular Expression
by fmerges (Chaplain) on Jun 28, 2005 at 21:08 UTC

    Hi,

    I use this for example in Mason components filters:

    $html =~ s{(<body.*?)>}{$1 onLoad="window.print()">}is;

    Hope it can be helpful.

    Regards,

    :-)
Re: Regular Expression
by kwaping (Priest) on Jun 28, 2005 at 19:18 UTC
    my $body = $the_html_string; $body =~ s/(<body.*?>)/$1/sgi;
      What will happen when a <body> tag is included in comments? Your regex will break. I'm almost sure that the one who gave the instruction to find the <body> of the HTML code with a simple regex did not think about this.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        Interesting observation - do you often see body tags enclosed in comments? You are assuming the poster doesn't want body tags enclosed in comments. ;) In any case, I think this pattern is better (added ^):
        my $body = $the_html_string; $body =~ s/^.*?(<body.*?>)/$1/sgi;
      That worked. Thanks!
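
      A quick check of what the two substitutions in this subthread actually do, using a small made-up document. As far as I can tell, the first one replaces the tag with itself and so leaves the string unchanged, while the second also consumes everything before the tag and drops it. The last statement is an assumption about what may have been wanted: capturing just the tag with a plain match.

```perl
use strict;
use warnings;

# Hypothetical sample document.
my $the_html_string =
    qq{<html>\n<head></head>\n<body bgcolor="white">\n<p>Hi</p>\n</body>\n</html>\n};

# First version: the tag is replaced by itself, so nothing changes.
(my $unchanged = $the_html_string) =~ s/(<body.*?>)/$1/sgi;

# Second version: everything before the tag is matched too and removed.
(my $from_body = $the_html_string) =~ s/^.*?(<body.*?>)/$1/sgi;

# To get only the tag, capture it instead of substituting.
my ($tag) = $the_html_string =~ /(<body.*?>)/si;

print $unchanged eq $the_html_string ? "first: no change\n" : "first: changed\n";
print "second now starts with: ", substr($from_body, 0, 22), "\n";
print "tag: $tag\n";
```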
Re: Regular Expression
by l.frankline (Hermit) on Jun 29, 2005 at 14:55 UTC
    You can try something like this:

    $_ =~ /<body[^>]*>/;

    * Frank *
Re: Regular Expression
by Anonymous Monk on Jun 28, 2005 at 19:14 UTC
    Unfortunately, for my purposes I cannot use that module :-) I need to use a regex only.
      Homework? Or some fundamental religious rule which forbids you to handle HTML other than with a regex?

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        ROTFL.


        -Waswas
      <flame suit on>
      $html =~ /(<body[^>]+>)/s;

      $html =~ /(<body[^>]*>)/s;
      that should get you the opening tag.

      Update: Changed + to * as pointed out by kwaping
        That'll fail for <body onload="if (a > b) { ... }">. It can also fail if <body> is found in comments.
        And how do you find the closing tag?

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
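
        A rough sketch of one regex-only answer to both points, under loudly stated assumptions: comments are stripped first (assuming "<!--" never occurs inside a script or an attribute value), quoted attribute values may contain >, and the document has one <body> with a matching </body>. The function name body_content is hypothetical.

```perl
use strict;
use warnings;

# Returns the text between the opening and closing body tags,
# or undef when no such pair is found.
sub body_content {
    my ($html) = @_;

    # Drop comments first so a commented-out <body> cannot match.
    (my $clean = $html) =~ s/<!--.*?-->//gs;

    # Opening tag (quoted values may contain '>'), then lazily
    # everything up to the first closing </body>.
    return $clean =~ m{<body\b(?:[^>"']|"[^"]*"|'[^']*')*>(.*?)</body\s*>}si
        ? $1
        : undef;
}

my $html = '<!-- <body old> --><body onload="if (a > b) {}"><p>x</p></body>';
print body_content($html), "\n";
```

        This is still not a parser: script blocks, CDATA sections, and malformed markup can all defeat it, which is the argument for HTML::Parser made earlier in the thread.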

Node Type: perlquestion [id://470757]
Approved by kirbyk