Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to read the entire contents of a file between the two BODY tags and then save that output to a new file. I am using the following code but it will only work if the BODY tags are on the same line:
if (/\<body>(.*)<\/body\>/i){
    $body_temp = $_;
    $body_temp =~ s/(.*?)\<body\>(.*?)\<\/body\>/$2/i;
    chomp($body_temp);
    $body = "\<body\>" . $body_temp . "\<\/body\>";
    print OUTFILE $body . "\n";
    $found_body = 1;
}


[ Added code tags - ar0n ]

Replies are listed 'Best First'.
Re: How do I read the contents of an HTML file between two BODY tags?
by Ovid (Cardinal) on Apr 27, 2001 at 03:17 UTC
    First of all, you need to add an 's' to the end of the regular expression to get the dot to match newlines. Second, you really want to use something like HTML::Parser. HTML can vary widely in formatting and needs to be parsed. Regular expressions are best at matching text, not parsing it.
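A minimal sketch of the /s suggestion, assuming the original script reads the file one line at a time (which the use of $_ suggests). In that case /s alone is not enough, because each line is still matched separately; the whole document has to be in one string first. The helper name body_between_tags and the sample string are hypothetical, not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between <body> and </body>,
# or undef if no such pair is found.
sub body_between_tags {
    my ($html) = @_;
    # /s lets "." match newlines, /i ignores case,
    # and .*? stops at the first </body>.
    return $html =~ m{<body>(.*?)</body>}is ? $1 : undef;
}

# Sample document held in a string. A real script would slurp the file
# instead, for example with: my $html = do { local $/; <$fh> };
my $html = "<html>\n<BODY>\n<p>Hello</p>\n</BODY>\n</html>\n";
my $body = body_between_tags($html);
print "<body>$body</body>\n" if defined $body;
```

This still only matches a bare <body> tag and remains a regular expression, not a parser, so it is likely to break on real-world HTML.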
Re: How do I read the contents of an HTML file between two BODY tags?
by little (Curate) on Apr 27, 2001 at 17:07 UTC
    Also consider that the body tag can have attributes, so your pattern would fail to match the opening tag if you have a page with something like
    <body leftmargin="30" topmargin="50">
    I also recommend having a look at HTML::TokeParser.
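To illustrate the point about attributes, here is a rough sketch of a pattern that tolerates them, assuming the whole document is already in a single string. The helper name body_with_attributes is made up for this example. It is a regex workaround, not the HTML::TokeParser approach recommended above.

```perl
use strict;
use warnings;

# Hypothetical helper: like a plain <body>...</body> match, but the opening
# tag may carry attributes. \b keeps "<bodyguard>" from matching, and
# [^>]* skips everything up to the ">" that closes the tag.
sub body_with_attributes {
    my ($html) = @_;
    return $html =~ m{<body\b[^>]*>(.*?)</body\s*>}is ? $1 : undef;
}

my $page = qq{<html><body leftmargin="30" topmargin="50">\n<p>Hi</p>\n</body></html>};
my $inner = body_with_attributes($page);
print defined $inner ? $inner : "no body found", "\n";
```

The [^>]* part would stop too early if an attribute value itself contained a ">" character, which is one of the cases where a real parser is the safer choice.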
Re: How do I read the contents of an HTML file between two BODY tags?
by diarmuid (Beadle) on Apr 27, 2001 at 18:35 UTC
    Well, I would use a construct I came across recently. For example:

    open(HTML,"test.html") || die "can't open file\n";
    while(<HTML>){
        if(/<body.*?>/i ... /<\/body.*?>/i){
            print OUTFILE $_;
        }
    }
    close HTML;
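    The if (/regexp/ ... /regexp2/) uses the range operator in scalar context (the "flip-flop"): it becomes true on the line where the first regexp matches and stays true up to and including the line where the second regexp matches, so it works across lines.
    The .*? just matches any attributes inside the body tag, e.g. bgcolor.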

    sweet
    Diarmiuid
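The range construct described above can be tried without any files. This is only a sketch: the helper name lines_in_body and the sample lines are invented for the example, and the patterns are slightly stricter than the ones in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines selected by the three-dot range
# operator, from the line with the opening body tag through the line
# with the closing one.
sub lines_in_body {
    my @lines = @_;
    my @kept;
    for (@lines) {
        # True from the first match of the left pattern until
        # (and including) the next match of the right pattern.
        push @kept, $_ if /<body\b[^>]*>/i ... /<\/body\s*>/i;
    }
    return @kept;
}

my @sample = (
    "<html>\n",
    "<body bgcolor=\"white\">\n",
    "<p>one</p>\n",
    "</body>\n",
    "</html>\n",
);
print lines_in_body(@sample);
```

One thing to keep in mind: with three dots, the closing pattern is not tested on the same line that opened the range, so if both tags sit on one line the range stays open until a later </body> or the end of input. The two-dot form (..) tests both patterns on the same line.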

      I realize this thread is long dead, but I just stumbled across it and found it useful. However, that construct would include the body tags in the output. To exclude them, add another expression:
      open(HTML,"test.html") || die "can't open file\n";
      while(<HTML>){
          if(/<body.*?>/i ... /<\/body.*?>/i){
              s/\<\/*body.*?>//ig; # Strip body tags from $_
              print OUTFILE $_;
          }
      }
      close HTML;