
Hi Monks,

I need to parse some (externally generated) HTML and, ideally, extract the contents of the body to produce some new content.

(For the curious: the current production process, run by an outsourced resource in India, collects some files and concatenates them into a single one, complete with every individual doctype, html, head and body tag. We are lucky that it even displays something readable in a browser!)

So I thought about using HTML::TreeBuilder, but some of the individual files are themselves not well-formed: their content is already a complete document wrapped inside another one. (Sure, when all you have is a hammer, everything looks like a nail...) As a result, my attempt to get the body gives a weird result:

use strict;
use warnings;
use HTML::Tree;

my $tree = HTML::TreeBuilder->new;
$tree->warn(1);

my $content;
{
    local $/ = undef; #slurp
    $content = <DATA>;
}

$tree->parse($content);
$tree->eof; # required after parse(): tells TreeBuilder the input is complete
foreach my $tag ( $tree->look_down('_tag', 'body') ) {
    print "------\n" . $tag->as_HTML;
}

$tree = $tree->delete; 

__DATA__
<!DOCTYPE html>
<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
        <html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>
        <p>some kind of wrapped footer</p>
    </body>
</html>

Result (the two bodies seem to be merged into a single element):

HTML::Parse: Found a nested <html> element
HTML::Parse: Found a second <head> element
HTML::Parse: Found a second <body> element
------
<body><h1>Hello</h1><p>this is a test<p>this is a second test<p>some kind of wrapped footer </body>

How would you proceed to get the content of the inner html document? Would you use another package? I have looked through the options of HTML::Parser, which TreeBuilder uses, but did not see anything relevant.
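One workaround I have considered, but not validated against the real files, is to cut the innermost body out of the raw text before handing anything to TreeBuilder. This is only a minimal sketch: it assumes the wrapped document is always the last <body> opened in the file, and innermost_body is a name I made up for illustration, not something from HTML::Tree.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Tree): returns the text between
# the LAST <body ...> opening tag and the first </body> that follows it,
# or undef when no such pair exists.
sub innermost_body {
    my ($html) = @_;
    # The greedy .* moves past every earlier <body>, so the capture starts
    # after the last one; the lazy (.*?) stops at the nearest </body>.
    my ($inner) = $html =~ m{.*<body\b[^>]*>(.*?)</body\s*>}is;
    return $inner;
}

# Small demonstration on a reduced version of the sample document.
my $sample = '<html><body><html><body><h1>Hello</h1>'
           . '<p>this is a test</p></body></html>'
           . '<p>some kind of wrapped footer</p></body></html>';

my $fragment = innermost_body($sample);
print defined $fragment ? "$fragment\n" : "no body found\n";
```

The extracted fragment could then be given to $tree->parse as before. A regex like this is fragile (a literal "<body" inside a script or a comment would fool it), so I would still prefer a parser-level answer.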

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to Parsing incorrect html by seki
