Parsing incorrect html

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to parse some externally generated HTML and, ideally, extract the contents of the body to produce some new content.

(For the curious: the current production process, done by an Indian outsourced resource, collects some files and concatenates them into a single one (yes, with all the individual doctype, html, head and body tags). We are lucky that it even displays something readable in a browser!)

So I thought about using HTML::TreeBuilder, but some of the individual files are themselves not well-formed, with content already wrapped in another document (sure, when all you have is a hammer, everything looks like a nail...), so my attempt to get the body gives a weird result:

use strict;
use warnings;
use HTML::Tree;

my $tree = HTML::TreeBuilder->new;
$tree->warn(1);

my $content;
{
    local $/ = undef; #slurp
    $content = <DATA>;
}

$tree->parse($content);
foreach my $tag ( $tree->look_down('_tag', 'body') ) {
    print "------\n" . $tag->as_HTML;
}

$tree = $tree->delete; 

__DATA__
<!DOCTYPE html>
<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
        <html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>
        <p>some kind of wrapped footer</p>
    </body>
</html>

Result (the two bodies seem mixed in a single item):

HTML::Parse: Found a nested <html> element
HTML::Parse: Found a second <head> element
HTML::Parse: Found a second <body> element
------
<body><h1>Hello</h1><p>this is a test<p>this is a second test<p>some kind of wrapped footer </body>

How would you proceed to get the content of the inner HTML document? Would you use another package? I have looked through the options of HTML::Parser (used by TreeBuilder) but did not see anything relevant.

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Replies are listed 'Best First'.
Re: Parsing incorrect html by Discipulus (Canon) on Jun 07, 2017 at 14:56 UTC
Hello seki,

I tried with XML::Twig and I got quite good results (see the XML::Twig tutorial):

use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => {
        # $_[1] is the element
        'html/body/html' => sub{ $_[1]->print;}
    });

my $data =<<EOXML;
<!DOCTYPE html>
<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
        <html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>
        <p>some kind of wrapped footer</p>
    </body>
</html>
EOXML

$t->parse( $data);

Output:

<html>
  <head>
    <script>/*some embedded document*/</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>this is a test</p>
    <p>this is a second test</p>
  </body>
</html>

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Parsing incorrect html by marto (Cardinal) on Jun 07, 2017 at 14:53 UTC
"How would you proceed to get the content of the inner html document?"

This will get you the inner HTML document:

use feature 'say';
use Mojo::DOM;

my $html = '<!DOCTYPE html>
<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
        <html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>
        <p>some kind of wrapped footer</p>
    </body>
</html>';

my $dom = Mojo::DOM->new( $html );
say $dom->at('html html')->child_nodes->first->remove;

prints:

<html><head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>

Mojo::DOM is very powerful, should you wish to extract or manipulate any of the subsequent HTML.
Re: Parsing incorrect html by choroba (Cardinal) on Jun 07, 2017 at 16:39 UTC
Surprisingly, your weird HTML can be parsed even by the strict XML::LibXML:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::LibXML;

my $html = ...;

my $dom = 'XML::LibXML'->load_xml(string => $html);
for my $body ($dom->findnodes('//html/body')) {
    say '-' x 40;
    say for $body->findnodes('p');
}

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Parsing incorrect html by haukex (Archbishop) on Jun 07, 2017 at 14:36 UTC
It appears that Mojo::DOM can handle that particular example HTML. This prints "<p>this is a test</p>" and "<p>this is a second test</p>":

use warnings;
use strict;
use Mojo::DOM;

my $dom = Mojo::DOM->new( do { local $/; <DATA> } );
for my $e ($dom->find('html html > body p')->each) {
    print $e->to_string, "\n";
}

Update: Switched the above from finding the <h1> tag to the <p> tags, to show that it does not get confused like in your example.

Update 2: Added newline for clarity.
Re: Parsing incorrect html by tybalt89 (Monsignor) on Jun 07, 2017 at 15:11 UTC
Or extract each inner html section and parse it separately.

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1192280

use strict;
use warnings;

$_ = do { local $/; <DATA> };

# find inner html sections

my $nestcount = 1;
while( s:(
  <html>                # starting html
  ((?!</?html>).)*?     # no html or /html in the middle
  </html>               # ending /html
  ): NON_NESTED_HTML $nestcount :sx )
  {
  my $nonnestedhtml = $1;
  print "NON_NESTED_HTML $nestcount\n\n$nonnestedhtml\n\n";
  $nestcount++;
  # here you can parse an inner non-nested html section
  }

print "this is what's left\n\n$_\n\n";

__DATA__
<!DOCTYPE html>
<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
        <html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>
        <p>some kind of wrapped footer</p>
    </body>
</html>

Produces:

NON_NESTED_HTML 1

<html>
            <head>
                <script>/*some embedded document*/</script>
            </head>
            <body>
                <h1>Hello</h1>
                <p>this is a test</p>
                <p>this is a second test</p>
            </body>
        </html>

NON_NESTED_HTML 2

<html>
    <head>
        <script>/*some ugly header stuff*/</script>
    </head>
    <body>
         NON_NESTED_HTML 1
        <p>some kind of wrapped footer</p>
    </body>
</html>

this is what's left

<!DOCTYPE html>
 NON_NESTED_HTML 2

(parsing left as an exercise for the reader :)
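One possible take on that exercise, as a minimal core-Perl sketch rather than anything posted in the thread: it reuses the same "no html tag in the middle" idea, but instead of peeling layers off with repeated substitutions it only looks at the innermost <html> sections and returns what sits between their <body> tags. The helper name innermost_bodies and the shortened sample document are assumptions made for illustration, and a regex like this can be fooled by "<html" appearing inside scripts or comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text between
# <body> and </body> for every <html> section that contains no other
# <html> section, i.e. the innermost documents only.
sub innermost_bodies {
    my ($html) = @_;
    my @bodies;
    while ( $html =~ m{
        <html\b[^>]*>                    # opening html tag
        ( (?: (?! </?html\b ) . )*? )    # anything that is not another html tag
        </html\s*>                       # closing html tag
    }gsix ) {
        my $section = $1;
        # Assumption: at most one <body> per innermost section.
        if ( $section =~ m{<body\b[^>]*>(.*?)</body\s*>}si ) {
            push @bodies, $1;
        }
    }
    return @bodies;
}

# Shortened version of the sample document from the question.
my $sample = <<'HTML';
<!DOCTYPE html>
<html>
<head><script>/*some ugly header stuff*/</script></head>
<body>
<html>
<head><script>/*some embedded document*/</script></head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>
HTML

print "------\n$_\n" for innermost_bodies($sample);
```

For this sample the helper returns a single string holding the <h1> and the two <p> elements; the wrapped footer belongs to the outer document and is left out. Each returned fragment can then be handed to a real HTML parser.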
Re: Parsing incorrect html ("xml") by Anonymous Monk on Jun 07, 2017 at 23:33 UTC
See how htmltreexpather.pl does it; it uses these settings to parse "xml":

$tree->implicit_tags(0);
$tree->no_expand_entities(1);
$tree->ignore_unknown(0);
$tree->ignore_ignorable_whitespace(0);
$tree->no_space_compacting(1);
$tree->store_comments(1);
$tree->store_pis(1);