Hi Monks,
I need to parse some externally generated HTML and, ideally, extract the contents of the body to produce some new content.
(For the curious: the current production process, run by an Indian outsourced resource, collects several files and concatenates them into a single one, complete with every individual doctype, html, head and body tag. We are lucky that it even displays something readable in a browser!)
So I thought about using HTML::TreeBuilder, but some of the individual files are themselves not well-formed: their content is already wrapped in another document (sure, when all you have is a hammer, everything looks like a nail...). As a result, my attempt to get the body gives a weird result:
use strict;
use warnings;
use HTML::Tree;

my $tree = HTML::TreeBuilder->new;
$tree->warn(1);

my $content;
{
    local $/ = undef;    # slurp
    $content = <DATA>;
}
$tree->parse($content);

foreach my $tag ( $tree->look_down('_tag', 'body') ) {
    print "------\n" . $tag->as_HTML;
}
$tree = $tree->delete;

__DATA__
<!DOCTYPE html>
<html>
<head>
<script>/*some ugly header stuff*/</script>
</head>
<body>
<html>
<head>
<script>/*some embedded document*/</script>
</head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>

Result (the two bodies seem mixed in a single item):

HTML::Parse: Found a nested <html> element
HTML::Parse: Found a second <head> element
HTML::Parse: Found a second <body> element
------
<body><h1>Hello</h1><p>this is a test<p>this is a second test<p>some kind of wrapped footer </body>
How would you proceed to get the content of the inner HTML document? Use another package? I have looked through the options of HTML::Parser (which TreeBuilder uses) but did not see anything relevant.
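One workaround I am considering, as a rough sketch rather than a proper parser-based solution, is to cut out the innermost body with a regex before handing anything to the tree builder. The helper name innermost_body is my own invention (it is not part of HTML::TreeBuilder or HTML::Parser), and the sketch assumes that the document of interest is the one opened by the last <body> tag and that no "<body" text hides inside comments or scripts.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): return the text between
# the last <body ...> open tag and the first </body> that follows it.
sub innermost_body {
    my ($html) = @_;

    # The greedy .* runs forward to the last <body> open tag; the lazy
    # (.*?) then stops at the first </body> after that tag.
    my ($inner) = $html =~ m{.*<body\b[^>]*>(.*?)</body\s*>}is;

    return $inner;    # undef when no <body>...</body> pair is found
}

# Tiny stand-in for the nested document shown above.
my $sample = '<html><body><html><body><h1>Hello</h1></body></html>'
           . '<p>footer</p></body></html>';

print innermost_body($sample), "\n";
```

The extracted fragment could then be passed to $tree->parse as before. This would not help for files where several complete documents sit side by side, since it only keeps the last body.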
In reply to Parsing incorrect html by seki