seki has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I need to parse some (externally generated) HTML and, ideally, extract the contents of the body to produce some new content.
(For the curious: the current production processing, done by an outsourced resource in India, collects some files and concatenates them into a single one (yes, with all the individual doctype, html, head and body tags). We are lucky that it even displays something readable in a browser!)
So I thought about using HTML::TreeBuilder, but some of the individual files are themselves not well-formed, with content already wrapped in another file (sure, when all you have is a hammer, everything looks like a nail...), so my attempt to get the body gives a weird result:
    use strict;
    use warnings;
    use HTML::Tree;

    my $tree = HTML::TreeBuilder->new;
    $tree->warn(1);

    my $content;
    {
        local $/ = undef;    #slurp
        $content = <DATA>;
    }
    $tree->parse($content);

    foreach my $tag ( $tree->look_down('_tag', 'body') ) {
        print "------\n" . $tag->as_HTML;
    }
    $tree = $tree->delete;

    __DATA__
    <!DOCTYPE html>
    <html>
    <head>
    <script>/*some ugly header stuff*/</script>
    </head>
    <body>
    <html>
    <head>
    <script>/*some embedded document*/</script>
    </head>
    <body>
    <h1>Hello</h1>
    <p>this is a test</p>
    <p>this is a second test</p>
    </body>
    </html>
    <p>some kind of wrapped footer</p>
    </body>
    </html>

Result (the two bodies seem mixed in a single item):

    HTML::Parse: Found a nested <html> element
    HTML::Parse: Found a second <head> element
    HTML::Parse: Found a second <body> element
    ------
    <body><h1>Hello</h1><p>this is a test<p>this is a second test<p>some kind of wrapped footer </body>
How would you proceed to get the content of the inner HTML document? Use another package? I have looked through the options of HTML::Parser, which TreeBuilder uses, but did not see anything relevant.
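To make the goal concrete, here is a minimal sketch of one possible workaround, using only core Perl and no HTML module: pull out every body element that contains no further body start tag before handing anything to a tree builder. The helper name `innermost_bodies` is hypothetical (it does not come from HTML::TreeBuilder or HTML::Parser), and the pattern is only an assumption that works for input shaped like the sample above. It would be fooled by a literal `<body` inside a script or comment, so treat it as a pre-filter, not a parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any CPAN module: return the content of
# each <body> element that does not itself contain another <body> start tag.
sub innermost_bodies {
    my ($html) = @_;
    my @bodies;
    # (?:(?!<body\b).)*? consumes characters lazily, but refuses to step
    # over a nested <body ...> start tag, so only innermost bodies match.
    while ( $html =~ m{<body\b[^>]*>((?:(?!<body\b).)*?)</body\s*>}gis ) {
        push @bodies, $1;
    }
    return @bodies;
}

# Same sample document as in the question.
my $content = <<'HTML';
<!DOCTYPE html>
<html>
<head>
<script>/*some ugly header stuff*/</script>
</head>
<body>
<html>
<head>
<script>/*some embedded document*/</script>
</head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>
HTML

my @bodies = innermost_bodies($content);
print "------\n$_\n" for @bodies;
```

Each extracted fragment could then be parsed on its own, which avoids the merging of the nested documents. Whether that is acceptable depends on how irregular the real files are.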
Replies are listed 'Best First'.

Re: Parsing incorrect html
by Discipulus (Canon) on Jun 07, 2017 at 14:56 UTC

Re: Parsing incorrect html
by marto (Cardinal) on Jun 07, 2017 at 14:53 UTC

Re: Parsing incorrect html
by choroba (Cardinal) on Jun 07, 2017 at 16:39 UTC

Re: Parsing incorrect html
by haukex (Archbishop) on Jun 07, 2017 at 14:36 UTC

Re: Parsing incorrect html
by tybalt89 (Monsignor) on Jun 07, 2017 at 15:11 UTC

Re: Parsing incorrect html ("xml")
by Anonymous Monk on Jun 07, 2017 at 23:33 UTC