"How would you proceed to get the content of the inner html document?"
This will get you the inner HTML document:
use feature 'say';
use Mojo::DOM;

my $html = '<!DOCTYPE html>
<html>
<head>
<script>/*some ugly header stuff*/</script>
</head>
<body>
<html>
<head>
<script>/*some embedded document*/</script>
</head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>';

my $dom = Mojo::DOM->new( $html );

# 'html html' selects the html element nested inside the outer one.
# remove() drops its first child node (the whitespace before <head>) and
# returns the parent, so the inner document itself is what gets printed.
say $dom->at('html html')->child_nodes->first->remove;

This prints:

<html><head>
<script>/*some embedded document*/</script>
</head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
Mojo::DOM is very powerful, should you wish to extract or manipulate any of the subsequent HTML.
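As a rough sketch of that kind of follow-up extraction: once the nested html element has been selected, it can be queried with ordinary CSS selectors. This assumes the same parsing behaviour as above, where Mojo::DOM keeps the nested html element in place. The markup is a shortened copy of the example, and the variable names ($inner, $title, @paragraphs) are only illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Shortened copy of the example markup: a document wrapped in another one.
my $html = '<!DOCTYPE html>
<html>
<head></head>
<body>
<html>
<head></head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>';

# Select the nested html element; later queries are scoped to it.
my $inner = Mojo::DOM->new($html)->at('html html');

# Text of the heading inside the inner document.
my $title = $inner->at('h1')->text;

# Text of every paragraph inside the inner document only;
# the wrapped footer lives outside $inner and is not matched.
my @paragraphs = $inner->find('p')->map('text')->each;

say $title;
say for @paragraphs;
```

Scoping the selectors to $inner rather than to the whole DOM is what keeps the wrapper's own content, such as the footer paragraph, out of the results.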
In reply to Re: Parsing incorrect html by marto, in thread Parsing incorrect html by seki