in reply to Parsing incorrect html
I tried XML::Twig and got quite good results (see the XML::Twig tutorial):
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => {
        # $_[1] is the element
        'html/body/html' => sub { $_[1]->print; },
    },
);

my $data = <<EOXML;
<!DOCTYPE html>
<html>
<head>
<script>/*some ugly header stuff*/</script>
</head>
<body>
<html>
<head>
<script>/*some embedded document*/</script>
</head>
<body>
<h1>Hello</h1>
<p>this is a test</p>
<p>this is a second test</p>
</body>
</html>
<p>some kind of wrapped footer</p>
</body>
</html>
EOXML

$t->parse($data);

## output
<html>
  <head>
    <script>/*some embedded document*/</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>this is a test</p>
    <p>this is a second test</p>
  </body>
</html>
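If you would rather keep the embedded document in a variable than print it, something along these lines might work. This is only a sketch: the helper name `embedded_html` and the sample `$doc` are made up for illustration, and it assumes the input is well-formed XML, since XML::Twig dies on parse errors.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of the original post): returns the first
# embedded <html> element as a string, or undef if none is found.
sub embedded_html {
    my ($xml) = @_;
    my $found;
    my $t = XML::Twig->new(
        twig_handlers => {
            # Same path as above; sprint returns the element's markup
            # as a string instead of printing it.
            'html/body/html' => sub { $found //= $_[1]->sprint; },
        },
    );
    $t->parse($xml);    # dies if $xml is not well-formed
    return $found;
}

# Made-up sample input: an inner document wrapped in an outer one.
my $doc = '<html><body><html><body><p>inner</p></body></html>'
        . '<p>footer</p></body></html>';

print embedded_html($doc), "\n";
```

Using `sprint` rather than `print` lets you post-process the extracted document (save it to a file, feed it to another parser) instead of writing it straight to STDOUT.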