NRan has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I want to parse HTML5 tags, but I can't because of an error that I don't understand how to fix.
I am using HTML::Tidy, and it can't parse <section>; it generates an error.
The input is:
<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org 2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
<title>Electric Potential and Electric Potential Energy</title>
<meta charset="UTF-8"/>
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
<meta name="generator" content="PXE Tools version 1.39.69"/>
</head>
<body>
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</section>
</body>
</html>
My code is:
use warnings;
use strict;
use HTML::Tidy;

my $file_name = "d:/perl/test.xhtml";

# Slurp the whole file as UTF-8 text. "or" is used instead of "||" so that
# die actually fires when open fails.
open my $xhtml_file, '<:encoding(UTF-8)', $file_name
    or die "no htm file found $!";
my $contents = do { local $/; <$xhtml_file> };
close $xhtml_file;

my $tidy = HTML::Tidy->new();

# Ignore messages whose text matches any of these patterns.
$tidy->ignore( text => qr/DOCTYPE/,
               text => qr/html/,
               text => qr/meta/,
               text => qr/header/ );
$tidy->parse( "foo.html", $contents );

# Print every message that was not ignored.
for my $message ( $tidy->messages ) {
    print $message->as_string, "\n";
}
And the error log is:
foo.html (10:1) Error: <section> is not recognized!
foo.html (10:1) Warning: discarding unexpected <section>
foo.html (11:1) Error: <section> is not recognized!
foo.html (11:1) Warning: discarding unexpected <section>
foo.html (12:1) Error: <section> is not recognized!
foo.html (12:1) Warning: discarding unexpected <section>
foo.html (16:1) Warning: discarding unexpected </section>
foo.html (17:1) Warning: discarding unexpected </section>
Re: How to parse HTML5?
by Corion (Patriarch) on Mar 08, 2016 at 11:59 UTC
by NRan (Novice) on Mar 08, 2016 at 12:13 UTC
by Corion (Patriarch) on Mar 08, 2016 at 12:17 UTC
by NRan (Novice) on Mar 08, 2016 at 12:20 UTC
by poj (Abbot) on Mar 08, 2016 at 13:28 UTC
Re: How to parse HTML5?
by CountZero (Bishop) on Mar 08, 2016 at 16:20 UTC
Re: How to parse HTML5?
by dasgar (Priest) on Mar 08, 2016 at 17:52 UTC
Re: How to parse HTML5?
by choroba (Cardinal) on Mar 08, 2016 at 12:57 UTC
by Anonymous Monk on Mar 09, 2016 at 09:35 UTC
by dasgar (Priest) on Mar 10, 2016 at 06:42 UTC
by NRan (Novice) on Mar 10, 2016 at 12:00 UTC
by dasgar (Priest) on Mar 11, 2016 at 05:59 UTC
by NRan (Novice) on Mar 10, 2016 at 12:12 UTC
by sandeepb (Novice) on Mar 10, 2016 at 10:55 UTC