Hi,

I want to parse HTML5 tags, but I can't because of an error that I don't understand how to fix.

I am using HTML::Tidy, and it can't parse <section>; it reports an error for each one.

The input is:

<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org 2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
<title>Electric Potential and Electric Potential Energy</title>
<meta charset="UTF-8"/>
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
<meta name="generator" content="PXE Tools version 1.39.69"/>
</head>
<body>
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</section>
</body>
</html>

My code is:

use warnings;
use strict;
use HTML::Tidy;

my $file_name = "d:/perl/test.xhtml";

# Use "or" rather than "||" so that die actually fires when open fails
open my $xhtml_file, '<:encoding(UTF-8)', $file_name
    or die "no htm file found $!";

# Slurp the whole file; local $/ restores the line separator afterwards
my $contents = do { local $/; <$xhtml_file> };
close $xhtml_file;

my $tidy = HTML::Tidy->new();

# Suppress messages whose text matches any of these patterns
$tidy->ignore(
    text => qr/DOCTYPE/,
    text => qr/html/,
    text => qr/meta/,
    text => qr/header/,
);
$tidy->parse( "foo.html", $contents );
for my $message ( $tidy->messages ) {
    print $message->as_string, "\n";
}

And the error log is:

foo.html (10:1) Error: <section> is not recognized!

foo.html (10:1) Warning: discarding unexpected <section>

foo.html (11:1) Error: <section> is not recognized!

foo.html (11:1) Warning: discarding unexpected <section>

foo.html (12:1) Error: <section> is not recognized!

foo.html (12:1) Warning: discarding unexpected <section>

foo.html (16:1) Warning: discarding unexpected </section>

foo.html (17:1) Warning: discarding unexpected </section>

Thanks
Nikhil Ranjan

In reply to How to parse HTML5? by NRan
