How to parse HTML5?

NRan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to parse HTML5 tags, but I can't because of an error that I don't understand how to fix.

I use HTML::Tidy, and it can't parse <section>; it generates an error.

The input is:

<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org 2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
<title>Electric Potential and Electric Potential Energy</title>
<meta charset="UTF-8"/>
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
<meta name="generator" content="PXE Tools version 1.39.69"/>
</head>
<body>
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</section>
</body>
</html>

My code is:

use warnings;
use strict;
use HTML::Tidy;

my $file_name = "d:/perl/test.xhtml";

# Slurp the whole file. Use "or" rather than "||" so that die fires if open fails.
open my $xhtml_file, '<:encoding(UTF-8)', $file_name
    or die "Cannot open $file_name: $!";
my $contents = do { local $/; <$xhtml_file> };
close $xhtml_file;

my $tidy = HTML::Tidy->new();
# Suppress messages whose text matches any of these patterns.
$tidy->ignore(
    text => qr/DOCTYPE/,
    text => qr/html/,
    text => qr/meta/,
    text => qr/header/,
);
$tidy->parse( "foo.html", $contents );
for my $message ( $tidy->messages ) {
    print $message->as_string, "\n";
}

The error log is:

foo.html (10:1) Error: <section> is not recognized!

foo.html (10:1) Warning: discarding unexpected <section>

foo.html (11:1) Error: <section> is not recognized!

foo.html (11:1) Warning: discarding unexpected <section>

foo.html (12:1) Error: <section> is not recognized!

foo.html (12:1) Warning: discarding unexpected <section>

foo.html (16:1) Warning: discarding unexpected </section>

foo.html (17:1) Warning: discarding unexpected </section>

Thanks
Nikhil Ranjan


Replies are listed 'Best First'.
Re: How to parse HTML5? by Corion (Patriarch) on Mar 08, 2016 at 11:59 UTC
Maybe `<section>` tags are not supposed to be nested and that's why HTML::Tidy is complaining about them? Why do you want to parse HTML? Personally, I like HTML::TreeBuilder, which gives me a tree I can later query for content. If you want to clean up HTML, maybe you can use the `->as_HTML` method of the resulting HTML::Element to pretty-print it.
Re^2: How to parse HTML5? by NRan (Novice) on Mar 08, 2016 at 12:13 UTC
Could you give me a short example with HTML::TreeBuilder? I also need an error log with line numbers and column numbers, so that the end user can go to that line and correct it. Is that possible? Thanks, Nikhil Ranjan
Re^3: How to parse HTML5? by Corion (Patriarch) on Mar 08, 2016 at 12:17 UTC
What kind of errors do you want? If you're after finding malformed HTML, HTML::Tidy is better, because HTML::TreeBuilder will automatically correct much of the HTML.
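For reference, a minimal sketch of the HTML::TreeBuilder approach mentioned in this subthread. The sample markup is a shortened, hypothetical stand-in for the input in the question. The sketch assumes that the installed HTML::Tagset may not know HTML5 elements such as `<section>`, so it switches off `ignore_unknown` before parsing. Because the parser repairs markup silently, this gives a queryable tree and pretty-printed output, not an error log with line and column numbers.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, shortened from the markup in the question.
my $html = <<'HTML';
<html><head><title>Test</title></head>
<body>
<section class="chapter"><h1 class="title">Electric Potential</h1>
<section class="frontmatter"><p>Big Ideas</p></section>
</section>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;

# By default the parser drops tags it does not recognise; keep them,
# in case <section> is unknown to the installed HTML::Tagset.
$tree->ignore_unknown(0);
$tree->parse_content($html);

# Query the tree: collect the class attribute of every <section> element.
my @classes = map { $_->attr('class') // '(none)' }
              $tree->look_down( _tag => 'section' );
print "section class=$_\n" for @classes;

# Pretty-print the tree with two-space indentation and all end tags.
print $tree->as_HTML( undef, '  ', {} );

$tree->delete;    # release the tree explicitly
```

The empty hash passed as the third argument to `as_HTML` means that no end tags are treated as optional, so every element is closed explicitly in the output.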
Re^4: How to parse HTML5? by NRan (Novice) on Mar 08, 2016 at 12:20 UTC
Re^5: How to parse HTML5? by poj (Abbot) on Mar 08, 2016 at 13:28 UTC
Some notes below your chosen depth have not been shown here
Re: How to parse HTML5? by CountZero (Bishop) on Mar 08, 2016 at 16:20 UTC
HTML::Tidy does not parse HTML 5! (see http://tidyp.com homepage) `section` is an HTML 5 feature. Hence, you get errors with HTML::Tidy if you feed it HTML that contains `section` tags.

CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
Re: How to parse HTML5? by dasgar (Priest) on Mar 08, 2016 at 17:52 UTC
I did a little searching and found HTML::Valid. It looks like it supports HTML5, and it looks like it will provide error messages that include the line number and column number that you are looking for.
Re: How to parse HTML5? by choroba (Cardinal) on Mar 08, 2016 at 12:57 UTC
Crossposted to StackOverflow. It's considered polite to inform about crossposting, so that people not attending both sites don't waste their efforts hacking a solution for a problem already solved at the other end of the internet.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^2: How to parse HTML5? by Anonymous Monk on Mar 09, 2016 at 09:35 UTC
This problem is still not solved at either end. And I am not wasting my time or yours; I am just trying to solve my problem. If you can help, thanks; otherwise, that's okay.
Re^3: How to parse HTML5? by dasgar (Priest) on Mar 10, 2016 at 06:42 UTC
Out of curiosity, why do you say that the problem has not been solved? Before you claimed that "this problem is not solved in both end", I had posted a suggestion to check out HTML::Valid on both forums (here at PerlMonks and at Stack Overflow).

According to the documentation of HTML::Tidy, you need to have tidyp installed first. tidyp appears to be a fork of tidy, and that site indicates that it is the "HTML Tidy Legacy Website". The HTML::Valid module is based on the HTML Tidy project, and it does support HTML5.

And I'll take it a bit further. Here's a demonstration of HTML::Valid on the OP's posted HTML/XHTML data. I created a test.html file with the content posted by the OP (collapsed in the original post, 2 kB), checked that file with a Perl script that uses HTML::Valid (collapsed, 527 bytes), and captured the script's output (collapsed, 2 kB).

That shows that HTML::Valid has no issues dealing with <section> tags and that it also provides the line numbers and column numbers that the OP said were needed. Unfortunately, it looks like HTML::Valid does not have an ignore method like the one used in the OP's HTML::Tidy code, so the OP may need to write a little more code to filter out the messages about tags that the OP wants to ignore.

Unless I totally misunderstood what the "problem" was, it looks like HTML::Valid "solves" the "problem".
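The extra filtering step could look roughly like this. It is a sketch, not the collapsed script from the post: `filter_messages` is a hypothetical helper, the sample messages are made up for illustration, and the "line N column M" layout is an assumption about the message format. The only HTML::Valid calls used are `new` and `run`, with `run` returning the tidied output followed by the error text.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every message line that matches one of the
# given patterns, roughly what ignore( text => ... ) did for HTML::Tidy.
sub filter_messages {
    my ( $errors, @ignore ) = @_;
    my @kept;
    for my $line ( split /\n/, $errors ) {
        next unless length $line;                 # skip empty lines
        next if grep { $line =~ $_ } @ignore;     # skip ignored messages
        push @kept, $line;
    }
    return @kept;
}

# Illustrative messages only; the exact layout is an assumption.
my $errors = <<'EOF';
line 1 column 39 - Warning: missing <!DOCTYPE> declaration
line 10 column 1 - Warning: trimming empty <p>
EOF

# If a file name is given, replace the sample with real messages.
# HTML::Valid is loaded only in that case.
if ( my $file = shift @ARGV ) {
    require HTML::Valid;
    open my $fh, '<:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };
    close $fh;
    ( undef, $errors ) = HTML::Valid->new->run($html);
}

my @kept = filter_messages( $errors, qr/DOCTYPE/, qr/meta/ );
print "$_\n" for @kept;
```

Run without arguments, the script filters the sample messages; run with a file name such as test.html, it filters whatever HTML::Valid reports for that file.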
Re^4: How to parse HTML5? by NRan (Novice) on Mar 10, 2016 at 12:00 UTC
Re^5: How to parse HTML5? by dasgar (Priest) on Mar 11, 2016 at 05:59 UTC
Re^5: How to parse HTML5? by NRan (Novice) on Mar 10, 2016 at 12:12 UTC
Re^4: How to parse HTML5? by sandeepb (Novice) on Mar 10, 2016 at 10:55 UTC

