in reply to Re^2: How to parse HTML5?
in thread How to parse HTML5?
Out of curiosity, why do you say that the problem has not been solved?
Prior to your claim that "this problem is not solved in both end", I had posted a suggestion on both forums (PerlMonks and Stack Overflow) to check out HTML::Valid.
According to the documentation of HTML::Tidy, you need to have tidyp installed first. tidyp appears to be a fork of tidy, whose site identifies itself as the "HTML Tidy Legacy Website". The HTML::Valid module, on the other hand, is based on the HTML Tidy project and it does support HTML5.
And I'll take it a bit further. Here's a demonstration of HTML::Valid on the OP's posted HTML/XHTML data.
I created a test.html file with the following content (from the OP):
    <?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org 2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
    <head>
    <link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
    <title>Electric Potential and Electric Potential Energy</title>
    <meta charset="UTF-8"/>
    <meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
    <meta name="generator" content="PXE Tools version 1.39.69"/>
    </head>
    <body>
    <section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
    <section class="frontmatter">
    <section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
    <ol>
    <li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
    </ol>
    </section>
    </section>
    </body>
    </html>
And here's the Perl code that uses HTML::Valid to check that file:
    use strict;
    use warnings;
    use feature 'say';
    use HTML::Valid;
    use Path::Tiny;

    # Read the whole test file into a string.
    my $file    = 'test.html';
    my $content = path($file)->slurp;

    # run() returns the tidied document and the diagnostic messages.
    my $validate = HTML::Valid->new();
    my ($output, $errors) = $validate->run($content);

    say "Output:\n$output\n";
    say "Errors:\n$errors";
And here's the output of that script:
    Output:
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html>
    <html xmlns:svg="http://www.w3.org 2000/svg" xmlns=
    "http://www.w3.org/1999/xhtml" xmlns:m=
    "http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
    <head>
    <meta name="generator" content=
    "HTML Tidy for HTML5 for Windows version 5.0.0" />
    <link rel="stylesheet" type="text/css" title="day" href=
    "../css/main.css" />
    <title>Electric Potential and Electric Potential Energy</title>
    <meta charset="UTF-8" />
    <meta name="dcterms.conformsTo" content=
    "PXE 1.39 ProductLevelReuse" />
    <meta name="generator" content="PXE Tools version 1.39.69" />
    </head>
    <body>
    <section class="chapter">
    <header>
    <h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1>
    </header>
    <section class="frontmatter">
    <section class="listgroup">
    <header>
    <h1 class="title">Big Ideas</h1>
    </header>
    <ol>
    <li>
    <p>Electric potential energy is similar to gravitational potential energy.</p>
    </li>
    </ol>
    </section>
    </section>
    </section>
    </body>
    </html>

    Errors:
    line 1 column 39 - Warning: missing <!DOCTYPE> declaration
    line 10 column 1 - Warning: missing </section>
    line 1 column 39 - Warning: <html> proprietary attribute "xmlns:svg"
    line 1 column 39 - Warning: <html> proprietary attribute "xmlns:m"
    Info: Document content looks like XHTML5
    4 warnings, 0 errors were found!
That shows that HTML::Valid has no trouble with <section> tags and that it also provides line and column numbers, which the OP stated was something that was needed. Unfortunately, it looks like HTML::Valid does not have an ignore method like the one the OP's HTML::Tidy code used, so the OP may need to write a little more code to filter out the messages concerning tags that the OP wants to ignore.
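As a rough idea of what that extra code might look like, here is a minimal sketch. It assumes the messages arrive as one string with one message per line, in the format shown in the output above. The filter_messages helper is my own hypothetical name, not part of HTML::Valid, and it only does plain text matching on the tag name in each message.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of HTML::Valid): remove every message
# line that mentions one of the given tag names, e.g. <html> or </section>.
sub filter_messages {
    my ($errors, @ignore_tags) = @_;
    return $errors unless @ignore_tags;

    # Build an alternation such as (?:html|section), escaping each name.
    my $tags   = join '|', map { quotemeta } @ignore_tags;
    my $tag_re = qr{</?(?:$tags)>}i;

    # split /^/ keeps the trailing newline on each line.
    return join '', grep { $_ !~ $tag_re } split /^/, $errors;
}

# Sample messages copied from the output above.
my $errors = <<'END';
line 1 column 39 - Warning: missing <!DOCTYPE> declaration
line 10 column 1 - Warning: missing </section>
line 1 column 39 - Warning: <html> proprietary attribute "xmlns:svg"
line 1 column 39 - Warning: <html> proprietary attribute "xmlns:m"
END

# Drop the two messages about the <html> tag.
print filter_messages($errors, 'html');
```

With the sample messages, that prints only the <!DOCTYPE> and </section> warnings. Note that the summary line ("4 warnings, 0 errors were found!") is not recalculated by this sketch, so it would need separate handling if the real $errors string is passed in.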
Unless I totally misunderstood what the "problem" was, it looks like HTML::Valid "solves" the "problem".