How to parse HTML5?

by NRan (Novice)
on Mar 08, 2016 at 11:35 UTC ( [id://1157066]=perlquestion )

NRan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to parse HTML5 tags, but unfortunately I can't because of an error that I don't understand how to fix.

I use HTML::Tidy, and it can't parse <section>; it generates errors.

The input is:

<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org 2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org 1998/Math/MathML" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
<title>Electric Potential and Electric Potential Energy</title>
<meta charset="UTF-8"/>
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
<meta name="generator" content="PXE Tools version 1.39.69"/>
</head>
<body>
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</section>
</body>
</html>

My code is:

use warnings;
use strict;
use HTML::Tidy;

my $file_name = "d:/perl/test.xhtml";

# Use "or" rather than "||" so that die fires when open fails;
# "||" binds to the filename string and never triggers.
open my $xhtml_file, '<:encoding(UTF-8)', $file_name
    or die "no htm file found $!";

# Slurp the whole file; local restores $/ when the block ends.
my $contents = do { local $/; <$xhtml_file> };
close $xhtml_file;

my $tidy = HTML::Tidy->new();
$tidy->ignore( text => qr/DOCTYPE/,
               text => qr/html/,
               text => qr/meta/,
               text => qr/header/ );
$tidy->parse( "foo.html", $contents );

# Print every message that was not ignored.
for my $message ( $tidy->messages ) {
    print $message->as_string, "\n";
}

And the error log is:

foo.html (10:1) Error: <section> is not recognized!
foo.html (10:1) Warning: discarding unexpected <section>
foo.html (11:1) Error: <section> is not recognized!
foo.html (11:1) Warning: discarding unexpected <section>
foo.html (12:1) Error: <section> is not recognized!
foo.html (12:1) Warning: discarding unexpected <section>
foo.html (16:1) Warning: discarding unexpected </section>
foo.html (17:1) Warning: discarding unexpected </section>

Thanks
Nikhil Ranjan

Replies are listed 'Best First'.
Re: How to parse HTML5?
by Corion (Patriarch) on Mar 08, 2016 at 11:59 UTC

    Maybe <section> tags are not supposed to be nested and that's why HTML::Tidy is complaining about them?

    Why do you want to parse HTML? Personally, I like HTML::TreeBuilder, which gives me a tree I can later query for content. If you want to clean up HTML, maybe you can use the ->as_HTML method of the resulting HTML::Element to pretty-print it.

      Could you give me a short example using HTML::TreeBuilder?

      I also need an error log with the line number and column number, so that the end user can go to that line and correct it.

      Is that possible?

      Thanks
      Nikhil Ranjan

        What kind of errors do you want? If you're after finding malformed HTML, HTML::Tidy is better, because HTML::TreeBuilder will automatically correct much of the HTML.

Re: How to parse HTML5?
by CountZero (Bishop) on Mar 08, 2016 at 16:20 UTC
    HTML::Tidy does not parse HTML 5! (see the http://tidyp.com homepage)

    section is an HTML 5 feature.

    Hence, you get errors with HTML::Tidy if you feed it HTML that contains section tags.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: How to parse HTML5?
by dasgar (Priest) on Mar 08, 2016 at 17:52 UTC

    Did a little search and found HTML::Valid. It looks like it supports HTML5 and it looks like it will provide error messages that include the line number and column number that you are looking for.

Re: How to parse HTML5?
by choroba (Cardinal) on Mar 08, 2016 at 12:57 UTC
    Crossposted to StackOverflow. It's considered polite to inform about crossposting, so that people not attending both sites don't waste their efforts hacking a solution for a problem already solved at the other end of the internet.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Still this problem is not solved in both end. Okay, and I am not wasting my time or yours; I am just trying to solve my problem. If you can help, thanks; otherwise it's okay.

        Out of curiosity, why do you say that the problem has not been solved?

        Prior to your claim that "this problem is not solved in both end", I posted to both forums (here at PerlMonks and here at Stack Overflow) a suggestion to check out HTML::Valid.

        According to the documentation of HTML::Tidy, you need to have tidyp installed first. tidyp appears to be a fork of tidy, and that site indicates that it is the "HTML Tidy Legacy Website". The HTML::Valid module is based on the HTML Tidy project, and it does support HTML5.

        And I'll take it a bit further. Here's a demonstration of HTML::Valid on the OP's posted HTML/XHTML data.

        I created a test.html file with the following content (from the OP):

        And here's the Perl code that uses HTML::Valid to check that file:

        And here's the output of that script:

        That shows that HTML::Valid is not having issues dealing with <section> tags and that it also provides line numbers and column numbers, which the OP stated here as something that was needed. Unfortunately, it looks like HTML::Valid does not have an ignore method like the one the OP's HTML::Tidy code used, so the OP may need to write a little bit more code to filter out the messages concerning tags that the OP wants to ignore.

        Unless I totally misunderstood what the "problem" was, it looks like HTML::Valid "solves" the "problem".
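
        The extra filtering step mentioned above might look roughly like the sketch below. It uses only core Perl and works on message strings in the "file (line:column) Type: text" shape shown in the OP's HTML::Tidy error log; the sample messages are copied from that log. The helper names parse_message and filter_messages are hypothetical, and the regex is an assumption that would need adjusting to whatever format the chosen validator actually emits.

```perl
use strict;
use warnings;

# Sample messages copied from the error log earlier in this thread.
my @messages = (
    'foo.html (10:1) Error: <section> is not recognized!',
    'foo.html (10:1) Warning: discarding unexpected <section>',
    'foo.html (16:1) Warning: discarding unexpected </section>',
);

# Hypothetical helper: split one message into its parts.
# Assumes the "file (line:column) Type: text" layout seen above.
sub parse_message {
    my ($msg) = @_;
    return unless $msg =~ /^(\S+) \((\d+):(\d+)\) (Error|Warning): (.*)$/;
    return +{ file => $1, line => $2, column => $3, type => $4, text => $5 };
}

# Hypothetical helper: keep only messages whose text matches
# none of the ignore patterns; unparseable lines are skipped.
sub filter_messages {
    my ($ignore, @msgs) = @_;
    my @kept;
    for my $msg (@msgs) {
        my $parsed = parse_message($msg) or next;
        next if grep { $parsed->{text} =~ $_ } @$ignore;
        push @kept, $parsed;
    }
    return @kept;
}

# Ignore the "discarding unexpected" warnings and report the rest
# with the line and column the end user should jump to.
my @kept = filter_messages( [ qr/discarding unexpected/ ], @messages );
printf "%s at line %d, column %d: %s\n", @{$_}{qw(type line column text)}
    for @kept;
```

        Keeping the parsing separate from the filtering means the line and column stay available as plain numbers, so a report for the end user can be sorted or grouped by position if that turns out to be useful.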
