PerlMonks  

Hello again haukex,

The thread is interesting, and I did my best last night to provide an XML::Twig solution, but because of my limited understanding of XML in general I report here some things I do not understand about the file you presented as input.

First, I cheated: I grabbed the sample XML file before writing the program, because with XML I always take a try-and-check path.

Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and "transitional" can affect the approach to parsing the XML. My sin.

Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like.

So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:

no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
 at dontregexXML03.pl line 20.

After half an hour of searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all.

Every attempt to brutally cut the XML, removing lines and tags, ended with the very same error, at the same line (??).

So I tested YourMother's solution with your own modification, and I got many errors but also the correct result:

sample.html:11: HTML parser error : Element script embeds close tag
console.log(' <div class="data" id="Hello">World</div> ');
^
sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
<![CDATA[
^
sample.html:50: HTML parser error : Unexpected end tag : div
<div class="data" id="Bye">Bye</div>
^
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday

So I assumed the XML really did have some problems, but my other attempts to fix it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.

As a last resort I put the XML sample into a separate file and, TADA!, everything ran smoothly (not considering the &nbsp issue) with XML::Twig as presented above.
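One possible explanation, offered only as a guess because the script itself is not shown: "no element found at line 2, column 0, byte 39" is what one would expect if the parser was handed only the first line of the document (for example an XML declaration plus newline) and then hit end of input. That can happen when the DATA section is passed as parse(<DATA>): the readline runs in list context and, as far as I understand XML::Parser, only the first argument is taken as the document. It would also fit the error staying the same no matter which later lines are cut. The sketch below slurps the handle into a single string before parsing; the in-memory handle and the tiny document are made-up stand-ins for the real __DATA__ section and the thread's sample.

```perl
use strict;
use warnings;
use XML::Twig;

# Stand-in for the script's __DATA__ section: an in-memory filehandle
# over a tiny, made-up document (NOT the sample from the thread).
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <div class="data" id="Hello">World</div>
</root>
XML
open my $data, '<', \$sample
    or die "cannot open in-memory handle: $!";

# Slurp everything into ONE string. By contrast, parse(<$data>) would
# evaluate the readline in list context and pass one argument per line,
# so only the first line would be treated as the document.
my $xml = do { local $/; <$data> };

my $twig = XML::Twig->new;
$twig->parse($xml);

# Pull out the first <div> to show the whole document was parsed.
my $div = $twig->first_elt('div');
printf "%s=%s\n", $div->att('id'), $div->text;    # prints "Hello=World"
```

With a real DATA section the same slurp idiom applies, with DATA in place of $data. Reading from a separate file sidesteps the question entirely, which would match the observation that the file-based run worked.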

Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?
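A hedged note on the second question: the messages quoted above are all labelled "HTML parser error", which suggests they came from libxml2's HTML parser rather than its XML parser. An HTML parser has no notion of CDATA sections, so those complaints may say more about the parsing mode than about the document. One way to ask the XML parser directly whether a document is well-formed is to wrap XML::LibXML's load_xml in an eval and look at the exception. This is only a minimal sketch: the helper name wellformed_error and both sample strings are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns '' when $xml is well-formed XML,
# otherwise the parser's own error message as a string.
sub wellformed_error {
    my ($xml) = @_;
    return '' if eval { XML::LibXML->load_xml( string => $xml ); 1 };
    return "$@" || 'unknown parser error';
}

# Two made-up documents: the second never closes its <div>.
my $good = '<root><div class="data" id="Hello">World</div></root>';
my $bad  = '<root><div class="data" id="Bye">Bye</root>';

for my $doc ( $good, $bad ) {
    my $err = wellformed_error($doc);
    if ( $err eq '' ) {
        print "well-formed\n";
    }
    else {
        print "not well-formed: $err\n";
    }
}
```

This checks well-formedness only; validating against a DTD or schema is a separate step and is not shown here.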

If the thread continues, it could become the Rosetta stone of Perl XML parsing. Good one!

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

In reply to Re^3: Parsing HTML/XML with Regular Expressions (validation of the content) by Discipulus
in thread Parsing HTML/XML with Regular Expressions by haukex
