Hello again haukex,

the thread is interesting and I made my best last night to provide an XML::Twig solution, but due to limited understanding of the XML in general I report here some thing i do not understand about the file you presentend as input.

First I cheated because I get the sample XML file before writing the program, because with XML i always go for a try-and-check path..

Second, in my wide ignorance, I really dont know how XHTML, DTD, DOM and transitional can affect the approach to the XML to parse. My sin.

Third: if XML::Twig (the only module I use for these task) complains about the document I'll use W3C validator to check the content, before crashing my head with the content, task i very dont like.

So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:

no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64 +bit/perl/vendor/lib/XML/Parser.pm line 187. at dontregexXML03.pl line 20.

After half an hour searching the web I ended reading of xpath bugs dated 2009 but no clue at all.

Any attempt to brutally cut the XML, removing lines and tags ended with the very same error, at the same line (??).

So I tested the YourMother's solution with your own modification and I get many errors but also the correct solution:

sample.html:11: HTML parser error : Element script embeds close tag console.log(' <div class="data" id="Hello">World</div> '); ^ sample.html:49: HTML parser error : htmlParseStartTag: invalid element + name <![CDATA[ ^ sample.html:50: HTML parser error : Unexpected end tag : div <div class="data" id="Bye">Bye</div> ^ Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=F +riday, Six=Saturday, Seven=Sunday

So i assumed the XML had some problems effectively: my others attempts to fix it using such detailed reports emitted by XML::LibXML had no more luck that previous ones.

As last resource i put the XML sample into a separate file and: TADA' all run smooth (not considering the &nbsp issue) with XML::Twig as presented above.

Any suggestion? Which is the best module to report formal errors in the XML structure? are the above reported errors real ones or are due to limits of the parsing module?

If the thread will continue can be the Rosetta of Perl XML parsing. Goood one!

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

In reply to Re^3: Parsing HTML/XML with Regular Expressions (validation of the content) by Discipulus
in thread Parsing HTML/XML with Regular Expressions by haukex

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.