Another technique I've used (though only for outputting XML with XML::Writer) is to filter the data. In my case I tied the filehandle that XML::Writer was writing to, and the tie filtered the data to make it XML compliant.
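To make the tie idea concrete, here is a minimal sketch of a tied output handle, not the code I actually used. The package name XmlCleanHandle is made up for illustration, and I'm assuming that "XML compliant" means dropping the control characters XML 1.0 forbids; your own filter may need to do something else.

```perl
use strict;
use warnings;

# Hypothetical package name; it is not part of XML::Writer or any CPAN module.
package XmlCleanHandle;

sub TIEHANDLE {
    my ($class, $out) = @_;    # $out is the real filehandle we forward to
    return bless { out => $out }, $class;
}

sub PRINT {
    my ($self, @data) = @_;
    my $text = join '', @data;
    # Assumption: "compliant" here means removing the control characters
    # XML 1.0 does not allow (everything below 0x20 except tab, LF and CR).
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return print { $self->{out} } $text;
}

sub PRINTF {
    my ($self, $fmt, @args) = @_;
    return $self->PRINT(sprintf $fmt, @args);
}

package main;

# Write through the tie into an in-memory buffer so the effect is visible.
my $buffer = '';
open my $real, '>', \$buffer or die "cannot open in-memory handle: $!";
tie *CLEAN, 'XmlCleanHandle', $real;
print CLEAN "<note>bell\x07 removed</note>\n";
close $real;

print $buffer;
```

In principle a glob tied like this could be handed to XML::Writer as its OUTPUT, but I haven't verified that against every version, so treat it as a starting point.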

So how would that work here, you ask?

Thinking loosely, you'd write a tie or your own IO class to filter the input, then have XML::Parser read from that handle. Using XML::Parser handlers, you'd recognize when you were inside or outside your data tags and call the tie or custom handle to switch its filtering on or off. The filter itself becomes pretty trivial, I think: you convert angle brackets and ampersands to XML character entities, and you do whatever you need to with 8-bit data that doesn't fit.
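The character-level filter could look roughly like this. It's only a sketch: escape_for_xml is a name I made up, and turning high-bit bytes into numeric character references assumes the data is Latin-1, which may not be true for your files.

```perl
use strict;
use warnings;

# The three characters that must not appear literally inside XML text.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

# Hypothetical helper: escape markup characters so the text is safe as XML data.
sub escape_for_xml {
    my ($text) = @_;
    # A single pass over the string, so ampersands are never double-escaped.
    $text =~ s/([&<>])/$entity{$1}/g;
    # Assumption: high-bit bytes are Latin-1, so the byte value is the code point.
    $text =~ s/([\x80-\xFF])/sprintf '&#%d;', ord $1/ge;
    return $text;
}

print escape_for_xml('<b>Fish & chips</b>'), "\n";
# prints: &lt;b&gt;Fish &amp; chips&lt;/b&gt;
```

The tie or custom handle would only apply this to the text between your data tags and pass everything else through untouched.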

The power here (IMO) is that you're separating the filtering of HTML from the parsing of XML. You can extend that HTML handling independently as needed, again using existing tools like HTML::Parser. The problem with rolling your own is that it seems simple until you hit all the exceptions. As an example, read the Perl docs on why you shouldn't parse HTML yourself. Anything beyond the character-by-character filtering I describe above will fail miserably once the data starts to have tags spanning lines, nested tags, etc.

Contrary to something you said earlier, you don't necessarily need to know all your tag names to do this. But there has to be some predictability to your documents for you to write any parser at all. So don't confuse what you think XML::Parser needs with this general requirement. I think rolling your own will get you no further on that front than an existing, robust parser will.

Filtering and using standard interfaces is an approach I prefer. It fits that UNIX-like philosophy of not reinventing the wheel and using existing tools as filters to coerce things into models that are predictable.

In short, bring the work of others to bear on the problem at hand, and focus only on the exceptions unique to your case.


In reply to Re: parsing XMLish data by steves
in thread parsing XMLish data by gav^
