in reply to parsing XMLish data

Another technique I've used (though only for outputting XML, with XML::Writer) is to filter the data. In my case I tied the handle XML::Writer was using, and that tie filtered the data to make it XML-compliant.
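To make that concrete, here's a minimal sketch of the idea (not my actual code; the class name and toggle methods are made up). The tie's PRINT escapes data on its way to the real handle. In practice you'd switch the filter on only around character data, so the tags XML::Writer itself writes pass through untouched:

```perl
# Sketch only: a tied handle that escapes markup characters on output.
package XMLFilterHandle;

sub TIEHANDLE {
    my ($class, $fh) = @_;
    return bless { fh => $fh, active => 1 }, $class;
}

sub PRINT {
    my ($self, @data) = @_;
    if ($self->{active}) {
        for (@data) {
            s/&/&amp;/g;    # ampersand first, or it re-escapes the rest
            s/</&lt;/g;
            s/>/&gt;/g;
        }
    }
    print { $self->{fh} } @data;
}

sub filter_on  { $_[0]{active} = 1 }
sub filter_off { $_[0]{active} = 0 }

package main;

tie *FILTERED, 'XMLFilterHandle', \*STDOUT;
print FILTERED "a < b & c\n";    # prints "a &lt; b &amp; c"
```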

So how would that work here, you ask?

Loosely speaking, I think you'd write a tie, or come up with your own IO class, to filter the input. You'd then have XML::Parser read from that handle. Using XML::Parser handlers, you'd recognize when you were inside or outside your data tags and tell the tie or custom handle whether to filter. That filter is pretty trivial, I think: you convert the angle brackets and ampersands to XML character entities, and you do whatever you need to with 8-bit data that doesn't fit.
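The filtering step itself might look like this (a sketch; xml_escape is just my name for it, and I'm assuming Latin-1 input for the 8-bit data, since the right treatment depends on your source encoding):

```perl
# The "pretty trivial" filter: escape markup characters and turn 8-bit
# bytes into numeric character references. Latin-1 input is assumed.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # order matters: ampersand first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/([\x80-\xFF])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

print xml_escape("if a<b & c>d\n");   # if a&lt;b &amp; c&gt;d
```

Your Start and End handlers for the data tags would flip a flag telling the tied handle whether to run reads through this. One wrinkle to watch: the data element's own closing tag has to survive unescaped, or the parser will never see the element end.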

The power here (IMO) is that you're separating the filtering of HTML from the parsing of XML. You can beef up the HTML handling independently as needed, again using existing tools like HTML::Parser. The problem with rolling your own is that it seems simple until you hit all the exceptions; read the Perl documentation on why you shouldn't try to parse HTML yourself for examples. Anything beyond the character-level filtering I describe above will fail miserably once the data changes to include tags spanning lines, nested tags, and so on.

Contrary to something you said earlier, you don't necessarily need to know all your tag names to do this. But there has to be some predictability in your documents before you can write any parser, so don't confuse that general requirement with something XML::Parser specifically needs -- rolling your own, I think you'll get about as far as you would with an existing, robust parser, and no farther.

Filtering through standard interfaces is the approach I prefer. It fits the UNIX philosophy of not reinventing the wheel: use existing tools as filters to coerce your data into predictable forms.

In short, bring the work of others to bear on the problem at hand, and spend your own effort only on the exceptions unique to your case.