in reply to Fixing ill-formed XML

I'll second davorg's advice: try to fix the source. If you try to fix it yourself you will have to make assumptions about what you get (starting with "even if tags are mixed there is just one way to make sense of it"), and one day these assumptions will not be true, your XML processing chain will be hosed, usually at the worst possible time... and you'll be in a lot of trouble.

That said... maybe tidy can make sense of it and spit out proper XML, especially if you are working with some kind of HTML-based not-quite-XML. Look especially at the "Teaching Tidy about new tags!" section of the docs.
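
As a rough illustration of what that section of the Tidy docs covers, a configuration file along these lines declares custom tags so that Tidy does not reject them. The tag names (note, ref) are hypothetical placeholders, and you should check the option names against the documentation for your Tidy version before relying on them:

```
// tidy.conf: the tag names below are made up for illustration
new-blocklevel-tags: note
new-inline-tags: ref
output-xml: yes
```

It would be used with something like `tidy -config tidy.conf input.html > output.xml`. Whether the result is acceptable depends entirely on how far the input strays from HTML.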

Re: Re: Fixing ill-formed XML
by mush4brains (Acolyte) on Dec 23, 2002 at 16:34 UTC
    Thanks for the replies, everyone. I'm no XML purist, but I agree "that ain't XML".
    However, the pragmatist in me believes there are circumstances that could benefit from an "XML patch" utility... Suppose I'm using multiple applications that are well-intended but not terribly well-behaved (and out of my control) that contribute markup that incorrectly nests new elements with existing elements. How do I handle the tag soup?
    E.g., fairly well-defined nesting issues, as in my original note:
    <a><b> this </a></b>
    I know "heavily overlapped" elements are very problematic and have no straightforward solution:
    <a> this <b> that </a> the other </b>
    However, "trivially overlapped" elements should be much easier to handle:
    <a> this <b></a> that </b>
    I've begun looking at HTML-Tidy, and so far it can handle some obvious nesting and overlap issues, though it's clearly more HTML-oriented (with some XML support).

    More generally, whether or not HTML-Tidy is up to it, I'm looking for a utility that can "fix" these well-defined nesting issues (ideally using given tag priorities to decide which elements should be ancestors of which) as well as trivially overlapped elements. If the errors are worse or unfixable, the utility should simply give up.
    Thanks again for your indulgence, mighty monks.
    - Jim W.
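
The kind of "XML patch" utility described in the reply above could be sketched roughly as follows. This is only a minimal illustration in Python (not Perl), under loud assumptions: tags have no attributes, there are no comments, CDATA sections or entities to worry about, and at most one element sits between a closing tag and its opener. The names `repair`, `tokenize` and `Unfixable` are invented for this sketch. It handles swapped closing tags and trivially overlapped elements, and gives up on anything else. Tag priorities are not implemented; when both fixes would apply, it simply prefers nesting.

```python
import re

# Assumption: tags carry no attributes and text contains no stray "<".
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")


class Unfixable(ValueError):
    """Raised when the markup is too broken for the simple rules below."""


def tokenize(markup):
    """Split markup into ("open" | "close" | "text", value) tuples."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(markup):
        if m.start() != pos:
            raise Unfixable(f"cannot parse input at offset {pos}")
        pos = m.end()
        if m.group(3) is not None:
            tokens.append(("text", m.group(3)))
        else:
            tokens.append(("close" if m.group(1) else "open", m.group(2)))
    if pos != len(markup):
        raise Unfixable(f"cannot parse input at offset {pos}")
    return tokens


def repair(markup):
    """Return well-nested markup, or raise Unfixable."""
    tokens = tokenize(markup)
    out = []    # repaired token list
    stack = []  # positions in `out` of the currently open tags
    i = 0
    while i < len(tokens):
        kind, name = tokens[i]
        if kind == "text":
            out.append(tokens[i])
        elif kind == "open":
            stack.append(len(out))
            out.append(tokens[i])
        elif stack and out[stack[-1]][1] == name:
            # Correctly nested close tag.
            stack.pop()
            out.append(tokens[i])
        elif len(stack) >= 2 and out[stack[-2]][1] == name:
            # Exactly one element (`inner`) is still open inside `name`.
            inner_at = stack[-1]
            inner = out[inner_at][1]
            # Look past whitespace for the close tag that should come first.
            j = i + 1
            while (j < len(tokens) and tokens[j][0] == "text"
                   and not tokens[j][1].strip()):
                j += 1
            if j < len(tokens) and tokens[j] == ("close", inner):
                # Swapped closes: <a><b> x </a></b> -> <a><b> x </b></a>
                out.append(("close", inner))
                out.extend(tokens[i + 1:j])
                out.append(("close", name))
                del stack[-2:]
                i = j
            elif all(k == "text" and not v.strip()
                     for k, v in out[inner_at + 1:]):
                # Trivial overlap: <a> x <b></a> y </b> -> <a> x </a><b> y </b>
                tail = out[inner_at:]
                del out[inner_at:]
                del stack[-2:]
                out.append(("close", name))
                stack.append(len(out))  # `inner` reopens after the close
                out.extend(tail)
            else:
                raise Unfixable(f"<{inner}> and <{name}> overlap heavily")
        else:
            raise Unfixable(f"stray or deeply mis-nested </{name}>")
        i += 1
    if stack:
        raise Unfixable("unclosed element(s) at end of input")
    fmt = {"open": "<{}>", "close": "</{}>", "text": "{}"}
    return "".join(fmt[k].format(v) for k, v in out)
```

With the three examples from the thread, `repair("<a><b> this </a></b>")` returns `<a><b> this </b></a>`, `repair("<a> this <b></a> that </b>")` returns `<a> this </a><b> that </b>`, and the heavily overlapped `<a> this <b> that </a> the other </b>` raises `Unfixable`. Every rule here encodes exactly the sort of assumption the first reply warns about, so a real tool should log each repair it makes rather than fixing things silently.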