mush4brains has asked for the wisdom of the Perl Monks concerning the following question:

Yes, "XML". I know this isn't XMLMonks.org, but gosh-darnit, I can't think of a group of experts I'd rather ask about this... and I can't think of a better language to code the solution in...
So -- anyone aware of existing applications/scripts to "fix" non-well-formed XML, where the only sin is overlapping elements? (I.e., otherwise well-formed)
For example, given the following XML snippet:
<A><B>...</A></B>
I'd like to switch the end-tags. In this example, A and B have precisely the same content, so this fix is straightforward and well-defined. (I'm ignoring DTD/schema enforcement here.) Generalizing a bit, these can nest deeper and get less trivial:
<D><A><B><C>...</D></C></B></A>
...or (shudder)...
<D><A><B><C>...</D></A></C></B>
Note all begin-tags have a corresponding end-tag.
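One way to picture the end-tag swap in the examples above: keep a stack of open elements, and whenever a run of adjacent end tags closes exactly the innermost open elements (in whatever order), re-emit that run in proper stack order. What follows is only a rough sketch, written in Python for brevity rather than Perl. The function name `reorder_end_tags` and the regex tokenizer are made up for illustration and are not an existing module; the sketch assumes there are no comments, CDATA sections, or processing instructions, and that end tags in a run are not separated by whitespace.

```python
import re

# Deliberately naive tag matcher (an assumption of this sketch): it ignores
# comments, CDATA sections, processing instructions and '>' inside attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*)>")


def reorder_end_tags(text):
    """Re-emit each run of adjacent end tags in proper nesting order."""
    out = []      # pieces of the repaired document
    stack = []    # names of the currently open elements
    run = []      # names in the current run of adjacent end tags
    pos = 0

    def flush():
        # The run must close exactly the innermost len(run) open elements,
        # in any order; anything else would change content, so give up.
        if not run:
            return
        top = stack[-len(run):]
        if sorted(top) != sorted(run):
            raise ValueError("cannot reorder end tags: %s" % run)
        out.extend("</%s>" % name for name in reversed(top))
        del stack[-len(run):]
        run.clear()

    for m in TAG.finditer(text):
        between = text[pos:m.start()]
        if between:               # any text ends the current run of end tags
            flush()
            out.append(between)
        pos = m.end()
        closing, name, rest = m.groups()
        if closing:
            run.append(name)
        else:
            flush()
            out.append(m.group(0))
            if not rest.rstrip().endswith("/"):   # skip self-closing tags
                stack.append(name)

    flush()
    out.append(text[pos:])
    if stack:
        raise ValueError("unclosed elements: %s" % stack)
    return "".join(out)


print(reorder_end_tags("<A><B>...</A></B>"))
# <A><B>...</B></A>
print(reorder_end_tags("<D><A><B><C>...</D></A></C></B>"))
# <D><A><B><C>...</C></B></A></D>
```

Because the sketch only ever permutes adjacent end tags, every element keeps exactly the content it had; input with content between the crossed end tags raises `ValueError` instead of being guessed at.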

Also, I dream of one day finding an elegant way to handle non-trivial overlapping elements, call them "pseudo-elements". I know none of the current standards (XML parsers, XPath/XSL/...) could handle that, naturally, so it wouldn't be as useful with them. This is a much harder problem, and one I think XML-ers just consider "out-of-scope". But a guy can dream, can't he?

Anybody got advice here? Any modules that might help?
Thanks,
Jim W.

Replies are listed 'Best First'.
Re: Fixing ill-formed XML
by davorg (Chancellor) on Dec 19, 2002 at 22:28 UTC

    If I was feeling pedantic I'd point out that there's no such thing as "ill-formed XML". It's either XML (in which case it's well-formed) or it isn't XML.

    But seriously, your best bet is to go back to the source and fix that. If it's a program that is in your control then fix the bugs in it. If it's an external data source then go back to the suppliers and point out that they aren't sending you XML as they said they would.

    Whoever decided that browsers would "do their best" with invalid HTML was responsible for the nightmare of non-validating web pages that we have at the moment. XML is supposed to (partly) be a reaction to that. You can't be lenient with non-XML that claims to be XML. That way lies madness.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Whoever decided that browsers would "do their best" with invalid HTML was responsible for the nightmare of non-validating web pages that we have at the moment.

      It may seem annoying sometimes, but regardless of how forgiving browsers may be, people (and machines) will always be fallible, and will always make mistakes. Would you prefer it if browsers just refused to show any page with the slightest HTML glitch?

      Ultimately, the choice to "do their best with invalid HTML" can be traced back to the de facto mantra of the IETF...

      Be liberal in what you accept, and
      conservative in what you send.
      

      For more background, consult RFC 791, RFC 1122, and RFC 1958.

        Would you prefer it if browsers just refused to show any page with the slightest HTML glitch?

        Absolutely! If browsers didn't need to parse broken HTML, emulate stupid behaviours of old browsers and all that stuff then they would be smaller, faster, more stable and much easier to develop for.

•Re: Fixing ill-formed XML
by merlyn (Sage) on Dec 19, 2002 at 22:49 UTC
Re: Fixing ill-formed XML
by mirod (Canon) on Dec 19, 2002 at 23:05 UTC

    I'll second davorg's advice: try to fix the source. If you try to fix it yourself you will have to make assumptions about what you get (starting with "even if tags are mixed there is just one way to make sense of it"), and one day these assumptions will not be true, your XML processing chain will be hosed, usually at the worst possible time... and you'll be in a lot of trouble.

    That said... maybe tidy can make sense of it and spit out proper XML, especially if you are working on some kind of HTML-based-not-quite-XML. Look especially at "Teaching Tidy about new tags!" in the doc.

      Thanks for the replies, everyone. I'm no XML purist, but I agree "that ain't XML".
      However, the pragmatist in me believes there are circumstances that could benefit from an "XML patch" utility... Suppose I'm using multiple applications that are well-intended but not terribly well-behaved (and out of my control) that contribute markup that incorrectly nests new elements with existing elements. How do I handle the tag soup?
      E.g., fairly well-defined nesting issues, as in my original note:
      <a><b> this </a></b>
      I know "heavily overlapped" elements are very problematic and have no straightforward solution:
      <a> this <b> that </a> the other </b>
      However, "trivially overlapped" elements should be much easier to handle:
      <a> this <b></a> that </b>
      I've begun looking at HTML-Tidy, and so far it can handle some obvious nesting and overlap issues, though it's clearly more HTML-oriented (with some XML support).

      More generally, whether or not HTML-Tidy is up to it, I seek a utility that can "fix" these well-defined nesting issues (ideally, it would use given tag priorities to indicate which elements should be ancestors of which) and trivially-overlapped elements. And if the errors are worse/unfixable, the utility gives up.
      Thanks again for your indulgence, mighty monks.
      - Jim W.
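
The "trivially overlapped" case above, `<a> this <b></a> that </b>`, can be sketched as follows: when an end tag arrives for an element that is not innermost, check whether every element opened inside it is still empty; if so, move those start tags to just after the end tag, and otherwise give up. This is only an illustration under stated assumptions, in Python for brevity rather than Perl. The name `fix_trivial_overlap` and the naive regex tokenizer are hypothetical, not part of HTML Tidy or any existing module, and tag priorities are not attempted.

```python
import re

# Naive tag matcher (an assumption of this sketch): no comments, CDATA,
# processing instructions, or '>' inside attribute values.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*>")


def fix_trivial_overlap(text):
    """Move still-empty start tags past a crossing end tag, or give up."""
    out = []      # pieces of the repaired document, in order
    stack = []    # (name, index of its start tag in out) per open element
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])   # any text counts as content
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            if not m.group(0).endswith("/>"):  # self-closing: nothing to open
                stack.append((name, len(out)))
            out.append(m.group(0))
            continue
        names = [n for n, _ in stack]
        if name not in names:
            raise ValueError("stray end tag </%s>" % name)
        p = len(names) - 1 - names[::-1].index(name)   # innermost open `name`
        above = stack[p + 1:]                          # elements opened inside it
        # Those elements are all still empty only if their start tags are
        # exactly the last len(above) pieces written to the output.
        if above and above[0][1] != len(out) - len(above):
            raise ValueError("heavy overlap at </%s>; giving up" % name)
        moved = out[len(out) - len(above):]
        del out[len(out) - len(above):]
        out.append("</%s>" % name)
        del stack[p:]
        for piece, (n, _) in zip(moved, above):        # re-open after the end tag
            stack.append((n, len(out)))
            out.append(piece)
    out.append(text[pos:])
    if stack:
        raise ValueError("unclosed elements: %s" % [n for n, _ in stack])
    return "".join(out)


print(fix_trivial_overlap("<a> this <b></a> that </b>"))
# <a> this </a><b> that </b>
```

The "heavily overlapped" example, `<a> this <b> that </a> the other </b>`, raises `ValueError` here, which matches the "gives up" behaviour asked for; well-formed input passes through unchanged.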
Re: Fixing ill-formed XML
by CountZero (Bishop) on Dec 19, 2002 at 22:56 UTC

    Isn't there a module called HTML-Tidy (or something similar) which can clean up ill-formed HTML?

    If I remember correctly, it could unmix mixed-up HTML tags. Perhaps it can give you some pointers on how to do it.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      CountZero++! However, HTML Tidy is not a Perl module, it's a stand-alone application that corrects common HTML errors. You can get a copy here. The technical term for such a piece of software is a "Lint". Don't ask me, I don't know where they got the name.

        Thank you, Ionizor. I see that there is a Perl wrapper for HTML-Tidy, so that must have been what got mixed up in my memory.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law