While I agree with the other messages already in this thread that an "XML" document is not really XML if it doesn't have the close tags, I can offer one interesting solution beyond "roll everything yourself".

XML::Parser and all of its applications will rightfully barf on such a file. However, you may use HTML::Parser in "xml mode" to assist you with the rewrite.

Set up a default handler that just prints the text. Override the start-tag handler to print the text, but also push the tag name onto a stack. Override the end-tag handler to match the end tag against the top of the stack. If they don't match, print an end tag for whatever is on top, pop the stack, and repeat until they match. At EOF, pop the stack until it is empty, printing an end tag for each entry.

That way, the output is guaranteed to be properly nested. It can't handle nested similar tags (a second <item> that was meant as a sibling of the first will end up as its child), but in the absence of a DTD, that's probably the best you can do.
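
The steps above might look roughly like this. Treat it as a minimal sketch rather than a finished tool: the function name close_open_tags is made up for illustration, the output is collected in a string instead of printed so it is easy to check, and dropping an end tag that matches nothing on the stack is my own assumption, since the description above doesn't cover that case.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns $input with the missing end tags filled in.
sub close_open_tags {
    my ($input) = @_;
    my @stack;       # names of the elements that are currently open
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,

        # Default handler: copy everything else through untouched.
        default_h => [ sub {
            my ($text) = @_;
            $out .= $text if defined $text;
        }, 'text' ],

        # Start tag: copy it through and remember that it is open.
        start_h => [ sub {
            my ($tag, $text) = @_;
            $out .= $text;
            push @stack, $tag;
        }, 'tagname, text' ],

        # End tag: close whatever is still open above the matching start tag.
        end_h => [ sub {
            my ($tag, $text) = @_;
            # Assumption: an end tag with no open partner is silently dropped.
            return unless grep { $_ eq $tag } @stack;
            $out .= '</' . pop(@stack) . '>' while $stack[-1] ne $tag;
            pop @stack;
            $out .= $text;
        }, 'tagname, text' ],
    );
    $parser->xml_mode(1);    # case-sensitive names, <empty/> tags recognized
    $parser->parse($input);
    $parser->eof;            # flush any pending text before closing up

    # At EOF, close everything that is still open.
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print close_open_tags('<doc><item>one<note>two</item></doc>'), "\n";
```

Feeding it '<list><item>a<item>b</list>' shows the limitation mentioned above: the second item is closed inside the first one, because without a DTD there is no way to know that an item cannot contain another item.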

I once wrote a Parse::RecDescent tool into which you could feed an SGML-like DTD (with tag minimization), and it would automatically generate the right number of close tags in the right places by brute force. But it was far too slow for any serious work.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.


In reply to •Re: XML tags by merlyn
in thread XML tags by matth
