That's a very dangerous path you are taking here: trying to process XML with regexps. "On XML parsing" gives a bunch of reasons why you should not do it in general, but here is a little test:

<?xml version="1.0"?> <doc><elt>a regular elt with a > in it</elt> <pre> spaces are significant in this element as well as line returns</pre> <elt att="this is valid >" /> <!-- <elt>commented out</elt><elt>(I mean all of it)</elt> --> <elt2><sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]></sub></elt2> <elt><![CDATA[<toto><tata>booh<tutu>]]></elt> <elt>text with an <sub>embedded</sub> element</elt> </doc>

gives the following output:

<?xml version="1.0"?> <doc> <elt>a regular elt with a > in it</elt> <pre> spaces are significant in this element as well as line returns</pre> <elt att="this is valid >" /> <!-- <elt>commented out</elt> <elt>(I mean all of it)</elt> --> <elt2> <sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]> </sub> </elt2> <elt> <![CDATA[<toto> <tata>booh<tutu>]]> </elt> <elt>text with an <sub>embedded</sub> element</elt> </doc>

Your tool does OK in a lot of situations, except:

And I am not even talking about problems with documents in different encodings, which could trip your regexps...
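To make the danger concrete, here is a minimal sketch in Python (not the tool under discussion, and not Perl, just an illustration). It assumes a hypothetical regexp formatter, `naive_break`, that puts a newline between every `>` and `<`, and compares what the regexp sees with what a real parser reports on a cut-down version of the test document:

```python
import re
import xml.etree.ElementTree as ET

# Cut-down version of the test document above.
DOC = (
    '<?xml version="1.0"?>'
    '<doc><elt att="this is valid >" />'
    '<!-- <elt>commented out</elt> -->'
    '<elt><![CDATA[<toto><tata>booh<tutu>]]></elt></doc>'
)

def naive_break(xml_text):
    """Hypothetical regexp formatter: newline between every '>' and '<'."""
    return re.sub(r">\s*<", ">\n<", xml_text)

def regex_elt_count(xml_text):
    """Count <elt> start tags the way a regexp sees them."""
    return len(re.findall(r"<elt\b", xml_text))

def parsed_elts(xml_text):
    """Return the real <elt> children of the root, as reported by a parser."""
    return ET.fromstring(xml_text).findall("elt")

if __name__ == "__main__":
    # The regexp also counts the commented-out element; the parser does not.
    print(regex_elt_count(DOC), len(parsed_elts(DOC)))
    # The CDATA section is character data, so breaking it changes the content.
    print(repr(parsed_elts(DOC)[1].text))
    print(repr(parsed_elts(naive_break(DOC))[1].text))
```

The regexp cannot tell a commented-out tag or a `<` inside a CDATA section from real markup, so it both miscounts elements and alters character data that a parser would have left alone.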

The only safe way to break an XML document without knowing its DTD is to put the breaks in the only place where they cannot be significant: within the tags!

That might not be pretty but it is readable:

<?xml version="1.0"?>
<doc
><elt
att="val">you can also break between the tag and the attribute and between attributes</elt
></doc
>
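A quick way to check this claim is to parse a document with and without breaks inside the tags and compare the results. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample documents and the helper name `same_content` are made up for illustration. It relies on the fact that XML 1.0 allows whitespace before the closing `>` of a tag and between attributes (though not directly after the opening `<`):

```python
import xml.etree.ElementTree as ET

# Illustrative document, all on one line.
ORIGINAL = '<doc><elt att="val" other="x">some text</elt></doc>'

# Same document with line breaks placed only inside the tags:
# before the closing '>' and between attributes.
BROKEN_IN_TAGS = '''<doc
><elt
  att="val"
  other="x"
>some text</elt
></doc
>'''

def same_content(xml_a, xml_b):
    """Compare two documents after parsing and re-serializing them."""
    return ET.tostring(ET.fromstring(xml_a)) == ET.tostring(ET.fromstring(xml_b))

if __name__ == "__main__":
    print(same_content(ORIGINAL, BROKEN_IN_TAGS))
```

Because the whitespace sits inside the markup rather than in the character data, the parser drops it and both documents carry the same content.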

By the way, there are a number of modules on CPAN that do pretty printing of XML documents, such as XML::Handler::YAWriter or XML::Filter::Reindent but I have not tested them and from reading the docs I am not sure they are what you are looking for (they are probably too slow and quite complex). But at least they would read the XML properly.


In reply to Re: You have xml files where this formatting tool does not work? by mirod
in thread You have xml files where this formatting tool does not work? by LupoX
