That's a very dangerous path you are taking here: trying to process XML with regexps. On XML parsing gives a bunch of reasons why you should not do it in general, but here is a little test:

<?xml version="1.0"?>
<doc><elt>a regular elt with a > in it</elt>
<pre>  spaces are significant in this element
       as well as line   returns</pre>
<elt att="this is valid >" />
<!-- <elt>commented out</elt><elt>(I mean all of it)</elt> -->
<elt2><sub>if a \n is inserted before the sub element then
the document is still well-formed but not valid anymore,
as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]></sub></elt2>
<elt><![CDATA[<toto><tata>booh<tutu>]]></elt>
<elt>text with an <sub>embedded</sub> element</elt>
</doc>

gives the following output:

<?xml version="1.0"?>
 
<doc>
        <elt>a regular elt with a > in it</elt>
        <pre>  spaces are significant in this element       as well as line   returns</pre>
        <elt att="this is valid >" />
        <!-- <elt>commented out</elt>
        <elt>(I mean all of it)</elt> -->
        <elt2>
                <sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]>
                </sub>
        </elt2>
        <elt>
                <![CDATA[<toto>
                <tata>booh<tutu>]]>
                </elt>
                <elt>text with an <sub>embedded</sub> element</elt>
</doc>

Your tool does OK in a lot of situations, except:

the comment is reformatted too, which is no big deal
the formatting in the pre element is broken, which can be very annoying
the CDATA section breaks the formatting: a line break is inserted inside it and the indentation after it is off
potentially the most dangerous, depending on how you work with XML: the valid original document is now invalid, as the \n before the <sub> element in elt2 is significant. This kind of error can be a nightmare to track down.
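
To make these failure modes concrete, here is a minimal sketch in Python (chosen only for brevity; the regexp and the helper name naive_tags are my own illustration, not the code of the tool under discussion). It runs a naive "anything between < and > is a tag" regexp and a real parser over three of the constructs from the test document.

```python
import re
import xml.etree.ElementTree as ET

# Three constructs taken from the test document: a ">" inside an
# attribute value, a comment containing markup, and a CDATA section.
DOC = (
    '<doc><elt att="this is valid >" />'
    '<!-- <elt>commented out</elt> -->'
    '<elt><![CDATA[<toto><tata>booh<tutu>]]></elt></doc>'
)

def naive_tags(text):
    """Hypothetical regexp tokenizer: treat everything from < to the next > as a tag."""
    return re.findall(r'<[^>]+>', text)

tags = naive_tags(DOC)

# A real parser knows about attribute quoting, comments and CDATA.
root = ET.fromstring(DOC)
elts = root.findall('elt')

print(tags)                    # includes "tags" cut short or found inside the comment and CDATA
print(len(elts))               # number of real elt elements
print(repr(elts[0].get('att')))
print(repr(elts[1].text))      # the CDATA content, as plain text
```

The regexp cuts the first tag short at the > inside the attribute value and reports the text inside the comment and the CDATA section as markup, while the parser sees exactly two elt elements.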

And I am not even talking about problems with documents in different encodings, which could trip your regexps...
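
A small sketch of the encoding problem, again in Python and under my own assumptions (the sample document and the helper names regexp_sees_doc and parsed_text are made up for illustration): the same document is stored once as ISO-8859-1 and once as UTF-16, and a byte-level regexp only recognises one of them, while a parser reads the encoding declaration and handles both.

```python
import re
import xml.etree.ElementTree as ET

BODY = '<doc>caf\u00e9</doc>'

# The same document serialised in two encodings, each with a matching declaration.
latin1 = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY).encode('iso-8859-1')
utf16 = ('<?xml version="1.0" encoding="UTF-16"?>' + BODY).encode('utf-16')

def regexp_sees_doc(raw):
    """Hypothetical byte-level check: does a regexp find the <doc> start tag?"""
    return re.search(rb'<doc>', raw) is not None

def parsed_text(raw):
    """Let the parser decode the bytes according to the XML declaration."""
    return ET.fromstring(raw).text

for name, raw in (('latin1', latin1), ('utf16', utf16)):
    print(name, regexp_sees_doc(raw), repr(parsed_text(raw)))
```

In the UTF-16 bytes every character takes two bytes, so the pattern written for single-byte text never matches, although the document is the same.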

The only safe way to break an XML document without knowing its DTD is to put the breaks in the only place where they cannot be significant: within the tags!

That might not be pretty but it is readable:

<?xml version="1.0"?>
<doc
><elt
att="val">you can also break between the tag and the attribute and between attributes</elt></doc>
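
As a quick check of that claim, here is a sketch in Python (the sample strings and the helper name reserialized are assumptions of mine). It puts line breaks only inside tags, after the element name, between attributes and before the closing >, and compares what a parser sees with the unbroken document and with a version that has breaks between elements.

```python
import xml.etree.ElementTree as ET

compact = '<doc><elt att="val" b="2">text</elt></doc>'

# Line breaks only inside tags: after the name, between attributes, before ">".
inside_tags = '<doc\n><elt\natt="val"\nb="2"\n>text</elt\n></doc\n>'

# Line breaks between elements, as a typical pretty-printer would add them.
between_elts = '<doc>\n<elt att="val" b="2">text</elt>\n</doc>'

def reserialized(xml_text):
    """Parse and write back out, so only what the parser actually saw remains."""
    return ET.tostring(ET.fromstring(xml_text), encoding='unicode')

print(reserialized(inside_tags) == reserialized(compact))    # breaks inside tags are invisible
print(reserialized(between_elts) == reserialized(compact))   # breaks between elements become content
print(repr(ET.fromstring(between_elts).text))                # the extra text node under doc
```

Whitespace inside a tag is discarded by the parser, while whitespace between elements is reported as character data, which is exactly what can matter for pre-like elements or validation.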

By the way, there are a number of modules on CPAN that pretty-print XML documents, such as XML::Handler::YAWriter or XML::Filter::Reindent, but I have not tested them, and from reading the docs I am not sure they are what you are looking for (they are probably too slow and quite complex). But at least they would read the XML properly.

In reply to Re: You have xml files where this formatting tool does not work? by mirod
in thread You have xml files where this formatting tool does not work? by LupoX
