LexPl has asked for the wisdom of the Perl Monks concerning the following question:

Is there an easy way to indent an unindented XML file?

Of course, significant whitespace between two adjacent elements or in mixed content should not be obscured or lost.

The data are encoded in ISO-8859-1 and contain tons of XML entities, which are defined in a DTD with external entity declarations in a subdirectory.

Sample input:
<?xml version="1.0" encoding="ISO-8859-1"?><root><para><pnum>1</pnum><ptext>This is a sample of a very specific text <emph>called</emph> <term>description</term> which has 2 subtypes: <list><item><term>1.</term><p>precise</p></item><item><term>2.</term><p>fuzzy</p></item></list></ptext></para></root>

The output should look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
  <para>
    <pnum>1</pnum>
    <ptext>This is a sample of a very specific text <emph>called</emph> <term>description</term> which has 2 subtypes: <list>
        <item>
          <term>1.</term>
          <p>precise</p>
        </item>
        <item>
          <term>2.</term>
          <p>fuzzy</p>
        </item>
      </list>
    </ptext>
  </para>
</root>

Re: Indent XML data
by Anonymous Monk on Jan 22, 2025 at 16:34 UTC

    Have you tried xml_pp in the XML-Twig distribution?

      First of all, thanks to all colleagues for your helpful feedback!

      XML::Twig does work, but it doesn't preserve the order of attributes.

      I'm well aware that the order of attributes is normally not significant, since the content doesn't change. In my case it is an issue, because indentation is a preparatory step before diffing, and a changed attribute order would produce a large number of false diffs. What could I do?

      Besides that, XML entities in attribute values are simply deleted (and I could not use UTF-8 in my environment).

      How would that be done?

        xml_pp lists usage examples, linked from the module above...

Re: Indent XML data
by Fletch (Bishop) on Jan 22, 2025 at 16:32 UTC

    I'm sure someone's going to pipe up with one of the XML::* modules that I don't recall offhand, but you could use xmllint (which most libxml2 installs should come with) and its --format option to at least get started.

    $ xmllint --format --pretty 1 sample_txt.xml
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <root>
      <para>
        <pnum>1</pnum>
        <ptext>This is a sample of a very specific text <emph>called</emph> <term>description</term> which has 2 subtypes: <list><item><term>1.</term><p>precise</p></item><item><term>2.</term><p>fuzzy</p></item></list></ptext>
      </para>
    </root>

    The cake is a lie.

      I already tried that, but it doesn't work with my XML entities, although my DTD is in the same directory and the entity declarations are in a subdirectory (as specified in the DTD).
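
Since both tools mentioned in this thread go through a full XML parse (which is where attribute order, entity references and encoding can get touched), one possible workaround is to not parse at all. The following is only a minimal sketch of that idea, written in Python rather than Perl so that it runs with nothing but the standard library; the regex should port to Perl more or less as is. The names `TOKEN` and `indent_xml` and the two-space step are assumptions of this sketch, not part of any module discussed above. It splits the input into markup and text tokens and inserts a line break plus indentation only where two pieces of markup touch directly, so existing whitespace, entity references and attributes are copied through unchanged.

```python
import re

# Tokenizer: every token is either one piece of markup or a run of text.
# Nothing is decoded, so entity references such as &foo; are left as they are.
TOKEN = re.compile(
    r"""<!--.*?-->                        # comment
      | <!\[CDATA\[.*?\]\]>               # CDATA section
      | <\?.*?\?>                         # XML declaration or processing instruction
      | <!DOCTYPE(?:[^\[>]|\[.*?\])*>     # DOCTYPE, optionally with internal subset
      | <(?:[^>"']|"[^"]*"|'[^']*')*>     # start tag, end tag or empty-element tag
      | [^<]+                             # character data
    """,
    re.DOTALL | re.VERBOSE,
)


def indent_xml(text, step="  "):
    """Insert newline + indentation between directly adjacent markup tokens."""
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        # Something (e.g. a stray '<') was skipped by the tokenizer.
        raise ValueError("input contains markup this sketch cannot tokenize")

    out = []
    depth = 0
    prev_markup = False  # was the previous token markup rather than text?
    for tok in tokens:
        is_markup = tok.startswith("<")
        is_end = tok.startswith("</")
        # Only real start tags open a new nesting level.
        is_start = (
            is_markup
            and not is_end
            and not tok.startswith(("<?", "<!"))
            and not tok.endswith("/>")
        )
        if is_end:
            depth -= 1
        # Break the line only where no character separates two tags;
        # whitespace-only text between elements is treated as text and kept.
        if is_markup and prev_markup:
            out.append("\n" + step * depth)
        out.append(tok)
        if is_start:
            depth += 1
        prev_markup = is_markup
    return "".join(out)


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>'
        '<root><para b="2" a="1"><pnum>1</pnum>'
        "<ptext>A &myent; <emph>called</emph> <term>x</term> text</ptext>"
        "</para></root>"
    )
    print(indent_xml(sample))
```

For an ISO-8859-1 file, reading and writing with `encoding="iso-8859-1"` on both sides should round-trip every byte, so no conversion to UTF-8 is involved. One caveat: like the requested output above, this also adds whitespace inside mixed-content elements wherever two tags touch (for example between `</list>` and `</ptext>`), so it is only appropriate if that is acceptable for the data; and it assumes well-formed input, since nothing is validated.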