markww has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I made a small script which generates some tag-like output (wouldn't call it xml) and it looks like this, all on one ugly line:
<circle>wonderful<apple color="red">very good</apple></circle>
After the output is written to a file, I wanted to open it again and format it so it's more readable (i.e. 'beautify' it):
<circle>
    wonderful
    <apple color="red">
        very good
    </apple>
</circle>
I tried using the simple XML parser, but it reorders my tags in alphabetical order, and it also replaces my "id" attributes with "name"! I just need some basic indentation support.

I found on CPAN: http://search.cpan.org/~bjoern/SGML-Parser-OpenSP-0.994/lib/SGML/Parser/OpenSP.pm which I *think* will parse my sgml stuff and maybe dump it nicely in the indented format I want, but I cannot get it to build on my poor mac.

Anyone know of an easier way, or if that library above is the one to go with?

Thanks

Replies are listed 'Best First'.
Re: Beautifying some SGML?
by JavaFan (Canon) on Dec 04, 2008 at 13:13 UTC
    Note that your reformatting isn't necessarily equivalent: in the reformatted example you have introduced whitespace, which may be significant. It can certainly be significant in HTML. Compare:
    Hello, <b>w</b>orld
    with
    Hello, <b> w </b> orld
    The former will typically be rendered by a browser as two words; the latter as three.

    A very simplistic beautifier is: s!(/?>)!\n$1!g. It doesn't do indentation, but it does fold long lines and, because it inserts the newline inside a tag, it doesn't introduce whitespace in the data. Of course, if you have > symbols in PCDATA content (SCRIPT and STYLE elements, and attribute values), this may break things.
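
    As a rough illustration of the substitution above, here is a minimal sketch applied to the sample line from the question. The sub name fold_tags is made up for this example; only the s!!! expression comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): put a newline just before each
# tag's closing ">" (or "/>"), i.e. inside the tag itself, so that no
# whitespace is added to the text between tags.
sub fold_tags {
    my ($text) = @_;
    $text =~ s!(/?>)!\n$1!g;
    return $text;
}

print fold_tags('<circle>wonderful<apple color="red">very good</apple></circle>'), "\n";
# Prints:
# <circle
# >wonderful<apple color="red"
# >very good</apple
# ></circle
# >
```

    The output is not pretty, but every line break sits inside a tag, so the character data is untouched.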

Re: Beautifying some SGML? (XML)
by toolic (Bishop) on Dec 04, 2008 at 14:05 UTC
    In general, to pretty-print XML, I have used XML::Tidy and xml_pp (which is part of XML::Twig). Both Do What I Want in most cases.

    However, I have discovered a corner case in which neither DWIW: for one type of complex element, namely elements that contain both other elements and text, I can't figure out how to get either module to indent the child element.

    Your example ML falls into this category. If I use xml_pp on your text, it does not indent the "apple" tags, as I would expect. Perhaps the module's author could comment on this.

      It is a feature. As JavaFan mentioned before, adding whitespace is a bit risky, as you have to make sure that it is non-significant. So xml_pp adds line returns and indentation in-between tags only if there is no other data in the element. This is not guaranteed to be perfectly safe, but as the DTD is generally unavailable, that's about the best it can do. As soon as there is non-whitespace data in the element, no indentation is added, because according to the XML spec the whitespace IS significant in that case.

      It would be possible to add extra options to control the behavior of xml_pp more precisely, but that would be quite tricky and error-prone. At this point it's easier to write a custom pretty printer for your data. At least it's easier for me! ;--)

Re: Beautifying some SGML?
by mirod (Canon) on Dec 04, 2008 at 14:52 UTC

    If your data is a tag soup, you won't be able to use XML or SGML tools. Actually, SGML tools are pretty dangerous in that case, because they might try harder to cope with your data and infer missing tags, for example, which might not be what you want at all. Plus you would need a DTD, which you don't mention having.

    In your case, your best bet is a simplistic regexp-based tool. I'd try something like this one-liner:

    perl -p -e's{<(.)}{ $ln= $level ? "\n" : ""; if( $1 eq "/") { $level--; } ; $indent= "  " x $level; if( $1 ne "/") { $level ++ }; $ln . $indent . $&; }eg' <filename>

    It will work only if you don't have "compact" tags (<foo/>), CDATA sections, or any oddity of that sort. But it should be OK otherwise. As I said, you don't have XML data, so you don't get to play with all the cool XML toys ;--(
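
    For readers who find the one-liner hard to follow, the same logic can be sketched as a small script. This is an unpacked restatement, not a different algorithm. The sub name indent_tags is invented for this example, and it assumes well-balanced tags with no <foo/> or CDATA, as noted above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): start a new line before every
# tag except the very first one, and indent it by the current nesting depth.
sub indent_tags {
    my ($text) = @_;
    my $level = 0;    # current nesting depth
    $text =~ s{<(.)}{
        my $newline = $level ? "\n" : "";    # no line break before the first tag
        $level-- if $1 eq "/";               # a closing tag steps back out first
        my $indent = "  " x $level;
        $level++ if $1 ne "/";               # an opening tag nests what follows
        $newline . $indent . $&;
    }eg;
    return $text;
}

print indent_tags('<circle>wonderful<apple color="red">very good</apple></circle>'), "\n";
# Prints:
# <circle>wonderful
#   <apple color="red">very good
#   </apple>
# </circle>
```

    Note that text stays on the line of the tag that precedes it, and that the inserted newlines and indentation do land in the character data, so this is only appropriate when that whitespace is not significant.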

Re: Beautifying some SGML?
by JadeNB (Chaplain) on Dec 05, 2008 at 03:46 UTC
    Others have mentioned Perl-based solutions, and also warned about the danger of using XML or SGML tools on data that isn't guaranteed to be either; but, if you know you're producing XML and aren't too committed to using Perl, would xmllint do what you want?