Re: X(ht)ML Source Formatting
by diotalevi (Canon) on Aug 13, 2003 at 18:45 UTC
I once wrote a batch XML (and XHTML is XML) indenter and got some nice replies pointing to even better tools. See Light batch XML indenter for the scoop.
++ thank you. That seems like a nippy way of doing it (compared to creating an HTML parse tree, anyway). It might also help me get over my phobia of the /x regex modifier.
However, I'm going to be an XHTML pedant and point out that there are a few things it doesn't handle correctly. By "correctly", I mean that the end result should be identical to the start from an XML parser's point of view, and here it isn't.
- It needs to leave CDATA sections alone. In XHTML Strict, script and style elements may wrap their content in a CDATA section to mark it as unparsed character data. This is useful because it allows you to have '<' and '>' in your scripts (e.g. the JavaScript comparison operators) and styles (CSS child selectors) without having to escape them.
- It shouldn't touch the whitespace within PRE elements at all; inside these, whitespace should be taken literally, exactly as it appears in the file, and not closed up. For example, this will come out wrong:
<pre> <span id="foo">foo</span> </pre>
Sorry, I don't want to detract from a really nice piece of work; I can see that it would definitely be useful in more data-oriented XML settings. However, it's not really accurate enough for me to use in a production setting.
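One way to address both points might be to set the sensitive regions aside before any reformatting happens. The sketch below is not taken from the indenter under discussion: `reformat_outside_protected` is a hypothetical helper, and it assumes that CDATA sections and pre elements are never nested and that a pre element always has an explicit end tag. It splits the document on those regions and hands only the remaining markup to whatever reformatting callback you supply.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $reformat only to the parts of $xhtml that lie
# outside CDATA sections and <pre> elements.
sub reformat_outside_protected {
    my ($xhtml, $reformat) = @_;

    # split() with a capturing group also returns the delimiters, so the
    # list alternates: ordinary markup, protected chunk, ordinary markup, ...
    my @chunks = split m{(<!\[CDATA\[.*?\]\]>|<pre\b[^>]*>.*?</pre\s*>)}s, $xhtml;

    my $out = '';
    for my $i (0 .. $#chunks) {
        # Odd indexes hold the protected chunks; copy them through untouched.
        $out .= $i % 2 ? $chunks[$i] : $reformat->($chunks[$i]);
    }
    return $out;
}

# Usage: collapse runs of whitespace everywhere except inside <pre> and CDATA.
my $squeeze = sub { my $s = shift; $s =~ s/\s+/ /g; return $s };
print reformat_outside_protected(
    qq{<p>a   b</p><pre> <span id="foo">foo</span> </pre>}, $squeeze
), "\n";
```

A regex split like this is still not a parser, so it shares the general weakness of the quick approach; it only keeps the two cases listed above from being rewritten.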
cheers
ViceRaid
Pls forgive if I don't understand the problem domain too well, but XML::Twig may suit(?)
Oh for sure. Definitely. I didn't even attempt to "parse" the XML or do anything besides handle the most generic of tasks. I don't think it's even valid to talk about CDATA or PRE elements or anything that requires knowledge of actual XML or XHTML. In this case the *only* things it respects are the '<', '/' and '>' characters. It was one of those things a person writes when it's 9pm, you're still at work and you've still got to pack for a trip tomorrow at 7am. An ugly scene all around.
Re: X(ht)ML Source Formatting
by revdiablo (Prior) on Aug 13, 2003 at 19:04 UTC
I'm not sure if you want to home-brew something simple (which would give you exactly the results you wanted, rather than having to endlessly tweak someone else's output), but I'll describe how I would approach the problem.
I have, in the past, written a script or two to get some XML/XHTML/HTML formatted more acceptably. The basic algorithm I use is pretty simple. First, just tokenize the text stream based on tags, and keep a simple counter of what level we're at. An opening tag increments the counter, a closing tag decrements it. Then you print that tag, prefixed by the appropriate amount of indentation. Of course, with plain HTML you must be careful of tags that commonly have no closing tag (such as p, li, img), but with well-formed XML/XHTML you do not have to worry as much (other than to watch out for single tags that open and close themselves, like <br/>).
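The algorithm described above could be sketched roughly as follows. This is an illustration only, not the code from either post: `indent_xml` is a hypothetical name, the input is assumed to be well-formed, and the tokenizing regex assumes no '>' ever appears inside a tag, comment or attribute value. It also puts text on its own line, which changes whitespace, so it is subject to the CDATA and PRE caveats raised earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: re-indent well-formed XML/XHTML by tracking tag depth.
sub indent_xml {
    my ($xml, $indent) = @_;
    $indent = '  ' unless defined $indent;
    my $depth = 0;
    my @lines;

    # Tokenize the stream into tags and the text between them.
    for my $token ($xml =~ /(<[^>]*>|[^<]+)/g) {
        next if $token =~ /^\s*$/;               # skip whitespace-only text

        if ($token =~ m{^</}) {                  # closing tag: step out first
            $depth-- if $depth > 0;
            push @lines, ($indent x $depth) . $token;
        }
        elsif ($token =~ m{^<[!?]} || $token =~ m{/>$}) {
            # Comments, declarations and self-closing tags such as <br/>
            # leave the depth unchanged.
            push @lines, ($indent x $depth) . $token;
        }
        elsif ($token =~ /^</) {                 # opening tag: print, step in
            push @lines, ($indent x $depth) . $token;
            $depth++;
        }
        else {                                   # text: trim and print
            (my $text = $token) =~ s/^\s+|\s+$//g;
            push @lines, ($indent x $depth) . $text;
        }
    }
    return join("\n", @lines) . "\n";
}

print indent_xml('<a><b>hi</b><br/></a>');
```

Note the ordering: a closing tag lowers the counter before it is printed, while an opening tag is printed before the counter is raised, so matching tags line up in the same column.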
Update: scooped by diotalevi. I guess that's what happens when you have to walk away for 5 minutes to talk to your boss. :) I should note that diotalevi's code pretty much does exactly what I describe here, so maybe my post will still be a useful plain-English explanation? (Grasping at straws here.)
Thanks - I wouldn't mind home-brewing a tweak to HTML::Element, and I've already got all the bits nicely tokenised for me so I don't have to worry about that. As I mentioned above in reply to diotalevi, it's not quite as simple as we'd wish; on the other hand, you've made me think it's not quite as hard as I'd feared. I guess I might have some trouble justifying spending my work hours scratching my source-code formatting neuroses, but this problem's bitten me now...
cheers
ViceRaid
Re: X(ht)ML Source Formatting
by koku (Initiate) on Aug 15, 2003 at 02:54 UTC
# syntax:
# $h->as_HTML($entities, $indent_char, \%optional_end_tags)
print $h->as_HTML('<>', '  ', {});
Output is indented by specifying $indent_char. In this case the HTML is indented with two spaces.
ko
Yeah, the as_HTML method of HTML::Element does produce nicely formatted output, but the output is HTML rather than XHTML (which is HTML expressed in XML). Take a look at the W3C MarkUp pages for details of the differences. It's things like having to use lower-case names, quote attribute values, and close every tag, as in:
<img src="pic.gif" alt="nice picture" />
instead of:
<IMG src=pic.gif alt="nice picture">
which is acceptable HTML, but not XHTML.
HTH
ViceRaid
Sorry, saw that you were using HTML::TreeBuilder, and assumed you were mainly concerned with indenting.
Actually, the modules take care of most of what you want, including:
- making sure there are no improperly nested elements
- automatically lowercasing element and attribute names
- closing all tags, if you pass in an empty hashref to the as_HTML() method (\%optional_end_tags)
- quoting attributes
But you still have to deal with closing empty elements like <br>, which you could fix like this (you'll have to play around with trying to fix <img> and others):
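use strict;
use warnings;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
# parse_file returns the tree object itself, or undef if the file can't be opened
my $html = $root->parse_file('a.htm') or die "Cannot parse a.htm: $!";
my @br = $html->look_down('_tag','br');
foreach my $br (@br) {
    # Build a fresh literal for each <br>: a node can only sit in one place in the tree
    my $literal = HTML::Element->new('~literal','text' => '<br />');
    $br->replace_with($literal)->delete;
}
print $html->as_HTML('<>', ' ',{});
$html->delete; # free the tree when done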
The line with $literal is kind of a kludge. I don't know if it will break the tree (it shouldn't, because these types of elements should be empty).
HTH - ko
Re: X(ht)ML Source Formatting
by uwevoelker (Pilgrim) on Aug 15, 2003 at 09:47 UTC
Hello,
for XML-formatting I use XML::Twig
use XML::Twig;
# 'indented' asks XML::Twig to put each element on its own indented line
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse($text);
$text = $twig->sprint;
It works great!