I would add one caveat: if you need to handle a great many files and the file format is fixed, regular expressions will be much faster than parsing the XML. I got burned by going the virtuous route on a set of 60K Reuters wire stories, where there was an order-of-magnitude speed difference between regular expressions and XML::Parser.
I have found this to be a tradeoff with many XML tools - the right way to do it tends also to be slow, resource intensive, or both. XSLT comes to mind. It is frustrating, but hopefully a temporary growing pain.
As far as speed goes, I think you'll find that XML::LibXML is the fastest XML parser on the block and somewhat preferable to XML::Parser.
--
vek
--
Hear, hear.
The few times I have had to deal with XML, I have found that people tend to pay
lip service to it, and manage to emit badly formed XML far more often than they
get it right, with lone & characters in text being the worst offense. In order to
use XML parsing tools, you first have to run a cleanup script over the received
data so that the tools don't curl up and die.
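As a rough illustration of the kind of cleanup pass meant here, the sketch below escapes any & that does not already start an entity or character reference. The helper name escape_bare_ampersands is made up for this example, and it assumes the data contains no CDATA sections or comments, where a literal & is legal and should be left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): turn every '&' that
# does not begin an entity or character reference into '&amp;'.
# Assumes no CDATA sections or comments, where a bare '&' is allowed.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Existing references (&amp; and &#38;) are left untouched;
# the lone '&' in "Smith & Sons" and in "R&D" gets escaped.
print escape_bare_ampersands("<co>Smith & Sons &amp; Co &#38; R&D</co>\n");
```

Run over the incoming data before handing it to a parser, something like this takes care of the most common complaint, though it will not rescue XML that is broken in other ways.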
Furthermore, the XML in question is usually being emitted from an old program
that has been modified to produce XML today, when in the past it was producing
plain old data. By extension, it means that the XML you get to deal with has a
rigid structure, not at all free-form as the spec might make you think.
I would hazard a bet and say that the majority of XML used is to get one system
to speak to another system. I would guess that the number of instances where one
system has to deal with incoming XML from multiple sources is quite small in
comparison.
If you are in the position of getting data from one system to another,
you usually have control over how and when the format is changed. When you have that
much control over the environment, simple methods suffice.
For instance, to paraphrase some old code I have, you can get a lot of mileage
out of Perl's wonderful ... operator (not to be confused with ..).
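#!/usr/bin/perl
use strict;
use warnings;

# Keep only the lines from each <emp> through its closing </emp>;
# the flip-flop stays true from the first pattern to the second, inclusive.
my @stuff = grep { /<emp>/ ... /<\/emp>/ } <DATA>;
print @stuff;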
__DATA__
<profile>
<emp>
<name>Mahesh</name>
<age>24</age>
<address>New york</address>
<desig>Developer</desig>
</emp>
</profile>
<junk>
<morejunk />
</junk>
<profile>
<emp>
<name>Mahesh2</name>
<age>242</age>
<address>New york2</address>
<desig>Developer2</desig>
</emp>
</profile>
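With the data above, where <emp> and </emp> sit on separate lines, .. and ... select the same lines. The difference only shows when both patterns can match the same line: .. tests the closing pattern on the line that opened the range, while ... waits until the next line. A small sketch, using made-up input lines, to show the effect:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input: the first record opens and closes on a single line.
my @lines = (
    "<emp><name>A</name></emp>\n",
    "<junk />\n",
    "<emp>\n",
    "<name>B</name>\n",
    "</emp>\n",
    "<trailer />\n",
);

# '..' checks the end pattern on the opening line, so the one-line
# record closes its own range and <junk /> is skipped.
my @two = grep { /<emp>/ .. /<\/emp>/ } @lines;

# '...' does not check the end pattern until the following line, so the
# range opened by the one-line record runs on to the next </emp>.
my @three = grep { /<emp>/ ... /<\/emp>/ } @lines;

printf "..  kept %d lines\n", scalar @two;
printf "... kept %d lines\n", scalar @three;
```

Here .. keeps 4 lines and ... keeps 5, the extra one being <junk />. Which operator is right depends on whether a record can ever open and close on one line, which is exactly the sort of thing you know when you control the format.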
You might ask what happens when a new element is added. Well, surprise!
You will be obliged to modify the script that parses the XML too, if you want
to do anything with it.
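To make that concrete, here is a minimal sketch of "doing something" with the extracted lines: collecting each <emp> record into a hash. The field list and the parse_emps name are assumptions for illustration, based on the sample data above, and the code relies on each element sitting on its own line. The field names are hard-coded, which is precisely why a new element means touching the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Field names are hard-coded, assuming the fixed layout of the sample data.
my @fields = qw(name age address desig);

# Hypothetical helper: build one hash per <emp> ... </emp> block.
sub parse_emps {
    my (@lines) = @_;
    my (@emps, %cur);
    for my $line (grep { /<emp>/ ... /<\/emp>/ } @lines) {
        for my $f (@fields) {
            $cur{$f} = $1 if $line =~ m{<$f>(.*?)</$f>};
        }
        if ($line =~ m{</emp>}) {
            push @emps, {%cur};    # store a copy of the finished record
            %cur = ();
        }
    }
    return @emps;
}

my @emps = parse_emps(
    "<profile>\n", "<emp>\n", "<name>Mahesh</name>\n", "<age>24</age>\n",
    "<address>New york</address>\n", "<desig>Developer</desig>\n",
    "</emp>\n", "</profile>\n",
);
print "$_->{name} ($_->{desig}), age $_->{age}\n" for @emps;
```

An XML parser would need the same kind of change in the code that consumes its output, so the simple approach costs nothing extra on that front.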
Don't get me wrong, I am a big fan of XML, but I think it suffers from too
much hype. People seem to be happy to use it even when simpler methods
exist.
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'