Preferred Methods (again)

by vek (Prior)
on Jan 17, 2002 at 00:45 UTC

vek has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

Here's another one of those "the code works" but "is there a more efficient way of doing it". My definition of efficiency in this particular case would be speed.

The task is to parse attributes from the Root element of some XML (without launching another parser, XML::Parser et al). Here's a brief snippet of the XML in question:

<?xml version="1.0"?>
<!DOCTYPE Root SYSTEM "/path-to/theDtd">
<Root Id="456990" Group="Navy" TimeStamp="20020116123446" Performance="Regular" Database="gyt98x" Project="x">
  <Request>
    <DataForce>Premium</DataForce>
  </Request>
</Root>

Using the following code to obtain the attributes:

sub parseRoot {
    my $xmlIn = shift;
    $_ = ${$xmlIn};
    my %rootAttr;
    while (m#(\S+?)="(\S+?)"#g) {
        ($1 eq 'version') && next;    #skip the xml version
        $rootAttr{$1} = $2;
    }
    return \%rootAttr;
}

Replies are listed 'Best First'.
Re: Preferred Methods (again)
by seattlejohn (Deacon) on Jan 17, 2002 at 01:15 UTC
    Are you trying to parse real XML, or just to build a tool that can handle some small subset of it? The reason I ask is that actual XML can contain a lot of oddities that your code above will not handle properly. For instance, XML allows attribute-value pairs to have spaces around the equal sign, like this:
    <Root id = "456990">

    It looks like your code would choke on that, though. Unless you are absolutely sure that your input data, now and forever, will not contain anything but the ultra-strict subset of XML that your code will support, I would urge you to use XML::Parser. (I believe its internals are written in C, so it's actually quite fast; have you benchmarked it on your specific documents to see if it will meet your needs?)
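
To make the point about whitespace concrete, here is a minimal sketch that runs the pattern from the question against a tag with and without spaces around the equal sign. The helper name `attrs_strict` and the sample tags are made up for illustration; only the regex comes from the original post.

```perl
use strict;
use warnings;

# The pattern from the original question, wrapped in a hypothetical helper.
sub attrs_strict {
    my ($xml) = @_;
    my %attr;
    while ($xml =~ m#(\S+?)="(\S+?)"#g) {
        $attr{$1} = $2;
    }
    return \%attr;
}

# Same attribute, written without and with spaces around '='.
my $tight = attrs_strict('<Root Id="456990">');
my $loose = attrs_strict('<Root Id = "456990">');

print "tight: ", join(',', sort keys %$tight), "\n";
print "loose: ", join(',', sort keys %$loose), "\n";
```

The second call finds no attributes at all, and it does so silently, which is arguably worse than a crash: the caller just gets an empty hash.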

    I know you mentioned your criterion for efficiency is execution speed, and that you don't want to use a separate parser, so maybe I should just butt out. It's just that years of working with HTML and more recently XML have taught me to be extremely cautious. Building a parser that really respects the specs is a non-trivial task, and I'd hate to see fragile code go into production and then have to be torn out later for maintenance, when a perfectly good module is already available to do the task you intend.

      Just building a quick tool to handle a small subset.

      The XML in question is generated by an in-house Java app written by a co-worker. The Root attribute format is hardcoded so for this application only I'm confident that the format will not change. I'm well aware of the pitfalls of regex parsing and would not (and in fact do not in other code) dream of doing that when parsing XML from another source.

      In my reply to juerd I mentioned that this code runs on a 'gateway' box. That box just accepts XML from a socket connection, archives the XML and then forwards it on to the database box for real parsing via XML::Parser (including the handling of base64 encoded print images & other fun stuff). Therefore the only thing this code needs to do is identify the type of XML message, as specified by the 'Group' attribute of Root, so that the XML can be archived correctly.
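
If only the 'Group' attribute is needed, one option is to isolate the Root start tag first and then look up a single attribute by name. The following is a rough sketch under the assumptions stated in this subthread (fixed, in-house format; double-quoted values; no '>' inside attribute values; no earlier comment containing `<Root`). The helper name `root_attribute` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of one attribute of the first
# <Root ...> start tag, or undef if the tag or attribute is not found.
sub root_attribute {
    my ($xml_ref, $name) = @_;

    # Capture everything between '<Root' and the next '>'.
    my ($tag) = $$xml_ref =~ m/<Root\b([^>]*)>/ or return;

    # Look for name="value", allowing whitespace around '='.
    my ($value) = $tag =~ m/\s\Q$name\E\s*=\s*"([^"]*)"/;
    return $value;
}

my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE Root SYSTEM "/path-to/theDtd">
<Root Id="456990" Group="Navy" TimeStamp="20020116123446">
  <Request><DataForce>Premium</DataForce></Request>
</Root>
XML

my $group = root_attribute(\$xml, 'Group');
print "Group: $group\n";
```

Because the search stops at the first Root start tag, the rest of the message (including any base64 payload) is never scanned for attributes.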

      The intent of my post (and I know I should have clarified it) was really to have people comment on the regex. I didn't mean to start a war over whether or not you should use a parser.
Re: Preferred Methods (again)
by mirod (Canon) on Jan 17, 2002 at 02:59 UTC

    So your requirements are that you want to parse XML (possibly produced in-house) and that you want it to be fast, especially as you are only interested in the root tag of the document.

    Would these be satisfied by using a full-fledged XML parser, but by only parsing until the first open tag?

    If yes then a pull parser is what you are looking for. Just pull the XML (including potentially a DTD, silly comments and tons of PIs) until you find the tag... and then stop! XML::Parser has a pull mode, but I found that XML::TokeParser is actually much easier to use in this case:

    #!/bin/perl -w
    use strict;
    use XML::TokeParser;

    #parse from open handle
    my $p=XML::TokeParser->new(\*DATA,Noempty=>1);

    #skip to <Root...>
    my $tag= $p->get_tag('Root');
    my $atts= $tag->[1]; # that's where the atts are, look at the docs
    while( my( $att, $value)= each %$atts)
      { print "$att -> $value\n"; }

    __DATA__
    <?xml version="1.0"?>
    <!DOCTYPE Root SYSTEM "/path-to/theDtd">
    <Root Id="456990" Group="Navy" TimeStamp="20020116123446" Performance="Regular" Database="gyt98x" Project="x">
      <Request>
        <DataForce>Premium</DataForce>
      </Request>
    </Root>

    2 side notes:

    • Waouh! I am impressed by the number of "use a real parser" answers to this node! Good! I can now officially retire from my position as XML Ayatollah extraordinaire ;--)
    • This is the third time this week that the easiest answer to an XML question was not one of the most popular XML modules (if "most popular" means SAX/DOM/XPath or even Twig). Process data from a file extracted from a relational table? XML::RAX was designed for this. Turn all element or attribute names to upper case? XML::PYX did it in one line. I take this as a testimony to the strength of TIMTOWTDI and as kind of a lesson against using only one tool for every job (and I see SAX being advocated as such a little too often for my liking). Know your modules (or ask here ;--) and live happy!
      Thanks for the suggestion.
Re: Preferred Methods (again)
by Juerd (Abbot) on Jan 17, 2002 at 01:04 UTC
    The preferred method for XML parsing is using a module. If you don't want to use a module, you can be sure that whatever method you'll be using is not the preferred.

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

      I should have clarified the post. The actual "meat and potatoes" parsing is indeed done via XML::Parser on a separate box. A gateway server receives the XML via a socket connection. This server has to archive the XML into a file. The filename is determined by a couple of the Root attributes. Hence only the Root attributes are needed by the gateway box - it has no need to parse the entire XML 'message'. The 'gateway' server then forwards the XML message, again over a socket connection, to the database box - all heavy-duty parsing/processing etc. is done there.
Re: Preferred Methods (again)
by perrin (Chancellor) on Jan 17, 2002 at 01:17 UTC
    Why loop?
    # trim to just the Root node
    $xmlIn =~ s/^.*(<Root.*?>).*$/$1/s;
    # grab the key/value pairs
    %rootAttr = ($xmlIn =~ m#(\S+?)="(\S+?)"#g);
    (Untested. Not certain that second regex returns a list. Might need to be in a loop after all.)
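
On the open question of whether the second regex returns a list: in list context, a `m//g` match with capture groups returns all captured substrings from all matches, so assigning it straight to a hash does work. A small sketch, using a made-up, already-trimmed tag as input:

```perl
use strict;
use warnings;

# Hypothetical, already-trimmed Root start tag.
my $tag = '<Root Id="456990" Group="Navy" Project="x">';

# In list context, m//g with capture groups returns every captured
# substring from every match, in order: (name, value, name, value, ...).
my %root_attr = $tag =~ m#(\S+?)="(\S+?)"#g;

print "$_ => $root_attr{$_}\n" for sort keys %root_attr;
```

So no loop is needed for the assignment itself. The loop in the original code is only required if something has to happen per match, such as skipping `version`.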
      <!-- notroot: <Root> -->
      Or
      <root><foo/><root><bar/></root></root> <!-- Yes, your regex will take the second <root> -->


      If parsing XML data could be done with a simple regex, those modules would probably not exist.
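
The failure mode described here is easy to reproduce. In the sketch below, the input is invented for illustration: a real Root element followed by a comment that happens to contain `<Root>`. The substitution is the one suggested earlier in this subthread, applied to a copy of the string.

```perl
use strict;
use warnings;

# Hypothetical input: a real <Root> followed later by a comment that
# happens to contain the text "<Root>".
my $xml = <<'XML';
<Root Group="Navy">
<Request/>
</Root>
<!-- notroot: <Root> -->
XML

# The leading .* is greedy, so with /s it consumes as much as it can and
# the capture ends up on the LAST "<Root...>" in the string.
(my $trimmed = $xml) =~ s/^.*(<Root.*?>).*$/$1/s;

print "trimmed: $trimmed\n";
```

The result is the `<Root>` from inside the comment, with no attributes. Making the leading `.*` non-greedy would pick the first occurrence instead, though it would still be fooled by a comment that appears before the real Root.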

      2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

        If parsing XML data could be done with a simple regex, those modules would probably not exist.

        It's possible. Just that it doesn't have any error checking.
        See: Parsing pseudo XML files

        BMaximus
        Get off your high horse about XML compliance. He gave a sample input format and asked how to grab pieces of it. If he changes the input format or wants it to handle broken input, he has to change the way he parses. That's true with an XML parser too.
Re: Preferred Methods (again)
by tradez (Pilgrim) on Jan 17, 2002 at 02:18 UTC
    I agree with the first response node. If you are trying to parse XML without using XML::DOM or something like it, it is probably not going to be the preferred manner. Don't be afraid of CPAN, my fellow monk! Using the DOM element methods and snippets like
    $node = getFirstNode();
    for traversing an XML doc is the closest thing we as programmers have to a protocol/standard for dealing with XML. tradez
Re: Preferred Methods (again)
by mirod (Canon) on Jan 17, 2002 at 03:47 UTC

    I guess the only comments I would have on the regexp would be to sprinkle it with \s* and that "(\S+?)" is a weird way to capture the content of an attribute:

    m#\s(\S+)\s*=\s*"([^"]*)"#g

      "(\S+?)" is a broken way to capture an attribute. (Hint: what happens if an attribute contains whitespace chars?) Consider using "([^"]+)" instead. Even better, consider profiling to make sure that using a proper XML-parsing module (whose author has already gone looking for this sort of bug) is enough of a slow-down to merit going to hard regex-based chunking.

      Update: Yeah, if you care about empty attributes (debatable; I usually don't), "([^"]*)" is the way to go. Thanks Matts!

      --
      :wq
        That would be "([^"]*)", otherwise you miss empty attributes!
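
For reference, here is a short sketch of the pattern suggested at the top of this subthread, `\s(\S+)\s*=\s*"([^"]*)"`, run against a made-up tag that exercises the three cases discussed: spaces around the equal sign, whitespace inside a value, and an empty value.

```perl
use strict;
use warnings;

# Hypothetical tag covering the edge cases raised in this subthread.
my $tag = '<Root Id = "456990" Note="two words" Empty="">';

my %attr;
while ($tag =~ m#\s(\S+)\s*=\s*"([^"]*)"#g) {
    $attr{$1} = $2;    # name => value; the value may be the empty string
}

printf "%s => '%s'\n", $_, $attr{$_} for sort keys %attr;
```

All three attributes are captured. With `"([^"]+)"` the `Empty` attribute would be skipped, and with the original `"(\S+?)"` the `Note` attribute would be lost as well. Single-quoted values, which XML also allows, are still not handled by any of these patterns.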
