XML Parsing

JoeJaz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: XML Parsing by Enlil (Parson) on Apr 24, 2004 at 07:48 UTC
I would really recommend a module for this sort of thing. For example, using XML::Simple:

    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    my $string = do { local $/; <DATA> };
    my $ref = XMLin($string);
    print Dumper $ref;

    my $num_events = @{ $ref->{EVENT} };
    print "There are $num_events events listed\n";

    __DATA__
    <ROOT>
      <EVENT>
        <NAME>test2</NAME>
        <LOCATION>iwu</LOCATION>
        <TIME>now</TIME>
        <DATE>today</DATE>
        <PRIORITY>interest</PRIORITY>
        <ATTENDEES>a lot</ATTENDEES>
        <DESCRIPTION> descrip</DESCRIPTION>
      </EVENT>
      <EVENT>
        <NAME>test3</NAME>
        <LOCATION>hi</LOCATION>
        <TIME>joe</TIME>
        <DATE>how</DATE>
        <PRIORITY>interest</PRIORITY>
        <ATTENDEES>are</ATTENDEES>
        <DESCRIPTION> </DESCRIPTION>
      </EVENT>
    </ROOT>

As for the code you posted, you need to add the /s modifier, or the dots will not match newline characters (the match will not cross lines). If it were me doing a quick hack I would probably still use XML::Simple, but to change your regex to capture multiple matches you might try something like the following:

    my @events;
    while ( $page_body =~ /<EVENT>(.*?)<\/EVENT>/sg ) {
        push @events, $1;    # note: $1 might have zero length
    }

Also note that it is pointless to have .*? at the very start of a regular expression: it causes a lot of needless backtracking and never really matches anything, because a regex looks for its pattern anywhere in the string (unless it is anchored).

-enlil
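To make the point about the /s modifier concrete, here is a minimal sketch using only core Perl. The sample data and the variable names `@without_s` and `@with_s` are made up for illustration; `$page_body` stands in for whatever the original script read. It is a quick hack in the spirit of the reply above, not a substitute for a real XML module.

```perl
use strict;
use warnings;

# Sample input standing in for $page_body (made up for this sketch).
my $page_body = <<'XML';
<ROOT>
<EVENT>
<NAME>test2</NAME>
</EVENT>
<EVENT>
<NAME>test3</NAME>
</EVENT>
</ROOT>
XML

# Without /s, "." cannot match a newline, so a multi-line EVENT never matches.
my @without_s = $page_body =~ /<EVENT>(.*?)<\/EVENT>/g;

# With /s, "." also matches newlines; /g in list context returns every capture.
my @with_s = $page_body =~ /<EVENT>(.*?)<\/EVENT>/sg;

printf "without /s: %d match(es)\n", scalar @without_s;
printf "with /s:    %d match(es)\n", scalar @with_s;
```

In list context the /g match returns all captures at once, which is an alternative to the while loop when you only need the captured text.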
2Re: XML Parsing by jeffa (Bishop) on Apr 24, 2004 at 14:12 UTC
Just a little nitpick. Replace this:

    my $string = do { local $/; <DATA> };
    my $ref = XMLin($string);

With this:

    my $ref = XMLin(\*DATA);

jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 10:55 UTC
Thanks a lot for your advice. I will study the XML::Simple module and see what it has to offer. The note about the .*? is good to know. I surely don't want my code needlessly using CPU cycles. That code snippet above is precisely what I was trying to do. Thanks again. Joe
Re: XML Parsing by mirod (Canon) on Apr 24, 2004 at 09:59 UTC
Oh boy! One of those! Again! ;--)

First, let's start with the basics: no, you won't write a proper XML parser using regexps. See On XML parsing for some of the things that can, and will, trip your code, and why you shouldn't call what you write an XML Parser if it isn't one.

Then why can't you install a new module?? Don't you think your time would be better spent learning how to install a module, rather than writing a half-baked sorta-XML parser?

If you are worried about distributing the code to people who won't know how to install modules, most likely on Windows, then XML::Parser comes installed with Activestate Perl (it is used by ppm). Use it. Or better yet, learn how to use ppm and use a better XML module. And on Unix installing modules is usually easy.

If not, you can always package an existing pure Perl parser with your code: XML::Parser::Lite for example, or XML::SAX::PurePerl. None of them is a complete XML parser, but they will surely be better than what you will write.

And if you prove me wrong and write a complete XML parser in pure Perl, then you will get complete and unreserved apologies! (The XML::Parser distribution includes some pretty hairy tests; you can use them.)
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 11:08 UTC
Thank you for your comments and advice. The article that you sent me was an interesting read. Also, thanks for the module references. They are handy to know about. Regarding what you said about the modules, it's not that I am unable or unwilling to install a module, but it is doubtful that I can convince my school to install the appropriate modules onto the CGI server that I would be placing this program on. Thanks again for your help. I really appreciate your time. Joe
Re: XML Parsing by graff (Chancellor) on Apr 24, 2004 at 16:02 UTC
The following comments do not represent the "consensus" view among responsible monks at the Monastery. But they are in the spirit of "TIMTOWTDI"...

Sometimes, an XML job is really simple, for instance when the job is to read XML data created by some task-specific program that does nothing but put tags around the columns of a particular flat table -- which appears to be what you have in this case. In effect, if you had access to the original flat table (wherever/whatever it may be) before its contents were decorated with XML tags, you wouldn't need to "parse" XML at all; you would just read the table.

And sometimes, if the XML module(s) you would like are not installed for the perl interpreter you're using (e.g. on a web server that you don't control), it can be... um, a bit complicated or time consuming to get them installed, or to incorporate one of them into your own script.

But if you know that the job is just a matter of stripping tags out of an XML-ized flat table, (warning: heresy alert (: ) you probably don't need an XML parser for that. You could read the input like this (not tested):

    my @tags = qw/NAME LOCATION TIME DATE PRIORITY ATTENDEES DESCRIPTION/;
    my @events;

    open( my $xml, '<', 'datafile.xml' ) or die $!;
    {
        local $/ = "</EVENT>";    # input record separator is the end tag
        while (<$xml>)            # read one whole <EVENT>...</EVENT> into $_
        {
            my %record = ();
            for my $t (@tags) {
                if ( m{<$t>\s*([^<]+)} )    # capture text following an open tag
                {
                    $record{$t} = $1;
                    $record{$t} =~ s/\s+$//;    # optional: remove trailing spaces
                }
            }
            # @events is an array of hashes; the chunk after the last
            # </EVENT> holds no fields, so it is skipped
            push @events, {%record} if %record;
        }
    }
    close $xml;

    # to get back to the data for later use:
    for my $i ( 0 .. $#events ) {
        my $href     = $events[$i];    # you get a reference to the hash
        my %rec_hash = %$href;         # you can make a local copy of it, or
        print "Event #", $i + 1, ":\n";
        print "  $_ = $$href{$_}\n" for ( keys %$href );    # just use the hash ref
    }

Now for the caveats...
Your XML data is not simple (and this kind of simple solution will not work) if the input is not really like a flat table. This would be the case if:

- an event can have two or more instances of a given tag (e.g. multiple descriptions)
- a given tag within an event can contain optional or variable nested tags (e.g. if "attendees" included XML-tagged sub-categories like "invited" vs. "present")
- any of the tags can take optional or variable attributes (e.g. <TIME zone="EST">...)

If your input has any of these features, you could elaborate the "non-parser" approach to handle them, but you might soon reach the point of "diminishing returns", where it would have been better to start with an actual XML parsing module.
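The attribute caveat is easy to see in a small self-contained sketch. The sample data below is made up, the tag list is shortened to two names, and the input is read from an in-memory string (a core Perl feature) rather than from datafile.xml, so none of it is the original poster's real data. It reuses the record-separator idea from the post above.

```perl
use strict;
use warnings;

# Made-up sample: the second event's TIME tag carries an attribute.
my $xml = <<'XML';
<ROOT>
<EVENT>
<NAME>test2</NAME>
<TIME>now</TIME>
</EVENT>
<EVENT>
<NAME>test3</NAME>
<TIME zone="EST">noon</TIME>
</EVENT>
</ROOT>
XML

my @tags = qw/NAME TIME/;
my @events;

# Open the string as a file so the record-separator trick can be shown
# without a data file on disk.
open( my $fh, '<', \$xml ) or die $!;
{
    local $/ = "</EVENT>";    # read one <EVENT>...</EVENT> chunk at a time
    while ( my $chunk = <$fh> ) {
        my %record;
        for my $t (@tags) {
            # Only a bare <TAG> matches; <TAG attr="..."> does not.
            $record{$t} = $1 if $chunk =~ m{<$t>\s*([^<]+)};
        }
        push @events, {%record} if %record;    # skip the trailing </ROOT> chunk
    }
}
close $fh;

for my $i ( 0 .. $#events ) {
    my $time = exists $events[$i]{TIME} ? $events[$i]{TIME} : '(missing)';
    print "Event #", $i + 1, ": NAME=$events[$i]{NAME} TIME=$time\n";
}
```

The second event silently loses its TIME field, which is the kind of quiet failure that makes a real parsing module the safer choice once the input stops being a plain flat table.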
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 18:50 UTC
That is a really nice piece of code and some good advice. I wasn't aware of that input separator code, which seems like it would be very helpful for this situation. Also, the hash idea that you use probably would be a better solution than what I had previously been wanting to do. I will consider what you said about using a module for this situation, even if I have to try to embed some modules into my code directory and link to them there. Thank you very much for your time. Your code and information have been useful to me. Take care, Joe
Re: XML Parsing by blue_cowdawg (Monsignor) on Apr 24, 2004 at 12:57 UTC
To underscore some of what other monks have already told you, let me recommend a book: Perl & XML by Erik Ray and Jason McIntosh. I've only just begun to go through the book, but from the "skimming" of it that I've done it has already given me loads of ideas on doing better XML work with Perl. The moral of the story: don't make work for yourself. There are lots of good modules out there for dealing with XML, starting with XML::Simple, XML::Parser and friends.
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 17:56 UTC
Perhaps I will attempt to build a module structure into my program in such a way that I don't need root to install the modules. Everyone seems to think modules are the way to go. I'll bang away at it. Thanks for your book recommendation and for your advice. Have a nice weekend. Joe
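One common way to do this without root is to keep pure-Perl modules in a directory next to the script and add it to the module search path. The sketch below assumes a hypothetical `lib` subdirectory beside the script; FindBin and lib both ship with Perl. It only sets up the path and does not load any XML module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use FindBin qw($Bin);    # directory containing this script (core module)
use lib "$Bin/lib";      # hypothetical private module directory

# A pure-Perl module copied to, e.g., lib/XML/Parser/Lite.pm could now be
# loaded with a plain "use XML::Parser::Lite;" without touching the
# system-wide Perl installation.

print "Private module path: $INC[0]\n";
```

This only works for modules with no compiled (XS) parts, which is why the pure-Perl parsers mentioned earlier in the thread are the natural candidates.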
Re: XML Parsing by sth (Priest) on Apr 24, 2004 at 22:16 UTC
I would also recommend buying "XML and Perl" by Mark Riehl and Ilya Sterin, published by New Riders. sth
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 23:30 UTC
Good to know. Both are topics that I would like to learn more about so that book would be a fitting choice. Thanks for your input. Joe
Re: Re: XML Parsing by pelagic (Priest) on Apr 25, 2004 at 08:54 UTC
Let me add this url for all those looking for the examples mentioned in the book: errata and download pages pelagic
XML Parsing, DOM, SAX and regexp. by exussum0 (Vicar) on Apr 26, 2004 at 01:17 UTC
Everyone else has pointed out XML::Simple, Twig, etc. But here's a reason NOT to use regexps for this task.

Regular expressions work really well on "stuff" that doesn't require bouncing around. Even if you use regular expressions as only part of the task, you'll wind up rescanning the same strings over and over again. An actual parser goes through the document top down and, at best, interprets everything only once.

XML is not a regular language, which is the kind of language regular expressions work really well on. It's context-free: elements must open and close in a particular order, like balanced parentheses. If you used a regular expression, you might do a LOT of repetitive string scanning.

If you do decide to use a parser, which you probably will, you have a choice of DOM vs SAX. DOM parsers go over the whole document and store everything in memory. For a fairly large document, this may take a long time and require a lot of memory. A SAX parser takes away the convenience of an all-in-memory parse and instead requires you to provide callbacks for when tags open and close. This needs little memory but much more involvement, such as deciding when to start taking in data and when to stop, all based on which tags occur during parsing. After all...

    <xml-a>
      beedebeedebeede
      <xml-b>
        danger buck!
      </xml-b>
      beedebeedebeede
    </xml-a>

is quite legal.

For tiny documents, unlimited memory, or when slow parsing is acceptable, DOM is great. For huge documents, when speed matters, or when a lot of the data is thrown away, SAX may be worth looking into.
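The "balanced parentheses" point can be illustrated with a toy nesting check: a single regex can find tags, but deciding whether they nest properly needs a stack. The function name `tags_balanced` is made up for this sketch, and it deliberately ignores attributes, comments, CDATA and self-closing tags, so it is nowhere near an XML parser.

```perl
use strict;
use warnings;

# Toy well-nestedness check: NOT an XML parser. It ignores attributes,
# comments, CDATA and self-closing tags (all assumptions of this sketch).
sub tags_balanced {
    my ($text) = @_;
    my @open;    # stack of tag names still waiting for their end tag
    while ( $text =~ m{<(/?)([\w-]+)>}g ) {
        my ( $is_close, $name ) = ( $1, $2 );
        if ($is_close) {
            # An end tag must match the most recently opened tag.
            my $expected = pop @open;
            return 0 unless defined $expected && $expected eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? 0 : 1;    # leftover open tags mean unbalanced input
}

print tags_balanced('<xml-a> a <xml-b> b </xml-b> c </xml-a>') ? "ok\n" : "bad\n";
print tags_balanced('<xml-a> a <xml-b> b </xml-a> c </xml-b>') ? "ok\n" : "bad\n";
```

The regex only tokenizes; the stack is what tracks nesting, and that bookkeeping is exactly what a real parser does for you along with everything this sketch skips.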