JoeJaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am still pretty bad at regular expressions and could use some help parsing an XML file with them (though if there are other ways of getting data out of the XML file, I am all ears, as long as I don't have to install a new Perl module to do so). I have an XML file that looks something like this: <more>
<EVENT>
  <NAME>test2</NAME>
  <LOCATION>iwu</LOCATION>
  <TIME>now</TIME>
  <DATE>today</DATE>
  <PRIORITY>interest</PRIORITY>
  <ATTENDEES>a lot</ATTENDEES>
  <DESCRIPTION> descrip</DESCRIPTION>
</EVENT>
<EVENT>
  <NAME>test3</NAME>
  <LOCATION>hi</LOCATION>
  <TIME>joe</TIME>
  <DATE>how</DATE>
  <PRIORITY>interest</PRIORITY>
  <ATTENDEES>are</ATTENDEES>
  <DESCRIPTION> </DESCRIPTION>
</EVENT>
I would like to count how many events are in the file and store the information about each event into a two dimensional array (or something like that). Here are my thoughts. This regex
$variable =~ s/<EVENT>.*?<\/EVENT>//gi;
, from what I understand, will place the text between the EVENT tags into a variable. However, I'm not sure what will happen if I have more than one set of EVENT tags, as in the XML above. I would like to isolate each event (and I want to allow an indefinite number of events) into a separate array element. Then I can begin splitting up the data for each event into variables somehow. Does anyone have any ideas on how to do this? I would be very grateful for any help that you could offer me. Thanks for taking the time to read this. Joe
Added: Forgive me, the above regex should look more like this:
if ($page_body =~ /.*?<EVENT>(.*?)<\/EVENT>.*?/) { $variable = $1; }
</more>

Replies are listed 'Best First'.
Re: XML Parsing
by Enlil (Parson) on Apr 24, 2004 at 07:48 UTC
    I would really recommend a module for this sort of thing. For example using XML::Simple:
    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    my $string = do { local $/; <DATA> };
    my $ref = XMLin($string);
    print Dumper $ref;
    my $num_events = @{ $ref->{EVENT} };
    print "There are $num_events events listed\n";

    __DATA__
    <ROOT>
    <EVENT>
      <NAME>test2</NAME>
      <LOCATION>iwu</LOCATION>
      <TIME>now</TIME>
      <DATE>today</DATE>
      <PRIORITY>interest</PRIORITY>
      <ATTENDEES>a lot</ATTENDEES>
      <DESCRIPTION> descrip</DESCRIPTION>
    </EVENT>
    <EVENT>
      <NAME>test3</NAME>
      <LOCATION>hi</LOCATION>
      <TIME>joe</TIME>
      <DATE>how</DATE>
      <PRIORITY>interest</PRIORITY>
      <ATTENDEES>are</ATTENDEES>
      <DESCRIPTION> </DESCRIPTION>
    </EVENT>
    </ROOT>
    As for the code you posted you need to add the /s modifier or the .'s will not match the newline characters (will not cross over lines). If it were me and was doing a quick hack I would probably still use XML::Simple, but as for changing your regex to capture multiple matches you might try something like the following:
    my @events;
    while ($page_body =~ /<EVENT>(.*?)<\/EVENT>/sg) {
        push @events, $1;    # note: $1 might have zero length
    }
    Also note that it is pointless to have .*? at the very start of a regular expression: it causes a lot of needless backtracking and never really needs to match anything, because a regex looks for the pattern anywhere in the string (unless it is anchored).
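
    To make the points above concrete, here is a minimal sketch. The sample string is shortened from the question, and it assumes each event spans several lines, as it would in a file. The variable names @without_s, @with_s and @with_prefix are made up for this illustration.

```perl
use strict;
use warnings;

# Sample input, shortened from the question; each event spans several lines.
my $page_body = <<'XML';
<EVENT>
<NAME>test2</NAME>
<LOCATION>iwu</LOCATION>
</EVENT>
<EVENT>
<NAME>test3</NAME>
<LOCATION>hi</LOCATION>
</EVENT>
XML

# Without /s, "." cannot match a newline, so a multi-line event never matches.
my @without_s = $page_body =~ /<EVENT>(.*?)<\/EVENT>/g;

# With /s, "." also matches newlines; /g in list context returns every capture.
my @with_s = $page_body =~ /<EVENT>(.*?)<\/EVENT>/sg;

# A leading .*? changes nothing about what gets captured.
my @with_prefix = $page_body =~ /.*?<EVENT>(.*?)<\/EVENT>/sg;

printf "without /s:      %d event(s)\n", scalar @without_s;
printf "with /s:         %d event(s)\n", scalar @with_s;
printf "with leading .*?: %d event(s)\n", scalar @with_prefix;
```

    On this input the first pattern finds nothing, while the other two find both events and capture identical text, which is why the leading .*? can simply be dropped.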

    -enlil

      Just a little nitpick. Replace this:

      my $string = do { local $/; <DATA>}; my $ref = XMLin($string);
      With this:
      my $ref = XMLin(\*DATA);

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
      Thanks a lot for your advice. I will study the XML::Simple module and see what it has to offer. The note about the .*? is good to know. I surely don't want my code needlessly using CPU cycles. The above code snippet is precisely what I was trying to do. Thanks again. Joe
Re: XML Parsing
by mirod (Canon) on Apr 24, 2004 at 09:59 UTC

    Oh boy! One of those! Again! ;--)

    First, let's start with the basics: no you won't write a proper XML parser using regexps. See On XML parsing for some of the things that can, and will, trip your code, and why you shouldn't call what you write an XML Parser if it isn't one.

    Then why can't you install a new module?? Don't you think your time would be better spent learning how to install a module, rather than writing a half-baked sorta-XML parser? If you are worried about distributing the code to people who won't know how to install modules, most likely on Windows, then XML::Parser comes installed with Activestate Perl (it is used by ppm). Use it. Or better yet learn how to use ppm and use a better XML module. And on Unix installing modules is usually easy. If not you can always package an existing pure Perl parser with your code: XML::Parser::Lite for example, or XML::SAX::PurePerl. None of them is a complete XML parser, but they will surely be better than what you will write.

    And if you prove me wrong and write a complete XML parser in pure perl, then you will get complete and unreserved apologies! (the XML::Parser distribution includes some pretty hairy tests, you can use them).

      Thank you for your comments and advice. The article that you sent me was an interesting read. Also, thanks for the module references. They are handy to know about. Regarding what you said about the modules, it's not that I am unable or unwilling to install a module, but it is doubtful that I can convince my school to install the appropriate modules onto the CGI server that I would be placing this program on. Thanks again for your help. I really appreciate your time. Joe
Re: XML Parsing
by graff (Chancellor) on Apr 24, 2004 at 16:02 UTC
    The following comments do not represent the "consensus" view among responsible monks at the Monastery. But they are in the spirit of "TIMTOWTDI"...

    Sometimes, an XML job is really simple, for instance when the job is to read XML data created by some task-specific program that does nothing but put tags around the columns of a particular flat table -- which appears to be what you have in this case. In effect, if you had access to the original flat table (wherever/whatever it may be) before its contents were decorated with XML tags, you wouldn't need to "parse" XML at all; you would just read the table.

    And sometimes, if the XML module(s) you would like are not installed for the perl interpreter you're using (e.g. on a web server that you don't control), it can be... um, a bit complicated or time consuming to get them installed, or to incorporate one of them into your own script. But if you know that the job is just a matter of stripping tags out of XML-ized flat table, (warning: heresy alert (: ) you probably don't need an XML parser for that.

    You could read the input like this (not tested):

    my @tags = qw/NAME LOCATION TIME DATE PRIORITY ATTENDEES DESCRIPTION/;
    my @events;
    open( XML, "<datafile.xml" ) or die $!;
    {
        local $/ = "</EVENT>";    # input record separator is end-tag
        while (<XML>)             # read one whole <EVENT>...</EVENT> into $_
        {
            my %record = ();
            for my $t ( @tags ) {
                if ( m{<$t>\s*([^<]+)} )    # capture text following an open tag
                {
                    $record{$t} = $1;
                    $record{$t} =~ s/\s+$//;    # optional: remove trailing spaces
                }
            }
            # @events is an array of hashes; skip the empty chunk that
            # follows the last </EVENT> (e.g. a trailing newline)
            push @events, { %record } if %record;
        }
    }
    close XML;

    # to get back to the data for later use:
    for my $i ( 0 .. $#events ) {
        my $href = $events[$i];    # you get a reference to the hash
        my %rec_hash = %$href;     # you can make a local copy of it, or
        print "Event #", $i+1, ":\n";
        print " $_ = $$href{$_}\n" for ( keys %$href );    # just use the hash ref
    }
    Now for the caveats... Your XML data is not simple (and this kind of simple solution will not work) if the input is not really like a flat table. This would be the case if:
    • an event can have two or more instances of a given tag (e.g. multiple descriptions)
    • a given tag within an event can contain optional or variable nested tags (e.g. if "attendees" included XML-tagged sub-categories like "invited" vs. "present")
    • any of the tags can take optional or variable attributes (e.g. <TIME zone="EST">...)

    If your input has any of these features, you could elaborate the "non-parser" approach to handle them, but you might soon reach the point of "diminishing returns", where it would have been better to start with an actual XML parsing module.
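
    As a small illustration of the third caveat, the sketch below runs the same kind of pattern against a record whose TIME tag carries an attribute. The two records and the helper names plain_match and loose_match are hypothetical, and the "loose" pattern is only one possible elaboration; it would still fail on attribute values that contain ">".

```perl
use strict;
use warnings;

# Hypothetical records: the second one carries an attribute on TIME.
my @records = (
    '<EVENT><TIME>now</TIME></EVENT>',
    '<EVENT><TIME zone="EST">now</TIME></EVENT>',
);

# Same pattern as in the reader loop above: a literal "<TAG>" then text.
sub plain_match {
    my ( $rec, $tag ) = @_;
    my ($value) = $rec =~ m{<$tag>\s*([^<]+)};
    return $value;    # undef when the tag has attributes
}

# One possible elaboration: allow (and ignore) attributes after the tag name.
sub loose_match {
    my ( $rec, $tag ) = @_;
    my ($value) = $rec =~ m{<$tag\b[^>]*>\s*([^<]+)};
    return $value;
}

for my $rec (@records) {
    my $plain = plain_match( $rec, 'TIME' );
    my $loose = loose_match( $rec, 'TIME' );
    printf "%-45s plain=%s loose=%s\n", $rec,
        defined $plain ? $plain : '(no match)',
        defined $loose ? $loose : '(no match)';
}
```

    Each such patch handles one more case, which is the "diminishing returns" slope: repeated tags and nested tags would each need their own special handling.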

      That is a really nice piece of code and some good advice. I wasn't aware of that input separator code, which seems like it would be very helpful for this situation. Also, the hash idea that you use would probably be a better solution than what I had previously been wanting to do. I will consider what you said about using a module for this situation, even if I have to try to embed some modules into my code directory and link to them there. Thank you very much for your time. Your code and information have been useful to me. Take care, Joe
Re: XML Parsing
by blue_cowdawg (Monsignor) on Apr 24, 2004 at 12:57 UTC

    To underscore some of what other monks have already told you, let me recommend a book: Perl & XML by Erik Ray and Jason McIntosh.

    I've only just begun to go through the book but from the "skimming" of it that I've done it has given me loads of ideas already on doing better XML work with Perl.

    The moral of the story: don't make work for yourself. There are lots of good modules out there for dealing with XML, starting with XML::Simple, XML::Parser and friends.

      Perhaps I will attempt to build a module structure into my program in such a way that I don't need root to install the modules. Everyone seems to think modules are the way to go. I'll bang away at it. Thanks for your book recommendation and for your advice. Have a nice weekend. Joe
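
      One common way to do that, sketched here under the assumption that the modules are copied into a lib directory next to the script (the directory name is made up for this example), is to add that directory to the module search path with the core FindBin and lib modules:

```perl
use strict;
use warnings;

# FindBin and lib both ship with perl, so nothing here needs installing.
use FindBin qw($Bin);    # $Bin = directory containing this script
use lib "$Bin/lib";      # hypothetical layout: modules copied to ./lib

# Modules copied under that directory can now be loaded as usual, e.g.
#   use XML::Simple;     # would be looked up as $Bin/lib/XML/Simple.pm

my $found = grep { $_ eq "$Bin/lib" } @INC;
print "local lib directory is ", ( $found ? "" : "not " ), "on the search path\n";
```

      This only works by plain copying for pure-Perl modules. A module with compiled parts, such as XML::Parser, has to be built for the server it runs on.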
Re: XML Parsing
by sth (Priest) on Apr 24, 2004 at 22:16 UTC

    I would also recommend buying "XML and Perl" by Mark Riehl and Ilya Sterin, published by New Riders.

    sth

      Good to know. Both are topics that I would like to learn more about so that book would be a fitting choice. Thanks for your input. Joe
      Let me add this url for all those looking for the examples mentioned in the book:
      errata and download pages

      pelagic
XML Parsing, DOM, SAX and regexp.
by exussum0 (Vicar) on Apr 26, 2004 at 01:17 UTC
    Everyone else has pointed out XML::Simple, Twig etc..

    But here's a reason NOT to use regexps for this task. Regular expressions work really well on "stuff" that doesn't require bouncing around. Even if you use regular expressions as only part of the task, you'll wind up rescanning the same strings over and over again.

    An actual parser will go from the top down, doing all interpretation at most once. XML is not a regular language, which is the kind of language a regular expression works really well on. It's context-free, which here means that things need to go in a certain order: tags open and close in a particular order, like balanced parentheses. If you used a regular expression, you may do a LOT of repetitive string scanning.

    If you do decide to use a parser, which you probably will, you have a choice of DOM vs SAX. DOM parsers go over the document and store everything in memory. For a fairly large document, this may take a long time and require a lot of memory.
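
    The balanced-parentheses point can be sketched with a stack. The toy check below is an illustration only, not an XML parser: the function name tags_balanced is made up, and it understands nothing but bare <NAME> and </NAME> tags (no attributes, comments, CDATA or self-closing tags). The regex only finds the next tag; remembering what is still open is the stack's job.

```perl
use strict;
use warnings;

# Toy check that tags open and close in a consistent order.
sub tags_balanced {
    my ($text) = @_;
    my @open;                                  # stack of currently open tags
    while ( $text =~ m{<(/?)([\w-]+)>}g ) {    # one pass over the tags
        my ( $closing, $name ) = ( $1, $2 );
        if ( !$closing ) {
            push @open, $name;                 # opening tag: remember it
        }
        else {
            my $expected = pop @open;          # closing tag: must match the top
            return 0 unless defined $expected && $expected eq $name;
        }
    }
    return @open ? 0 : 1;                      # leftovers mean unclosed tags
}

print tags_balanced('<a><b>text</b></a>') ? "balanced\n" : "not balanced\n";
print tags_balanced('<a><b>text</a></b>') ? "balanced\n" : "not balanced\n";
```

    The first call prints "balanced" and the second "not balanced". The string is scanned once, top down, which is the behaviour described above.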

    A SAX parser would take away the convenience of doing an all-in-memory style parse, but require you to provide callbacks when tags open and close. This requires little memory, but much more involvement, such as when to start taking in data, when not to.. all based on when tags occur during parsing. After all...

    <xml-a>
      beedebeedebeede
      <xml-b> danger buck! </xml-b>
      beedebeedebeede
    </xml-a>
    Is quite legal.

    For tiny documents, or when memory is unlimited and slow parsing is acceptable, DOM is great. For huge documents, when speed matters, or when a lot of the data is throwaway, SAX may be worth looking into.
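
    To give a feel for the callback style, here is a hand-rolled toy, not a real SAX parser: the scan_events function and the start/end/text callback names are invented for this sketch, and it handles only bare tags and text. With a real SAX-style module the scanning would be the module's job and only the callbacks would be yours. It walks the mixed-content example from above once and uses a stack to decide which element each piece of text belongs to.

```perl
use strict;
use warnings;

# Toy event-driven scanner: walks the text once and fires callbacks.
sub scan_events {
    my ( $text, %on ) = @_;
    while ( $text =~ m{\G(?:<(/?)([\w-]+)>|([^<]+))}gc ) {
        if    ( defined $3 ) { $on{text}->($3) }     # run of text
        elsif ( $1 )         { $on{end}->($2) }      # closing tag
        else                 { $on{start}->($2) }    # opening tag
    }
}

my $doc = '<xml-a> beedebeedebeede <xml-b> danger buck! </xml-b> beedebeedebeede </xml-a>';

my @stack;      # which elements are currently open
my %text_of;    # text collected per element, keyed by tag name

scan_events(
    $doc,
    start => sub { push @stack, $_[0] },
    end   => sub { pop @stack },
    text  => sub {
        my ($chunk) = @_;
        return unless @stack && $chunk =~ /\S/;    # skip whitespace-only text
        $chunk =~ s/^\s+|\s+$//g;                  # trim both ends
        push @{ $text_of{ $stack[-1] } }, $chunk;  # credit the innermost element
    },
);

print "$_: @{ $text_of{$_} }\n" for sort keys %text_of;
```

    Nothing is kept in memory except what the callbacks choose to keep, which is the trade-off described above: less memory, more bookkeeping on your side.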