Re: XML Parsing

The following comments do not represent the "consensus" view among responsible monks at the Monastery. But they are in the spirit of "TIMTOWTDI"...

Sometimes, an XML job is really simple, for instance when the job is to read XML data created by some task-specific program that does nothing but put tags around the columns of a particular flat table -- which appears to be what you have in this case. In effect, if you had access to the original flat table (wherever/whatever it may be) before its contents were decorated with XML tags, you wouldn't need to "parse" XML at all; you would just read the table.
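To illustrate what "just read the table" could look like, here is a minimal sketch. It assumes (hypothetically; the real source format is unknown) that the original table is tab-separated with one event per line, and it borrows the column names from the tag list used later in this post. The sample rows are made up.

```perl
use strict;
use warnings;

# Column names, in the order they are assumed to appear in each row.
my @tags = qw/NAME LOCATION TIME DATE PRIORITY ATTENDEES DESCRIPTION/;

# Made-up sample data standing in for the original flat table.
my $table = join '',
    "Staff meeting\tRoom 12\t10:00\t2004-04-26\thigh\tAnn, Bob\tBudget review\n",
    "Lunch\tCafeteria\t12:30\t2004-04-26\tlow\tEveryone\t\n";

# Read from the string as if it were a file (in-memory filehandle).
open( my $fh, '<', \$table ) or die "Cannot open in-memory table: $!";

my @events;
while ( my $line = <$fh> ) {
    chomp $line;
    my %record;
    # Hash slice: assign each column to its name; -1 keeps empty trailing fields.
    @record{@tags} = split /\t/, $line, -1;
    push @events, \%record;
}
close $fh;

print "$_->{NAME} at $_->{LOCATION}\n" for @events;
```

No tags, no regex matching on markup: each row becomes a hash with a single split.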

And sometimes, if the XML module(s) you would like are not installed for the perl interpreter you're using (e.g. on a web server that you don't control), it can be... um, a bit complicated or time-consuming to get them installed, or to incorporate one of them into your own script. But if you know that the job is just a matter of stripping tags out of an XML-ized flat table, (warning: heresy alert (: ) you probably don't need an XML parser for that.

You could read the input like this (not tested):

my @tags = qw/NAME LOCATION TIME DATE PRIORITY ATTENDEES DESCRIPTION/;

my @events;

open( my $xml, '<', 'datafile.xml' ) or die "Cannot open datafile.xml: $!";
{
   local $/ = "</EVENT>";  # input record separator is the end-tag

   while (<$xml>)   # read one whole <EVENT>...</EVENT> into $_
   {
      my %record = ();
      for my $t ( @tags )
      {
         if ( m{<$t>\s*([^<]+)} ) # capture text following an open tag
         {
            $record{$t} = $1;
            $record{$t} =~ s/\s+$//; # optional: remove trailing spaces
         }
      }
      # the text after the last </EVENT> yields no fields, so skip it
      push @events, { %record } if %record; # @events is an array of hashes
   }
}
close $xml;

# to get back to the data for later use:

for my $i ( 0 .. $#events ) {
    my $href = $events[$i]; # you get a reference to the hash
    my %rec_hash = %$href;  # you can make a local copy of it, or
    print "Event #", $i+1, ":\n";
    print " $_ = $$href{$_}\n" for ( keys %$href ); # just use the hash ref
}

Now for the caveats... Your XML data is not simple (and this kind of simple solution will not work) if the input is not really like a flat table. This would be the case if:

an event can have two or more instances of a given tag (e.g. multiple descriptions)
a given tag within an event can contain optional or variable nested tags (e.g. if "attendees" included XML-tagged sub-categories like "invited" vs. "present")
any of the tags can take optional or variable attributes (e.g. <TIME zone="EST">...)

If your input has any of these features, you could elaborate the "non-parser" approach to handle them, but you might soon reach the point of "diminishing returns", where it would have been better to start with an actual XML parsing module.
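As a rough illustration of such an elaboration, the sketch below extends the same regex idea to cope with repeated tags and with attributes. It is only a sketch under assumptions: the sample event, the shortened tag list, and the record layout (an array of { text, attr } hashes per tag) are all made up for the example, and nested tags are still not handled.

```perl
use strict;
use warnings;

# Hypothetical sample: an attribute on TIME and two DESCRIPTION elements.
my $event = <<'XML';
<EVENT>
  <NAME>Staff meeting</NAME>
  <TIME zone="EST">10:00</TIME>
  <DESCRIPTION>Budget review</DESCRIPTION>
  <DESCRIPTION>Hiring plan</DESCRIPTION>
</EVENT>
XML

my @tags = qw/NAME TIME DESCRIPTION/;
my %record;

for my $t (@tags) {
    # \b stops <NAME from matching a longer tag name; [^>]* captures any
    # attributes; the /g loop visits every instance of the tag.
    while ( $event =~ m{<$t\b([^>]*)>\s*([^<]*)}g ) {
        my ( $attrs, $text ) = ( $1, $2 );
        $text =~ s/\s+$//;                          # trim trailing whitespace
        my %attr = $attrs =~ m{(\w+)="([^"]*)"}g;   # name="value" pairs
        push @{ $record{$t} }, { text => $text, attr => \%attr };
    }
}

for my $t (@tags) {
    for my $item ( @{ $record{$t} || [] } ) {
        my $zone = $item->{attr}{zone};
        print "$t = $item->{text}",
              ( defined $zone ? " (zone $zone)" : "" ), "\n";
    }
}
```

Even this small step makes the data structure and the regexes noticeably more complicated, which is the "diminishing returns" point: one more feature (nesting, entities, CDATA) and a real parser is the cheaper option.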


Replies are listed 'Best First'.
Re: Re: XML Parsing by JoeJaz (Monk) on Apr 24, 2004 at 18:50 UTC
That is a really nice piece of code and some good advice. I wasn't aware of that input separator code, which seems like it would be very helpful for this situation. Also, the hash idea that you use probably would be a better solution than what I had previously been wanting to do. I will consider what you said about using a module for this situation, even if I have to try to embed some modules into my code directory and link to them there. Thank you very much for your time. Your code and information have been useful to me. Take care, Joe
