Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
I need to parse an xml document, but I am new to perl and cannot really understand the functions in XML, though they seem to be quite helpful...
So, I try to parse my xml document as a regular text file (I know it's not recommended but I do not have much time now - I will try to learn the perl-xml stuff in the near future I hope).
The part in my xml document I cannot parse is like the following:
<PubmedArticle>
  <AuthorList CompleteYN="Y">
    <Author ValidYN="Y">
      <LastName>van Beilen</LastName>
      <ForeName>J B</ForeName>
      <Initials>JB</Initials>
    </Author>
    <Author ValidYN="Y">
      <LastName>Penninga</LastName>
      <ForeName>D</ForeName>
      <Initials>D</Initials>
    </Author>
    <Author ValidYN="Y">
      <LastName>Witholt</LastName>
      <ForeName>B</ForeName>
      <Initials>B</Initials>
    </Author>
  </AuthorList>
</PubmedArticle>
<PubmedArticle>
  <AuthorList CompleteYN="Y">
    <Author ValidYN="Y">
      <LastName>Wilde</LastName>
      <ForeName>A</ForeName>
      <Initials>A</Initials>
    </Author>
    <Author ValidYN="Y">
      <LastName>Reaves</LastName>
      <ForeName>B</ForeName>
      <Initials>B</Initials>
    </Author>
    <Author ValidYN="Y">
      <LastName>Banting</LastName>
      <ForeName>G</ForeName>
      <Initials>G</Initials>
    </Author>
  </AuthorList>
  </Article>
</PubmedArticle>
Can you help me? What I am trying to do is read the whole part from <Author ValidYN="Y"> to </Author> using the /s modifier in my regex, but then I cannot capture all of the names; it only matches the first or the last one... I am sure this is an easy task using Perl's XML modules, but unfortunately I am rather a newbie and don't know how to use them...

Replies are listed 'Best First'.
Re: xml confuse
by GrandFather (Saint) on Apr 12, 2008 at 12:31 UTC

    "I do not have much time now - I will try to learn the perl-xml stuff in the near future I hope" is exactly backwards. If you haven't much time now to solve the problem then far and away the best option is to ask for help using something like XML::Twig or possibly XML::TreeBuilder (not XML::Simple btw - it isn't).

    For example, consider the following code using XML::TreeBuilder:

    use strict;
    use warnings;
    use XML::TreeBuilder;

    my $xmlDoc = <<XML;
    <doc>
    ...
    </doc>
    XML

    my $root = XML::TreeBuilder->new ();
    $root->parse ($xmlDoc);

    my @authors = $root->look_down (_tag => 'Author');

    for my $authorElt (@authors) {
        my ($lastName) = $authorElt->look_down (_tag => 'LastName');
        my ($foreName) = $authorElt->look_down (_tag => 'ForeName');

        next unless defined $lastName and defined $foreName;
        print $lastName->as_text (), ', ', $foreName->as_text (), "\n";
    }

    with the ... replaced by your sample XML prints (assuming the spurious </Article> is removed):

    van Beilen, J B
    Penninga, D
    Witholt, B
    Wilde, A
    Reaves, B
    Banting, G
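
    XML::Twig, the other module suggested above, can do the same job with a handler that fires once per Author element. The following is only a rough sketch under a few assumptions: twig_author_names is a made-up helper name, and the sample string is a shortened, well-formed version of the XML from the question wrapped in a <doc> root.

```perl
use strict;
use warnings;
use XML::Twig;

# Collect "LastName, ForeName" for every <Author> in an XML string.
# twig_author_names is a hypothetical helper, not part of XML::Twig.
sub twig_author_names {
    my ($xml) = @_;
    my @names;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time a complete <Author> element has been parsed.
            Author => sub {
                my ($t, $author) = @_;
                my $last = $author->first_child_text('LastName');
                my $fore = $author->first_child_text('ForeName');
                # first_child_text gives an empty string when the child is missing.
                push @names, "$last, $fore" if length $last and length $fore;
            },
        },
    );

    $twig->parse($xml);
    return @names;
}

# Shortened sample (assumption: one root element, no stray </Article>).
my $sample = <<'XML';
<doc>
<PubmedArticle><AuthorList CompleteYN="Y">
<Author ValidYN="Y"><LastName>van Beilen</LastName><ForeName>J B</ForeName><Initials>JB</Initials></Author>
<Author ValidYN="Y"><LastName>Penninga</LastName><ForeName>D</ForeName><Initials>D</Initials></Author>
</AuthorList></PubmedArticle>
</doc>
XML

print "$_\n" for twig_author_names($sample);
```

    The handler approach keeps the per-author logic in one place, which may be convenient if the real file holds many PubmedArticle records.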

    Perl is environmentally friendly - it saves trees
Re: xml confuse
by ww (Archbishop) on Apr 12, 2008 at 11:26 UTC
    So if you had only seen an airplane in the sky above, or from the passenger cabin, you would attempt to fly one, but only if it were small?

    Better perhaps, to go to flight school...

    Or, in this case, to the Tutorials section here, after which you may have a better idea of which of the XML::* family will make your task easy.

    And if, and only if, the notion that a little flight school would help appeals to you, you may seek wisdom on regexen at

    • perldoc perlretut
    • perldoc perlre
    • ... and family
    • ...and (again) the tutorials here.

    some of which may help you understand why using regexen is -- as you already know -- "not recommended" and why using a proper tool might let you avoid becoming a "smoking hole."

    Update: <strike> to fix wording

Re: xml confuse
by deibyz (Hermit) on Apr 12, 2008 at 11:18 UTC
    As a general rule, trying to parse XML with a regexp is not a very good idea. Why don't you try one of the XML modules you can find on CPAN? You can start, for example, with XML::Simple.
      I tried using XML::Simple and it works ok when I have one thing to match, like:
      <Year>1992</Year>
      and I thought it was easy. But when it came to matching more than one occurrence of the same element (Author), I could not do it... I think it has something to do with objects or something like that, and I do not have a clue in this area...
Re: xml confuse
by Jenda (Abbot) on Apr 12, 2008 at 18:04 UTC
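    use strict;
    use warnings;
    use XML::Simple qw(XMLin);

    # $xml is assumed to hold a single <PubmedArticle>...</PubmedArticle> document.
    # ForceArray makes Author an array reference even when there is only one author.
    my $data = XMLin( $xml, ForceArray => [qw(Author)]);
    #use Data::Dumper;
    #print Dumper($data);

    foreach my $author (@{$data->{AuthorList}{Author}}) {
        print "$author->{ForeName} $author->{LastName}\n";
    }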
    or
    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => [
            _default => 'content',
            Author => 'as array',
            AuthorList => sub { return Authors => $_[1]->{Author} },
            PubmedArticle => 'pass',
        ],
        stripspaces => 7,
    );

    my $data = $parser->parse( $xml);
    use Data::Dumper;
    print Dumper($data);

    foreach my $author (@{$data->{Authors}}) {
        print "$author->{ForeName} $author->{LastName}\n";
    }

    The latter is a bit more work, but lets you tweak the resulting data structure to better fit your needs.
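
    In case the @{ ... } in the loops looks mysterious: a repeated element such as Author typically comes back as a reference to an array of hashes, one hash per author, and no objects are involved. The sketch below uses only core Perl and a hand-written stand-in for the parsed data (an assumption; the exact shape depends on the parser and its options). The helper name author_names is made up for illustration. It also accepts a lone hash, which is roughly what XML::Simple returns for a single Author when ForceArray is not set.

```perl
use strict;
use warnings;

# Hand-written stand-in for a parsed <PubmedArticle>; the real structure
# depends on the parser and its options (assumption for illustration).
my $article = {
    AuthorList => {
        CompleteYN => 'Y',
        Author     => [
            { ValidYN => 'Y', LastName => 'van Beilen', ForeName => 'J B', Initials => 'JB' },
            { ValidYN => 'Y', LastName => 'Penninga',   ForeName => 'D',   Initials => 'D'  },
            { ValidYN => 'Y', LastName => 'Witholt',    ForeName => 'B',   Initials => 'B'  },
        ],
    },
};

# Return "ForeName LastName" for each author. Author may be an array
# reference (several authors) or a single hash reference (one author).
sub author_names {
    my ($art) = @_;
    my $authors = $art->{AuthorList}{Author};
    return () unless defined $authors;

    # Normalise to a plain list of hash references.
    my @list = ref $authors eq 'ARRAY' ? @$authors : ($authors);
    return map { "$_->{ForeName} $_->{LastName}" } @list;
}

print "$_\n" for author_names($article);
```

    Checking ref like this is one way to cope with both shapes; forcing an array at parse time, as ForceArray does above, avoids the check altogether.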