Re: Apache +XML parsing

Re^2: Apache +XML parsing
by Jenda (Abbot) on Nov 08, 2008 at 23:17 UTC

That article is six years old and is written as if DOM and SAX were the only contenders. soliplaya, have a look at XML::Twig and XML::Rules. The first is often recommended here, the second is mine :-) and a nice combination of event-based and tree-based parsing.

Jenda
Support Denmark!
Defend the free world!

Re^3: Apache +XML parsing

by soliplaya (Beadle) on Nov 09, 2008 at 14:00 UTC

my $href = {
  'TI'  => [ 'content of <PubArticle><Article><Title> tag' ],
  'AU'  => [ 'content of <PubArticle><Article><Authors><Author name="author1"> tag',
             'content of <PubArticle><Article><Authors><Author name="author2"> tag',
             # etc.
           ],
  'REF' => [ # and so on...
           ],
};

Re^4: Apache +XML parsing

by Jenda (Abbot) on Nov 09, 2008 at 17:55 UTC

You can use XML::Rules->inferRulesFromExample() (or XML::Rules->inferRulesFromDTD() if you have the DTD) to get a basic set of rules for the document and see what data structure that would create (using Data::Dumper). Then you can start tweaking the rules to create a nicer structure. For example, if you do not need the author names split into parts and only want the valid ones, you can delete 'Author' from the list of tags with the 'as array' built-in rule and add a rule like this:

  'Author' => sub {
    return unless $_[1]->{ValidYN} eq 'Y';
    return "$_[1]->{ForeName} $_[1]->{LastName}";
  },

And then continue tweaking the rules to filter the stuff you are not interested in, format stuff the way you want, rename hash keys etc.
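
To make the tweaking step concrete, here is a minimal sketch of what a hand-adjusted rule set might look like. The sample XML is made up for illustration (only ValidYN, ForeName and LastName come from the rule above), and the TI/AU key names are borrowed from the structure asked for earlier in the thread. A starting rule set would normally come from XML::Rules->inferRulesFromExample() first; this sketch skips that step and writes the rules by hand.

```perl
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

# Hypothetical sample document; the real one will have more tags.
my $xml = <<'XML';
<PubArticle>
  <Article>
    <Title>Some title</Title>
    <Authors>
      <Author ValidYN="Y"><ForeName>Ann</ForeName><LastName>Lee</LastName></Author>
      <Author ValidYN="N"><ForeName>Bob</ForeName><LastName>Ray</LastName></Author>
    </Authors>
  </Article>
</PubArticle>
XML

my $parser = XML::Rules->new(
    stripspaces => 7,    # drop whitespace-only text between tags
    rules => {
        # Name parts become plain entries in the <Author> hash.
        'ForeName,LastName' => 'content',

        # Rename Title to TI and collect it into an array.
        'Title' => sub { '@TI' => $_[1]{_content} },

        # Keep only valid authors, joined into one string, pushed onto AU.
        'Author' => sub {
            my ($tag, $attrs) = @_;
            return unless ($attrs->{ValidYN} // '') eq 'Y';
            return '@AU' => "$attrs->{ForeName} $attrs->{LastName}";
        },

        # Dissolve the wrapper tags into their parent.
        'Authors,Article' => 'pass no content',
        'PubArticle'      => 'no content',
    },
);

my $data = $parser->parse($xml);
print Dumper($data);
```

The '@' prefix on a returned key asks XML::Rules to push the value onto an array in the parent's hash, which is how repeated tags such as Author end up as a list rather than overwriting each other.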

Re^5: Apache +XML parsing

by soliplaya (Beadle) on Nov 18, 2008 at 10:53 UTC

Re^6: Apache +XML parsing

by Jenda (Abbot) on Nov 18, 2008 at 15:21 UTC

Re^3: Apache +XML parsing

by Anonymous Monk on Nov 09, 2008 at 09:46 UTC

XML::Rules is SAX (maybe both), and XML::Twig is definitely both.

Re^4: Apache +XML parsing

by Jenda (Abbot) on Nov 09, 2008 at 17:40 UTC

Nope. XML::Rules is a stream parser with a twist, but it is in no way related to the SAX (Simple API for XML) standard, and XML::Twig is a tree-based parser with a twist, with no relation to the DOM standard. Both sit on top of XML::Parser (actually, XML::Parser::Expat in the case of XML::Rules, but that's part of the same package). There are several Perl implementations of both SAX and DOM, and several modules that have their own different (usually more perlish) API.
