diamantis has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I am trying to parse a simple XML file with the XML::Parser module. I use the Stream style, as it seems the most appropriate for my needs. I would like to build a graph out of some entries in the XML file, but I am having a hard time correctly parsing the text of some entries. It seems that the Text subroutine is called a couple of times, at the beginning AND at the end of each entry. Is there any way to control this behavior? Perhaps I have misunderstood something about parsing XML, but it seems more reasonable to me to call the Text sub once per entry, in a consistent place (beginning OR end), which would make my code much simpler.

Any recommendations on what to do?

The main part of the input file looks like:

    <entry> this is the text I would like to parse</entry>
    <entry>more text
      <secondentry>interesting text</secondentry>
    </entry>

So I made a stack that tracks the innermost open entry (pushing in sub StartTag and popping in sub EndTag). I was planning to add a list and push the Text strings onto it while in sub Text, but... I got stuck.

Thanks!

Re: confused with Stream style in XML::Parser
by Fletch (Bishop) on Oct 06, 2008 at 18:43 UTC

    I believe you've misunderstood the idea behind stream mode. Pure stream-mode parsers read the XML bit by bit, and when they've recognized a significant chunk (for lack of a better term; e.g. a start tag, an end tag, a stretch of character data) they stop and pass that information to whatever corresponding callback you've configured. In your example's second <entry> element there look to be six significant chunks:

    • an open <entry>
    • a section of text, "more text\n  "
    • an open <secondentry>
    • another section of text, "interesting text"
    • a close </secondentry>
    • and lastly the close </entry>
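
The chunk-by-chunk behaviour in the list above can be observed directly. What follows is only a minimal sketch, assuming XML::Parser is installed; the package name Trace and the @events array are made up for illustration. It records every Stream-style callback fired for the second <entry> element. Stream style hands the character data accumulated since the previous tag to Text in $_, just before the next start or end tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical handler package: Stream style looks up StartTag, EndTag
# and Text in the package named by the Pkg option.
package Trace;

our @events;    # one string per callback, in document order

sub StartTag {
    my ($expat, $element) = @_;
    push @events, "start <$element>";
}

sub EndTag {
    my ($expat, $element) = @_;
    push @events, "end </$element>";
}

sub Text {
    # Stream style puts the text accumulated since the last tag in $_.
    (my $text = $_) =~ s/\n/\\n/g;    # make newlines visible
    push @events, qq{text "$text"};
}

package main;

# The second <entry> from the question, with the line breaks assumed above.
my $xml = "<entry>more text\n  <secondentry>interesting text</secondentry>\n</entry>";

XML::Parser->new(Style => 'Stream', Pkg => 'Trace')->parse($xml);
print "$_\n" for @Trace::events;
```

With this input the whitespace-only run between </secondentry> and </entry> also arrives as its own Text call, which is probably why Text appears to fire both at the beginning and at the end of an entry.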

    For what it sounds like you want (all of the text of an entry, regardless of nesting, as one "chunk"), the parser would need some way of knowing which start/end tags are significant and which are ignorable. I believe you might be able to use XML::Twig's stream/tree hybrid mode to get this kind of behavior by setting up a handler for <entry> elements in twig_handlers and then calling the text method on the element when it's recognized.
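
A rough sketch of the XML::Twig idea, assuming XML::Twig is installed. The <root> wrapper, the @texts array, and the whitespace clean-up are additions for illustration and not part of the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Sample input; the <root> wrapper is assumed so the document is well-formed.
my $xml = <<'XML';
<root>
<entry> this is the text I would like to parse</entry>
<entry>more text
  <secondentry>interesting text</secondentry>
</entry>
</root>
XML

my @texts;    # one string per <entry>

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each <entry> element has been parsed completely.
        entry => sub {
            my ($t, $entry) = @_;
            my $text = $entry->text;     # text of the element and its descendants
            $text =~ s/\s+/ /g;          # collapse runs of whitespace
            $text =~ s/^\s+|\s+$//g;     # trim both ends
            push @texts, $text;
            $t->purge;                   # free the elements handled so far
            return 1;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @texts;
```

Each <entry> then yields a single string, with the nested <secondentry> text folded in. Whether nested text should be merged or kept apart depends on how the graph is meant to be built.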

    Update: On second read, I think you've understood where the problem is; you've just gotten stuck trying to figure out how to get around it. You're on the right track with the stack idea, though what you'd really want is an accumulator to which you keep appending text chunks until you see the outermost end tag.
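
The accumulator idea might look roughly like this with XML::Parser's Stream style. This is a sketch under assumptions: the package name EntryText, the depth counter, the <root> wrapper, and the whitespace normalisation are all illustrative choices, not something taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical handler package for the Stream style.
package EntryText;

our @texts;         # finished text of each outermost <entry>
my $depth  = 0;     # number of <entry> elements currently open
my $buffer = '';    # accumulator for the current outermost <entry>

sub StartTag {
    my ($expat, $element) = @_;
    return unless $element eq 'entry';
    $buffer = '' if $depth == 0;    # a new outermost entry starts here
    $depth++;
}

sub Text {
    # Stream style passes the accumulated character data in $_.
    $buffer .= $_ if $depth > 0;    # ignore text outside any <entry>
}

sub EndTag {
    my ($expat, $element) = @_;
    return unless $element eq 'entry';
    $depth--;
    return if $depth > 0;           # still inside an outer <entry>
    (my $text = $buffer) =~ s/\s+/ /g;    # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;              # trim both ends
    push @texts, $text;
}

package main;

my $xml = <<'XML';
<root>
<entry> this is the text I would like to parse</entry>
<entry>more text
  <secondentry>interesting text</secondentry>
</entry>
</root>
XML

XML::Parser->new(Style => 'Stream', Pkg => 'EntryText')->parse($xml);
print "$_\n" for @EntryText::texts;
```

However many times Text is called, the chunks are simply appended, and the string is only treated as complete when the outermost </entry> is seen. The depth counter stands in for the stack; a real stack would be needed if each nesting level had to keep its own text.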

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: confused with Stream style in XML::Parser
by Anonymous Monk on Oct 06, 2008 at 21:14 UTC
    Any recommendations on what to do?
    Try XML::Rules, something like
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use XML::Rules;

    my $xml = q~<root>
    <entry> this is the text I would like to parse</entry>
    <entry>more text
      <secondentry>interesting text</secondentry>
    </entry>
    </root>~;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            entry => sub {
                my ($tag, $atts) = @_;
                print $atts->{_content},"\n";
                return;
            },
        },
    );
    $parser->parse($xml);
    __END__
    this is the text I would like to parse
    more text