in reply to XML::Twig parsing poorly structured content

In the h3 handler, set a global header, and use it in the div handler.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::Twig;

my $header;
my $twig = 'XML::Twig'->new(
    twig_handlers => {
        h3 => sub {
            $header = $_->text;
            $_->purge;
        },
        'div[@class="event"]' => sub {
            say $header, "\t", $_->text;
            $_->purge;
        },
    },
);
$twig->parsefile('file.xml');

In bigger projects, you probably don't want a global variable for the header. Instead, you can create a class with two attributes, header and twig: it delegates all the XML-related work to the latter and stores the current header in the former.

#!/usr/bin/perl
{   package XML::Twig::WithHeader;
    use feature qw{ say };
    use Moo;
    use XML::Twig;

    has _header => ( is => 'rw',   init_arg => undef );
    has _twig   => ( is => 'lazy', init_arg => undef );

    sub _build__twig {
        my ($self) = @_;
        my $twig = 'XML::Twig'->new(
            twig_handlers => {
                h3 => sub {
                    $self->_header($_->text);
                    $_->purge;
                },
                'div[@class="event"]' => sub {
                    say $self->_header, "\t", $_->text;
                    $_->purge;
                },
            },
        );
    }

    sub parse {
        my ($self, $file) = @_;
        $self->_twig->parsefile($file);
    }
}

my $twig = 'XML::Twig::WithHeader'->new;
$twig->parse('file.xml');

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^2: XML::Twig parsing poorly structured content
by slugger415 (Monk) on Jan 25, 2017 at 00:36 UTC

    Looks very nice, thank you! and it works, at least for my sample XML.

    I'm not familiar with this handler construction:

    'div[@class="event"]'

    It looks rather XSL-ish. Is there some explanation of how that works? The reason I ask (sheepishly) is that my pseudo XML is simpler than the real stuff, meaning it has sub-levels that I want to parse, e.g.:

    <h3 class="current-day">Thursday, February 2</h3>
    <div class="event">
      <div class="title">Event 1</div>
      <span class="time">7:30pm</span>
      <span class="location">Main Street</span>
    </div>
    <div class="event">
      <div class="title">Event 2</div>
      <span class="time">9pm</span>
      <span class="location">Green Street</span>
    </div>

    Sorry not to be more detailed in my original post. Much appreciated.

      > rather XSL-ish

      It's called XPath. It's used and supported in a wider range of tools, languages, and libraries than just XSL. This particular expression means "a div element whose class attribute has the value 'event'".
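To see the condition at work, here is a minimal sketch. The input string, its root element, and the class="note" div are made up for illustration. Only the div whose class is "event" should trigger the handler.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::Twig;

# Hypothetical sample: two div elements, only one has class="event".
my $xml = q{<root><div class="event">Event 1</div><div class="note">Not an event</div></root>};

my @matched;
my $twig = 'XML::Twig'->new(
    twig_handlers => {
        # Called only for div elements whose class attribute equals "event".
        'div[@class="event"]' => sub { push @matched, $_->text },
    },
);
$twig->parse($xml);

say for @matched;
```

The handler collects the text into @matched instead of printing directly, which makes it easy to check what was matched.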

      > want to parse

      Then you can't use handlers, as you need access to more than just a subtree. The following shows how to do it. Using XML::LibXML would simplify the code in such a case, in my opinion.

      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };

      use XML::Twig;

      my $twig = 'XML::Twig'->new;
      $twig->parsefile(shift);
      my $root = $twig->root;
      for my $header ($root->descendants('h3')) {
          my $date = $header->text;
          my @events = $header->next_siblings(sub {
              my ($elt) = @_;
              'div' eq $elt->name
                  && $elt->prev_sibling('h3') == $header
          });
          say join "\t", $date, map $_->text, $_->children
              for @events;
      }
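For comparison, a rough sketch of the XML::LibXML variant mentioned above. It assumes the real input is well-formed, so the sample is wrapped in a hypothetical root element here. The idea is to start from each event and look back to the nearest preceding h3 sibling.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::LibXML;

# Assumption: the sample from the question, wrapped in a made-up <root>.
my $xml = q{<root>
<h3 class="current-day">Thursday, February 2</h3>
<div class="event">
  <div class="title">Event 1</div>
  <span class="time">7:30pm</span>
  <span class="location">Main Street</span>
</div>
<div class="event">
  <div class="title">Event 2</div>
  <span class="time">9pm</span>
  <span class="location">Green Street</span>
</div>
</root>};

my $dom = 'XML::LibXML'->load_xml(string => $xml);

my @lines;
for my $event ($dom->findnodes('//div[@class="event"]')) {
    # The nearest preceding h3 sibling is the header this event belongs to.
    my $date = $event->findvalue('preceding-sibling::h3[1]');
    # Child elements only (title, time, location); whitespace text is skipped.
    my @fields = map $_->textContent, $event->findnodes('*');
    push @lines, join "\t", $date, @fields;
}
say for @lines;
```

With full XPath available, the grouping by header needs no callback: the preceding-sibling axis does the work.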

      "I'm not familiar with this handler construction: 'div[@class="event"]'"

      Here's the current W3C Recommendation: "XML Path Language (XPath) 2.0 (Second Edition)".

      In almost all cases, I find the "3.2.4 Abbreviated Syntax" section adequate for my needs. This has a description of 'div[@class="event"]' (as para[@type="warning"]); and lots more besides.
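As a small illustration of what "abbreviated" means, the sketch below runs the short form and its spelled-out equivalent against a made-up document. It uses XML::LibXML, which implements XPath 1.0 rather than 2.0, but this abbreviation is the same in both. The document and element names are assumptions for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::LibXML;

# Hypothetical document with one warning paragraph and one note.
my $dom = 'XML::LibXML'->load_xml(
    string => q{<doc><para type="warning">Careful</para><para type="note">FYI</para></doc>},
);
my $root = $dom->documentElement;

# Abbreviated syntax: child axis implied, @ stands for the attribute axis.
my @abbreviated = $root->findnodes('para[@type="warning"]');
# The same expression written with explicit axes.
my @full = $root->findnodes('child::para[attribute::type="warning"]');

say scalar(@abbreviated), ' node(s) via abbreviated syntax';
say scalar(@full), ' node(s) via full syntax';
say $_->textContent for @abbreviated;
```

Both expressions should select the same single para element.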

      — Ken

        Thank you Ken! very useful (and more to learn, as always).

        Scott