LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm doing some XML again after a long time and trying out XML::Twig.

Below is example code running on a node from HaukeX.

I was looking for a more generic way than writing handlers for each tag and found the ->simplify method, which looks good enough for that task. (Yeah, I know XML::Simple is evil, but so, it seems, is the monastery's output ;-p )

use strict;
use warnings;
use Data::Dump qw/pp dd/;

my $data = join "", <DATA>;

use XML::Twig;

$\ = "\n";

print "=== HANDLER:\n";
my $twig = XML::Twig->new(
    twig_handlers => {
        'field[@name="doctext"]' => sub {
            print $_->gi, "Post: ", $_->child_text(0)
        },
        'author' => sub {
            print "ID: ", $_->att("id");
            print "Name: ", $_->child_trimmed_text(0);
        },
    },
);
$twig->parse($data);

print "=== SIMPLIFIED:\n";
$twig = XML::Twig->new();
print pp $twig->parse($data)->simplify();

__DATA__
<?xml version="1.0" encoding="Windows-1252"?>
<node id="11100665" title="Re^5: What does $_ = qq~&quot;$_&quot;~ do?" created="2019-05-28 16:28:57" updated="2019-05-28 16:28:57">
<type id="11">
note</type>
<author id="830549">
haukex</author>
<data>
<field name="doctext">
&lt;p&gt;More fun facts! I once wrote a script to search a word list for words that make valid regexen which convert one valid word into another.&lt;/p&gt;
&lt;c&gt;
$ perl -le 'print bangs =~s engender'
bands
$ perl -le 'print halved =~s avatar'
halted
$ perl -le 'print stove =~s evener'
stone
&lt;/c&gt;
</field>
<field name="root_node">
11100593</field>
<field name="parent_node">
11100640</field>
<field name="reputation">
21</field>
</data>
</node>

What I don't like are the leading newlines in many content fields, as in content => "\nhaukex":

=== HANDLER:

ID: 830549
Name: haukex
fieldPost:
<p>More fun facts! I once wrote a script to search a word list for words that make valid regexen which convert one valid word into another.</p>
<c>
$ perl -le 'print bangs =~s engender'
bands
$ perl -le 'print halved =~s avatar'
halted
$ perl -le 'print stove =~s evener'
stone
</c>

=== SIMPLIFIED:

{
  author => { 830549 => { content => "\nhaukex" } },
  created => "2019-05-28 16:28:57",
  data => {
    field => {
      doctext => {
        content => "\n<p>More fun facts! I once wrote a script to search a word list for words that make valid regexen which convert one valid word into another.</p>\n<c>\n\$ perl -le 'print bangs =~s engender'\nbands\n\$ perl -le 'print halved =~s avatar'\nhalted\n\$ perl -le 'print stove =~s evener'\nstone\n</c>\n",
      },
      parent_node => { content => "\n11100640" },
      reputation => { content => "\n21" },
      root_node => { content => "\n11100593" },
    },
  },
  title => "Re^5: What does \$_ = qq~\"\$_\"~ do?",
  type => { 11 => { content => "\nnote" } },
  updated => "2019-05-28 16:28:57",
}

I couldn't find an option for ->simplify(%options) to trim the content.

I had to use child_trimmed_text(0) when writing handlers....
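
If no suitable option turns up, one workaround is to clean the structure after simplify has built it. The following is only a sketch: trim_leaves is a hypothetical helper, not part of XML::Twig, and the sample hash is a hand-copied fragment of the output above. It strips leading and trailing whitespace from every string leaf, so it would also remove the trailing newline of the doctext content, which may or may not be wanted.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::Twig API): walk a nested hash/array
# structure and strip leading and trailing whitespace from every string
# leaf, modifying the structure in place.
sub trim_leaves {
    my ($node) = @_;
    if (ref $node eq 'HASH') {
        for my $key (keys %$node) {
            if (ref $node->{$key}) {
                trim_leaves($node->{$key});
            }
            elsif (defined $node->{$key}) {
                $node->{$key} =~ s/\A\s+|\s+\z//g;
            }
        }
    }
    elsif (ref $node eq 'ARRAY') {
        for my $item (@$node) {    # $item aliases the array element
            if (ref $item) {
                trim_leaves($item);
            }
            elsif (defined $item) {
                $item =~ s/\A\s+|\s+\z//g;
            }
        }
    }
    return $node;
}

# Hand-copied fragment of the simplified structure shown above.
my $simplified = {
    author => { 830549 => { content => "\nhaukex" } },
    type   => { 11     => { content => "\nnote" } },
};

trim_leaves($simplified);
print $simplified->{author}{830549}{content}, "\n";
```

In the script above it could be applied as trim_leaves( $twig->parse($data)->simplify() ) before dumping.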

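Question:

do the newlines serve any purpose or is it a limitation from XML::Fling (no link, couldn't find it on CPAN) ?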

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice

Replies are listed 'Best First'.
Re: XML::Twig and the monasteries XML
by choroba (Cardinal) on May 31, 2019 at 08:28 UTC
    I would definitely use XML::LibXML or XML::XSH2. But they work differently to XML::Twig or XML::Simple. It's best to work directly on the DOM object instead of creating a structure that you need to serialize back to XML later.

    It's a bit garrulous, but something like this creates the same structure from the ;xmlstyle=flat output:

    #! /usr/bin/perl
    use strict;
    use warnings;

    use Data::Dump qw{ pp };
    use XML::LibXML;

    sub with_id {
        my ($dom, $xpath) = @_;
        return {
            name => $dom->findvalue("normalize-space($xpath)"),
            id   => $dom->findvalue("$xpath/\@id"),
        }
    }

    my $dom = 'XML::LibXML'->load_xml(IO => *DATA);

    my %node = (
        author => with_id($dom, '/node/author'),
        type   => with_id($dom, '/node/type'),
    );

    @node{qw{ created updated title }}
        = @{ $dom->findnodes('/node')->[0] }{qw{ created updated title }};

    @node{qw{ content parent root reputation }}
        = map $dom->findvalue("/node/$_"),
          qw( doctext parent_node root_node reputation );

    print pp \%node;

    __DATA__
    ...

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Many thanks I'll try it out! :)

      Cheers Rolf

Re: XML::Twig and the monasteries XML
by holli (Abbot) on May 30, 2019 at 23:01 UTC
    XML::Twig ISA XML::Parser, so it doesn't touch the content by default, but you can add an input filter.


    holli

    You can lead your users to water, but alas, you cannot drown them.
      Thanks, defining an input_filter did the trick °

      $twig = XML::Twig->new(
          #discard_spaces => 1,
          input_filter => sub { $_[0] =~ s/^\n//r },
      );

      Also found discard_spaces but couldn't make it work though.

      updated

      or rather not !

      I checked all calls to the callback, and the input filter is run on each line of input.

      I.e. it will also discard newlines in the middle of a content.

      Cheers Rolf

Re: XML::Twig and the monasteries XML (SOLVED)
by LanX (Saint) on May 31, 2019 at 00:38 UTC
    > do the newlines serve any purpose or is it a limitation from XML::Fling (no link, couldn't find it on CPAN) ?

    I was able to solve it by reading our XML docs

    The default XML is older and has various properties that make it easier to parse with a regex, but conversely harder to handle with normal XML manipulation tools. This has been resolved by adding the xmlstyle=clean and xmlstyle=flat settings.

    Adding xmlstyle=clean solves the issue:

    https://perlmonks.org/?displaytype=xml;node_id=11100756;xmlstyle=clean

    Cheers Rolf

Re: XML::Twig and the monasteries XML
by Jenda (Abbot) on Jun 02, 2019 at 18:41 UTC

    You might want to try a different module, XML::Rules.

    use strict;
    use XML::Rules;
    use Data::Dumper qw(Dumper);

    my $parser = XML::Rules->new(
        stripspaces => 15,
        rules => {
            'type,author' => sub { return ( $_[0].'Id' => $_[1]->{id}, $_[0] => $_[1]->{_content}); },
            field => sub { return ( $_[1]->{name} => $_[1]->{_content}); },
            data => 'pass no content',
            node => 'pass no content'
        }
    );
    print Dumper($parser->parse(\*DATA));

    __DATA__
    <?xml version="1.0" encoding="Windows-1252"?>
    <node id="11100665" title="Re^5: What does $_ = qq~&quot;$_&quot;~ do?" created="2019-05-28 16:28:57" updated="2019-05-28 16:28:57">
    <type id="11">
    note</type>
    <author id="830549">
    haukex</author>
    <data>
    <field name="doctext">
    &lt;p&gt;More fun facts! I once wrote a script to search a word list for words that make valid regexen which convert one valid word into another.&lt;/p&gt;
    &lt;c&gt;
    $ perl -le 'print bangs =~s engender'
    bands
    $ perl -le 'print halved =~s avatar'
    halted
    $ perl -le 'print stove =~s evener'
    stone
    &lt;/c&gt;
    </field>
    <field name="root_node">
    11100593</field>
    <field name="parent_node">
    11100640</field>
    <field name="reputation">
    21</field>
    </data>
    </node>

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Thanks, will have a look at it! :)

      Cheers Rolf

      Enoch was right wing! ;P

Re: XML::Twig and the monasteries XML ( normalize_space normalise_space )
by Anonymous Monk on May 31, 2019 at 03:02 UTC
    normalise_space? That seems like the ticket
      > normalise_space? That seems like the ticket

      nope

      Cheers Rolf

        > nope

        Works for me, no leading newlines

        {
          author => { 830549 => { content => "haukex" } },
          created => "2019-05-28 16:28:57",
          data => {
            field => {
              doctext => {
                content => "<p>More fun facts! I once wrote a script to search a word list for words that make valid regexen which convert one valid word into another.</p> <c> \$ perl -le 'print bangs =~s engender' bands \$ perl -le 'print halved =~s avatar' halted \$ perl -le 'print stove =~s evener' stone </c>",
              },
              parent_node => { content => 11100640 },
              reputation => { content => 21 },
              root_node => { content => 11100593 },
            },
          },
          title => "Re^5: What does \$_ = qq~\"\$_\"~ do?",
          type => { 11 => { content => "note" } },
          updated => "2019-05-28 16:28:57",
        }
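
The output above has the newlines inside doctext collapsed into single spaces, which is how XML::Simple's NormaliseSpace => 2 behaves. A minimal sketch of a call that could produce such a result, assuming that simplify accepts normalise_space with the same meaning and that 2 is the value that was used (the post does not show the actual call). The XML is a cut-down, hand-made version of the node from the question:

```perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper qw(Dumper);

$Data::Dumper::Sortkeys = 1;

# Cut-down sample in the default PerlMonks XML layout: the text
# starts on the line after the opening tag, hence the leading newline.
my $xml = join "\n",
    '<node id="11100665" created="2019-05-28 16:28:57">',
    '<type id="11">',
    'note</type>',
    '<author id="830549">',
    'haukex</author>',
    '<data>',
    '<field name="reputation">',
    '21</field>',
    '<field name="root_node">',
    '11100593</field>',
    '</data>',
    '</node>';

my $twig = XML::Twig->new();

# Assumption: normalise_space => 2 trims and collapses whitespace in
# all text content, as NormaliseSpace => 2 does in XML::Simple.
my $simple = $twig->parse($xml)->simplify( normalise_space => 2 );

print Dumper($simple);
```

Because this setting also rewrites whitespace inside the text, it is a poor fit when the line breaks within doctext (for example inside the code sample) need to survive.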