basalto has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing a big XML file with the help of the XML::Twig module. To speed up the script I'm trying to ignore some parts of the file using twig_roots and ignore_elts. My problem arises because all the unwanted parts are children of the same kind of parent tag (<data> in the example).

<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
  <data>
    <type>yyy</type>
    <vars>b</vars>
  </data>
</doc>

Can anyone show me how to ignore <data> elements that have <type> equal to 'yyy'?

Thanks, Ricardo

Replies are listed 'Best First'.
Re: Ignore elements using twig module (->ignore)
by ikegami (Patriarch) on Feb 23, 2008 at 10:01 UTC
    Ooh! I just found a much better way! ignore was made for this very purpose.

    use XML::Twig qw( );

    sub twig_data_type {
        my ($twig, $ele) = @_;
        $ele->parent()->ignore()
            if $ele->trimmed_text() eq 'yyy';
        return 1;
    }

    my $twig = XML::Twig->new(
        twig_handlers => {
            'data/type' => \&twig_data_type,
        },
        # Output will be nicely formatted, but not necessarily valid.
        pretty_print => 'indented',
    );
    $twig->parse(\*DATA);
    $twig->print();

    __DATA__
    <doc>
      <data>
        <type>xxx</type>
        <vars>a</vars>
      </data>
      <data>
        <type>yyy</type>
        <vars>b</vars>
      </data>
    </doc>

    <doc>
      <data>
        <type>xxx</type>
        <vars>a</vars>
      </data>
    </doc>
Re: Ignore elements using twig module
by GrandFather (Saint) on Feb 23, 2008 at 10:08 UTC

    I'm not sure how much the following will speed up processing, but it does filter the unwanted elements:

    use strict;
    use warnings;
    use XML::Twig;

    my $xml = <<XML;
    <doc>
      <data>
        <type>xxx</type>
        <vars>a</vars>
      </data>
      <data>
        <type>yyy</type>
        <vars>b</vars>
      </data>
    </doc>
    XML

    my $root = XML::Twig->new (twig_handlers => {data => \&handler});
    $root->parse ($xml);

    sub handler {
        my $elt = $_;
        return if $elt->children (\&badType);
        print "Handling ", $elt->text (), "\n";
    }

    sub badType {
        return $_->text () =~ /^yyy/;
    }

    Prints:

    Handling xxxa

    Perl is environmentally friendly - it saves trees
      It seems that the ignore() method is what I'm looking for to stop parsing and delete the current <data> twig.

      I'm going to try it in my script and I'll come back as soon as I have results.

      GrandFather, your sample could be handy, but I don't think it helps in this specific case, because I need to stop parsing and purge the current twig as soon as the type element is matched.

      To make things clearer, here is a better sample. My XML file has thousands of <container> elements, with thousands of text elements to extract and import into a database. My idea is to "twig" all <container> elements, but to speed things up I need to exclude containers that match certain types. To make it more difficult, <container> elements can be nested.

      <container>
        <attribute>
          <type>xxx</type>
        </attribute>
        <data>
          <var1>a</var1>
          <var2>b</var2>
        </data>
      </container>
      <container>
        <attribute>
          <type>yyy</type>
        </attribute>
        <variables>
          <var1>a</var1>
          <var2>b</var2>
        </variables>
      </container>
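
For the <container> layout above, the ignore() approach from the reply at the top of the thread might be adapted roughly as follows. This is only a sketch under stated assumptions: the <doc> wrapper is hypothetical (added so the sample has a single root), and the handler path 'container/attribute/type' assumes the type always sits inside <attribute>. It relies on parent() accepting an optional condition, so that parent('container') returns the nearest enclosing <container>.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: the <doc> wrapper is an assumption, added so the
# sample has a single root element.
my $xml = <<'XML';
<doc>
  <container>
    <attribute><type>xxx</type></attribute>
    <data><var1>a</var1><var2>b</var2></data>
  </container>
  <container>
    <attribute><type>yyy</type></attribute>
    <variables><var1>a</var1><var2>b</var2></variables>
  </container>
</doc>
XML

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once </type> has been parsed, i.e. before the rest of
        # the enclosing <container> has been read.
        'container/attribute/type' => sub {
            my ($t, $type) = @_;
            # parent('container') is the nearest enclosing <container>.
            $type->parent('container')->ignore()
                if $type->trimmed_text() eq 'yyy';
            return 1;
        },
    },
);
$twig->parse($xml);

# List the types of the top-level containers that are still in the tree.
my @kept = map { $_->first_child('attribute')->first_child_text('type') }
           $twig->root()->children('container');
print "kept: @kept\n";
```

Because the nearest enclosing <container> is the one ignored, a nested container of an unwanted type should be dropped without discarding its outer container, while ignoring an outer container also skips everything nested inside it. That behaviour is worth checking against real data before relying on it.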
        Hi,

        Sorry for the delay, but I don't have much time to spend coding. This is not my job; I'm doing this just to gain some skills in processing XML data.

        Concerning my initial question, I can say that after I applied the ignore() method in my program, processing time dropped substantially, as expected. For a 270 MB input file, parsing time fell by 33% (the initial code took 9m24s; it now takes only 6m16s).

        Thank you for your support.

        Ricardo

Re: Ignore elements using twig module
by ikegami (Patriarch) on Feb 23, 2008 at 02:25 UTC
    I'm not familiar with XML::Twig, so I don't know if those paths are XPaths, but if they are, you could use child::type[text()!="yyy"]/.. (if <data> is the context node) or //type[text()!="yyy"]/.. (independent of context).

    Update: I couldn't leave it at that, so I looked deeper...

    Foiled! The docs say "XPath expressions are limited to using the child and descendant axis". While it also handles the attribute axis (via "@"), I verified that the parent axis (even via "..") isn't supported. It doesn't seem to understand text() either, which also requires advance knowledge of the node's children.

    All in all, that makes sense. How can you skip parsing something based on something that hasn't been parsed yet?

    A better schema for your needs would have been

    <doc>
      <data type="xxx">
        <vars>a</vars>
      </data>
      <data type="yyy">
        <vars>b</vars>
      </data>
    </doc>

    Then it would be easy:

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_roots => {
            'data[@type != "yyy"]' => 1,
        },
    );
    $twig->parse(\*DATA);
    $twig->print();

    __DATA__
    <doc>
      <data type="xxx">
        <vars>a</vars>
      </data>
      <data type="yyy">
        <vars>b</vars>
      </data>
    </doc>

    <doc><data type="xxx"><vars>a</vars></data></doc>

    (!= didn't work with v3.26, but worked after upgrading to v3.32)
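
For versions where != is not available in the condition, one possible alternative with the attribute-based schema is to test the attribute in a start_tag_handlers callback and call ignore() there. This is a sketch, not something tested in the thread; it assumes only that attributes are available when the start tag handler runs, which is what makes skipping possible before any child is parsed.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<doc>
  <data type="xxx"><vars>a</vars></data>
  <data type="yyy"><vars>b</vars></data>
</doc>
XML

my $twig = XML::Twig->new(
    start_tag_handlers => {
        # Attributes are already known when the start tag is seen, so the
        # element can be skipped before any of its content is parsed.
        'data' => sub {
            my ($t, $data) = @_;
            my $type = $data->att('type');
            $data->ignore() if defined $type && $type eq 'yyy';
            return 1;
        },
    },
);
$twig->parse($xml);

# Collect the type attribute of every <data> element left in the tree.
my @types = map { $_->att('type') } $twig->root()->children('data');
print "remaining: @types\n";
```

Unlike twig_roots, this builds the whole tree apart from the ignored elements, so it may use more memory on large files.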

      If type is defined as an attribute it works, but changing the structure is not an option. Thanks anyway.
Re: Ignore elements using twig module
by ikegami (Patriarch) on Feb 23, 2008 at 09:51 UTC
    How's this?
    use XML::Twig qw( );

    # Assumptions
    # - "data" elements can't be nested.
    # - Only one twig instance is used at a time.

    my $prune_data;

    sub twig_data_start {
        my ($twig, $ele) = @_;
        $prune_data = 0;
        return 1;
    }

    sub twig_data_type {
        my ($twig, $ele) = @_;
        $prune_data = 1
            if $ele->trimmed_text() eq 'yyy';
        return 1;
    }

    sub twig_data {
        my ($twig, $ele) = @_;
        return 1 if !$prune_data;
        $prune_data = 0;
        $ele->delete();
        return 0;
    }

    my $twig = XML::Twig->new(
        start_tag_handlers => {
            'data' => \&twig_data_start,
        },
        twig_handlers => {
            'data'      => \&twig_data,
            'data/type' => \&twig_data_type,
        },
        # Output will be nicely formatted, but not necessarily valid.
        pretty_print => 'indented',
    );
    $twig->parse(\*DATA);
    $twig->print();

    __DATA__
    <doc>
      <data>
        <type>xxx</type>
        <vars>a</vars>
      </data>
      <data>
        <type>yyy</type>
        <vars>b</vars>
      </data>
    </doc>

    <doc>
      <data>
        <type>xxx</type>
        <vars>a</vars>
      </data>
    </doc>