Ignore elements using twig module

basalto has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Ignore elements using twig module (->ignore) by ikegami (Patriarch) on Feb 23, 2008 at 10:01 UTC
Oo! I just found a much better way! `ignore` was made for this very purpose.

```perl
use XML::Twig qw( );

sub twig_data_type {
    my ($twig, $ele) = @_;
    $ele->parent()->ignore()
        if $ele->trimmed_text() eq 'yyy';
    return 1;
}

my $twig = XML::Twig->new(
    twig_handlers => {
        'data/type' => \&twig_data_type,
    },
    # Output will be nicely formatted, but not necessarily valid.
    pretty_print => 'indented',
);

$twig->parse(\*DATA);
$twig->print();

__DATA__
<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
  <data>
    <type>yyy</type>
    <vars>b</vars>
  </data>
</doc>
```

Output:

```xml
<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
</doc>
```
Re: Ignore elements using twig module by GrandFather (Saint) on Feb 23, 2008 at 10:08 UTC
I'm not sure how much the following will speed up processing, but it does filter the unwanted elements:

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<XML;
<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
  <data>
    <type>yyy</type>
    <vars>b</vars>
  </data>
</doc>
XML

my $root = XML::Twig->new (twig_handlers => {data => \&handler});
$root->parse ($xml);

sub handler {
    my $elt = $_;
    return if $elt->children (\&badType);
    print "Handling ", $elt->text (), "\n";
}

sub badType {
    return $_->text () =~ /^yyy/;
}
```

Prints:

```
Handling xxxa
```

Perl is environmentally friendly - it saves trees
Re^2: Ignore elements using twig module by basalto (Beadle) on Feb 23, 2008 at 11:38 UTC
It seems that the ignore() method is what I'm looking for to stop and delete the current <data> twig. I'm going to try it in my script and I'll come back as soon as I have results. GrandFather, your sample could be handy, but I don't think it helps in this specific case, because I need to stop and purge the current twig as soon as the type element is matched.

To make things clearer, here is a better sample. My XML file has thousands of <container> elements with thousands of text elements to extract and import into a database. My idea is to "twig" all <container> elements, but to speed things up I need to exclude containers that match certain types. To make it more difficult, <container> elements can be nested.

```xml
<container>
  <attribute>
    <type>xxx</type>
  </attribute>
  <data>
    <var1>a</var1>
    <var2>b</var2>
  </data>
</container>
<container>
  <attribute>
    <type>yyy</type>
  </attribute>
  <variables>
    <var1>a</var1>
    <var2>b</var2>
  </variables>
</container>
```
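As a rough sketch only, the `ignore()` approach from ikegami's reply might be adapted to this layout along the following lines. It assumes the `<type>` always sits at `container/attribute/type` and appears before the bulk of the container's content. The `<doc>` wrapper, the handler name `skip_unwanted_container`, and the `%skip_type` hash are hypothetical, introduced purely for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical set of type values whose containers should be skipped.
my %skip_type = ( yyy => 1 );

sub skip_unwanted_container {
    my ($twig, $type) = @_;
    # parent('container') returns the nearest enclosing <container>;
    # ignore() stops building it and discards what was already built.
    $type->parent('container')->ignore()
        if $skip_type{ $type->trimmed_text() };
    return 1;
}

my $twig = XML::Twig->new(
    twig_handlers => {
        'container/attribute/type' => \&skip_unwanted_container,
    },
    pretty_print => 'indented',
);

# Assumption: a hypothetical <doc> root wraps the sample containers.
my $xml = <<'XML';
<doc>
  <container>
    <attribute><type>xxx</type></attribute>
    <data><var1>a</var1><var2>b</var2></data>
  </container>
  <container>
    <attribute><type>yyy</type></attribute>
    <variables><var1>a</var1><var2>b</var2></variables>
  </container>
</doc>
XML

$twig->parse($xml);

# Serialize what is left of the document after the parse.
my $result = $twig->sprint();
print $result;
```

One thing to keep in mind with nested containers: ignoring an outer `<container>` also drops every `<container>` nested inside it, so this only fits if that is the intended behaviour.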
Re^3: Ignore elements using twig module by basalto (Beadle) on Mar 02, 2008 at 00:32 UTC
Hi,

Sorry for the delay, but I don't have much time to spend coding. This is not my job; I'm doing this just to build some skills in processing XML data.

Concerning my initial question: after applying the ignore() method in my program, processing time dropped considerably, as expected. Parsing time fell by 33% on a 270 MB input file (the initial code took 9m24s; it now takes only 6m16s).

Thank you for your support.

Ricardo
Re: Ignore elements using twig module by ikegami (Patriarch) on Feb 23, 2008 at 02:25 UTC
I'm not familiar with XML::Twig, so I don't know if those paths are XPaths, but if they are, you could use `child::type[text()!="yyy"]/..` (if `<data>` is the context node) or `//type[text()!="yyy"]/..` (independent of context).

Update: I couldn't leave it at that, so I looked deeper...

Foiled! The docs say "XPath expressions are limited to using the child and descendant axis". While it also handles the attribute axis (via "@"), I verified that the parent axis (even via "..") isn't supported. It doesn't seem to understand `text()` either, which also requires advance knowledge of the node's children.

All in all, that makes sense. How can you skip parsing something based on something that hasn't been parsed yet?

A better schema for your needs would have been

```xml
<doc>
  <data type="xxx">
    <vars>a</vars>
  </data>
  <data type="yyy">
    <vars>b</vars>
  </data>
</doc>
```

Then it would be easy:

```perl
use XML::Twig;

my $twig = XML::Twig->new(
    twig_roots => {
        'data[@type != "yyy"]' => 1,
    },
);

$twig->parse(\*DATA);
$twig->print();

__DATA__
<doc>
  <data type="xxx">
    <vars>a</vars>
  </data>
  <data type="yyy">
    <vars>b</vars>
  </data>
</doc>
```

Output:

```xml
<doc><data type="xxx"><vars>a</vars></data></doc>
```

(`!=` didn't work with v3.26, but worked after upgrading to v3.32)
Re^2: Ignore elements using twig module by basalto (Beadle) on Feb 23, 2008 at 08:41 UTC
If type is defined as an attribute, it works, but changing the structure is not an option. Thanks anyway.
Re: Ignore elements using twig module by ikegami (Patriarch) on Feb 23, 2008 at 09:51 UTC
How's this?

```perl
use XML::Twig qw( );

# Assumptions
# - "data" elements can't be nested.
# - Only one twig instance is used at a time.

my $prune_data;

sub twig_data_start {
    my ($twig, $ele) = @_;
    $prune_data = 0;
    return 1;
}

sub twig_data_type {
    my ($twig, $ele) = @_;
    $prune_data = 1
        if $ele->trimmed_text() eq 'yyy';
    return 1;
}

sub twig_data {
    my ($twig, $ele) = @_;
    return 1 if !$prune_data;
    $prune_data = 0;
    $ele->delete();
    return 0;
}

my $twig = XML::Twig->new(
    start_tag_handlers => {
        'data' => \&twig_data_start,
    },
    twig_handlers => {
        'data'      => \&twig_data,
        'data/type' => \&twig_data_type,
    },
    # Output will be nicely formatted, but not necessarily valid.
    pretty_print => 'indented',
);

$twig->parse(\*DATA);
$twig->print();

__DATA__
<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
  <data>
    <type>yyy</type>
    <vars>b</vars>
  </data>
</doc>
```

Output:

```xml
<doc>
  <data>
    <type>xxx</type>
    <vars>a</vars>
  </data>
</doc>
```