use strict;
use XML::Rules;

my $parser = XML::Rules->new(
    # tags listed here are skipped together with everything inside them
    start_rules => [
        'PC-InfoData_urn,PC-Compound_atoms' => 'skip',
    ],
    rules => [
        _default => 'as is',
        'PC-CompoundType_id_cid,PC-InfoData_value_binary' => 'content',
        'PC-InfoData' => sub {
            return unless $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
            return '@InfoData' => $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
        },
        'PC-Compound' => sub {
            my $id = $_[1]->{'PC-Compound_id'}{'PC-CompoundType'}{'PC-CompoundType_id'}{'PC-CompoundType_id_cid'}
                or return; # no ID found
            my $data = $_[1]->{'PC-Compound_props'}{'InfoData'}
                or return; # no data
            return $id => $data;
        },
        'PC-Compounds' => 'pass',
    ],
    stripspaces => 7,
);

my $data = $parser->parse(\*DATA);

use Data::Dumper;
print Dumper($data);

__DATA__
<PC-Compounds>
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
    <PC-Compound_atoms>
    </PC-Compound_atoms>
    <PC-Compound_props>
      <PC-InfoData>
      </PC-InfoData>
      <PC-InfoData>
        <PC-InfoData_urn>
          <PC-Urn>
          </PC-Urn>
        </PC-InfoData_urn>
        <PC-InfoData_value>
          <PC-InfoData_value_binary>00000371E07238000000000000000000000000000000
000000000000000000000000000000000001E00000000000814E1800602080300040008000090080
0000000000000000001080000020014008000070000052000100000240000000000000000000000
0000000000000000000000000000</PC-InfoData_value_binary>
        </PC-InfoData_value>
      </PC-InfoData>
      <PC-InfoData>
      </PC-InfoData>
    </PC-Compound_props>
  </PC-Compound>
</PC-Compounds>
If the XML really looks like this, the code gives you a reference to a hash of arrays: the keys are the values of <PC-CompoundType_id_cid> and the values are arrays of the <PC-InfoData_value_binary> values. It also copes with <PC-Compound>s that lack the ID or the <PC-InfoData_value_binary>; the rule returns nothing for them, so they are simply left out of the result.
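To make the shape of the result concrete, here is a minimal sketch of walking such a structure. The $data below is a hand-built stand-in with the shape described above, not real parser output: the hex strings and the second ID are made up for illustration. With the real parser you would use the value returned by $parser->parse(...) instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser's result:
# compound ID => array ref of binary values (shortened, invented strings).
my $data = {
    1 => [ '00000371E0723800', '0000001E00000000' ],
    2 => [ '0000000000000814' ],
};

my @report;
for my $cid (sort { $a <=> $b } keys %$data) {
    my $values = $data->{$cid};              # array ref of binary strings
    for my $i (0 .. $#$values) {
        push @report, sprintf "CID %s value %d: %s", $cid, $i, $values->[$i];
    }
}
print "$_\n" for @report;
```

Sorting the keys numerically is only for readable output; a hash has no inherent order.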
If there's more data in the file, you can add more tags to the skip list in start_rules, and perhaps add special rules for some of the tags between <PC-Compound> and <PC-CompoundType_id_cid> or <PC-InfoData_value_binary>, to get rid of the child tags and attributes you are not interested in.
The rules specify which data from each branch is kept and which is forgotten. 'content' means we want only the textual content of a tag (no attributes, no subtags), 'as is' means "remember all data", and 'pass' means remove that tag and add all its data to its parent tag (similar to transforming <R><sub><a>aaa</a><b>bbb</b></sub><c>ccc</c></R> into <R><a>aaa</a><b>bbb</b><c>ccc</c></R>). The rule for <PC-InfoData> forgets all tags that do not contain the data we are interested in, and otherwise adds the data to an array named InfoData within the data of <PC-Compound_props>. Finally, the rule for <PC-Compound> takes the ID from several tags below it, takes the data, and, if both are present, adds them to its parent tag's data, using the ID as the key (normally an attribute or subtag name) and the data as the value.
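The flow through the last two rules can be traced without the parser. The sketch below replays them on hand-built hashes that have the shape the rules above index into. info_data_rule, compound_rule and the sample values are hypothetical names chosen for illustration, and the push is only a stand-in for what the '@InfoData' return key asks XML::Rules to do; this is not how the module is implemented.

```perl
use strict;
use warnings;

# Stand-in for the <PC-InfoData> rule: given the tag's hash, return either
# nothing or the key/value pair to hand to the parent tag.
sub info_data_rule {
    my ($attrs) = @_;
    my $bin = $attrs->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
    return unless $bin;
    return ('@InfoData' => $bin);
}

# Stand-in for the <PC-Compound> rule: needs both the ID and the data.
sub compound_rule {
    my ($attrs) = @_;
    my $id = $attrs->{'PC-Compound_id'}{'PC-CompoundType'}{'PC-CompoundType_id'}{'PC-CompoundType_id_cid'}
        or return;                                    # no ID found
    my $data = $attrs->{'PC-Compound_props'}{'InfoData'}
        or return;                                    # no data
    return ($id => $data);
}

# Three <PC-InfoData> tags; only the second carries a (made-up) binary value.
my @info_tags = (
    {},
    { 'PC-InfoData_value' => { 'PC-InfoData_value_binary' => '00000371E07238' } },
    {},
);

my %props;    # plays the role of the <PC-Compound_props> hash
for my $tag (@info_tags) {
    my ($key, $value) = info_data_rule($tag) or next; # empty tags are forgotten
    $key =~ s/^\@//;                                  # '@InfoData' => push onto InfoData
    push @{ $props{$key} }, $value;
}

my %result = compound_rule({
    'PC-Compound_id' => {
        'PC-CompoundType' => {
            'PC-CompoundType_id' => { 'PC-CompoundType_id_cid' => 1 },
        },
    },
    'PC-Compound_props' => \%props,
});

printf "%s => [%s]\n", $_, join(', ', @{ $result{$_} }) for keys %result;
```

Calling compound_rule on a hash without an ID or without InfoData returns an empty list, which is why such compounds never reach the final hash.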
And the great thing is that before you start parsing the next <PC-Compound>, the only things from the previous one still in memory are the ID and the binary value you are interested in.
HTH, Jenda
Support Denmark! Defend the free world!
In reply to Re: 1GB XML mining with XML:twig (newbies question)
by Jenda
in thread 1GB XML mining with XML:twig (newbies question)
by karpatov