use strict;
use XML::Rules;

my $parser = XML::Rules->new(
    start_rules => [
        'PC-InfoData_urn,PC-Compound_atoms' => 'skip',
    ],
    rules => [
        _default => 'as is',
        'PC-CompoundType_id_cid,PC-InfoData_value_binary' => 'content',
        'PC-InfoData' => sub {
            return unless $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
            return '@InfoData' => $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
        },
        'PC-Compound' => sub {
            my $id = $_[1]->{'PC-Compound_id'}{'PC-CompoundType'}{'PC-CompoundType_id'}{'PC-CompoundType_id_cid'}
                or return; # no ID found
            my $data = $_[1]->{'PC-Compound_props'}{'InfoData'}
                or return; # no data
            return $id => $data;
        },
        'PC-Compounds' => 'pass',
    ],
    stripspaces => 7,
);

my $data = $parser->parse(\*DATA);

use Data::Dumper;
print Dumper($data);

__DATA__
<PC-Compounds>
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
    <PC-Compound_atoms>
    </PC-Compound_atoms>
    <PC-Compound_props>
      <PC-InfoData>
      </PC-InfoData>
      <PC-InfoData>
        <PC-InfoData_urn>
          <PC-Urn>
          </PC-Urn>
        </PC-InfoData_urn>
        <PC-InfoData_value>
          <PC-InfoData_value_binary>00000371E0723800000000000000000000000000000000000000000000000000000000000000000000001E00000000000814E1800602080300040008000090080000000000000000000108000002001400800007000005200010000024000000000000000000000000000000000000000000000000000</PC-InfoData_value_binary>
        </PC-InfoData_value>
      </PC-InfoData>
      <PC-InfoData>
      </PC-InfoData>
    </PC-Compound_props>
  </PC-Compound>
</PC-Compounds>

If the XML really looks like this, then this code will give you a reference to a hash of arrays. The keys of the hash are the values of <PC-CompoundType_id_cid>, and the values of the hash are arrays holding the contents of each <PC-InfoData_value_binary> found in that compound. A <PC-Compound> without an ID or without any <PC-InfoData_value_binary> is handled too: it is simply left out of the result.
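Assuming the result has the hash-of-arrays shape just described, walking it could look like the sketch below. The $data literal here is a made-up stand-in with shortened values, not real parser output, and the report format is only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser result:
# compound ID => array ref of binary value strings (shortened, made up).
my $data = {
    1 => ['00000371E07238'],
    2 => ['0000000000', '1E00000000'],
};

# Build one report line per compound, in numeric ID order.
my @report;
for my $cid (sort { $a <=> $b } keys %$data) {
    my @values = @{ $data->{$cid} };
    push @report, sprintf(
        "CID %s: %d value(s), lengths %s",
        $cid, scalar(@values), join(',', map { length } @values),
    );
}

print "$_\n" for @report;
```

Since every value is an array reference, a compound with several <PC-InfoData_value_binary> tags needs no special handling in the loop.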

If there's more data in the file, you can add more tags to the skip list in start_rules. You may also want to add special rules for some of the tags between <PC-Compound> and <PC-CompoundType_id_cid> or <PC-InfoData_value_binary>, to get rid of the child tags and attributes you are not interested in.

The rules specify what data from each branch is to be kept and what data is to be forgotten. 'content' means we want only the textual content of a tag (no attributes, no subtags). 'as is' means 'remember all data'. 'pass' means to remove that tag and add all its data into its parent tag (similar to transforming <R><sub><a>aaa</a><b>bbb</b></sub><c>ccc</c></R> to <R><a>aaa</a><b>bbb</b><c>ccc</c></R>). The rule for <PC-InfoData> forgets every <PC-InfoData> that does not contain the data we are interested in; otherwise it pushes the data onto an array named InfoData within the <PC-Compound_props>'s data. Finally, the rule for <PC-Compound> takes the ID from several tags below and takes the data, and if both are present it adds them to its parent tag's data, using the ID as the key (normally an attribute or subtag name) and the data as the value.
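To see 'pass' and 'content' in isolation, here is a minimal sketch run on the small <R> example from the paragraph above. It assumes XML::Rules is installed; treating <R> itself with 'pass' mirrors what the main script does with <PC-Compounds>, so the parse should hand back a flat hash reference without a 'sub' level.

```perl
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

my $parser = XML::Rules->new(
    rules => [
        'a,b,c' => 'content',   # keep only the text of these tags
        'sub'   => 'pass',      # drop the <sub> level, hand its data to <R>
        'R'     => 'pass',      # root: return the collected data itself
    ],
    stripspaces => 7,
);

my $xml  = '<R><sub><a>aaa</a><b>bbb</b></sub><c>ccc</c></R>';
my $flat = $parser->parse($xml);

# <a> and <b> should sit next to <c>, with no trace of <sub>.
print Dumper($flat);
```

Changing the rule for 'sub' to 'as is' should keep the nesting instead, which makes the difference between the two built-in rules easy to compare.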

And the great thing is that before you start parsing the next <PC-Compound>, the only things from the previous one still in memory are the ID and the binary value you are interested in.


In reply to Re: 1GB XML mining with XML:twig (newbies question) by Jenda
in thread 1GB XML mining with XML:twig (newbies question) by karpatov
