The trouble is that extracting the records of interest (PC-CompoundType_id_cid, PC-InfoData_value_binary) the way I do can lead to trouble when the item for some compound is missing
...
    my $root = $twig->root;
    my @cpds = $root->children_text($field1);
    my @bins = $root->children_text($field2);
Let me guess: if a PC-Compound block contains a "Type_id_cid" thingie, but no "InfoData_value_binary" thingie -- or vice-versa -- then your two arrays (@cpds and @bins) are not alignable. Is that it?
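To make that guess concrete, here is a minimal sketch of the misalignment. The three records and their values are made up for illustration, and plain hashes stand in for the parsed XML elements; the point is only that collecting each field into its own array loses the link between a compound and its value.

```perl
use strict;
use warnings;

# Hypothetical data: the second compound has no binary value.
my @records = (
    { cid => 1, bin => 'AAAA' },
    { cid => 2 },
    { cid => 3, bin => 'CCCC' },
);

# Collect each field separately, the way two children_text() calls would.
my @cpds = map { $_->{cid} } grep { exists $_->{cid} } @records;
my @bins = map { $_->{bin} } grep { exists $_->{bin} } @records;

# Pairing by index now goes wrong: CID 2 is paired with the value
# that belongs to CID 3, and CID 3 gets nothing at all.
for my $i ( 0 .. $#cpds ) {
    printf "%s => %s\n", $cpds[$i], $bins[$i] // '(none)';
}
```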

So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (esp. considering the size of the input file) would be to handle this matter at the stage of reading the data from the file.

Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):

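    my $filename = "whatever...";
    {
        open( my $fh, "<", $filename ) or die "$filename: $!";
        local $/ = "</PC-Compound>";    # read one entire PC-Compound block per iteration
        while (<$fh>) {
            # skip blocks that do not contain both elements with some content
            next unless ( m{\w\s*</PC-CompoundType_id_cid}
                      and m{\w\s*</PC-InfoData_value_binary} );
            # remove anything that precedes the block; /s lets "." match newlines,
            # and \b keeps a longer tag name such as "PC-Compounds" from matching
            s/^.*?(?=<PC-Compound\b)//s;
            process_compound( $_ );
        }
    }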
(update: added "(?=" and ")" in the s/// statement -- need to use positive look-ahead there.)

The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.
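As a rough sketch of what such a sub might look like with XML::Twig (this is only one way to do it, and the return convention of a CID plus an array ref of binary values is my own assumption, not something taken from your code):

```perl
use strict;
use warnings;
use XML::Twig;

# Parse one <PC-Compound>...</PC-Compound> string and pull out the two
# fields of interest. Returns the CID and a reference to the list of
# binary values found in that block.
sub process_compound {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    my $root = $twig->root;

    my ($cid) = map { $_->text } $root->descendants('PC-CompoundType_id_cid');
    my @bins  = map { $_->text } $root->descendants('PC-InfoData_value_binary');

    $twig->purge;    # release the memory held by this small tree
    return ( $cid, \@bins );
}

# Made-up sample block, just to show the call.
my $sample = '<PC-Compound>'
    . '<PC-CompoundType_id_cid>42</PC-CompoundType_id_cid>'
    . '<PC-InfoData_value_binary>00FF</PC-InfoData_value_binary>'
    . '</PC-Compound>';

my ( $cid, $bins ) = process_compound($sample);
print "$cid\t@$bins\n";
```

Since each string holds just one compound, the tree built for it stays small no matter how big the input file is.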

Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?
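One possible answer, sketched below, is to keep every value and key them by CID in a hash of arrays, warning when a compound has more than one. Whether that is the right policy depends on what you need downstream; the sub name "record_compound" and the sample values are hypothetical.

```perl
use strict;
use warnings;

my %bins_for;    # CID => array ref of all binary values seen for it

# Store every binary value for a compound; returns how many it now has.
sub record_compound {
    my ( $cid, @bins ) = @_;
    push @{ $bins_for{$cid} }, @bins;
    return scalar @{ $bins_for{$cid} };
}

# Made-up values: the second compound has two binary fields.
record_compound( 101, 'AAAA' );
record_compound( 102, 'BBBB', 'CCCC' );

for my $cid ( sort { $a <=> $b } keys %bins_for ) {
    my @values = @{ $bins_for{$cid} };
    my $count  = @values;
    warn "CID $cid has $count binary values\n" if $count > 1;
    print join( "\t", $cid, @values ), "\n";
}
```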


In reply to Re: 1GB XML mining with XML:twig (newbies question) by graff
in thread 1GB XML mining with XML:twig (newbies question) by karpatov
