> The trouble is that extracting the records of interest (PC-CompoundType_id_cid, PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing
>
>     ...
>     my $root = $twig->root;
>     my @cpds = $root->children_text($field1);
>     my @bins = $root->children_text($field2);

Let me guess: if a PC-Compound block contains a "Type_id_cid" thingie, but no "InfoData_value_binary" thingie -- or vice-versa -- then your two arrays (@cpds and @bins) are not alignable. Is that it?
So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (esp. considering the size of the input file) would be to handle this matter at the stage of reading the data from the file.
Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):
(update: added "(?=" and ")" in the s/// statement -- need to use positive look-ahead there.)

    my $filename = "whatever...";
    {
        open( my $fh, "<", $filename ) or die "$filename: $!";
        local $/ = "</PC-Compound>";
        while (<$fh>) {    # read one entire PC-Compound block
            # skip blocks where either element is missing or empty
            next unless ( m{\w\s*</PC-CompoundType_id_cid}
                      and m{\w\s*</PC-InfoData_value_binary} );
            # remove anything that precedes the block; /s lets "." cross
            # newlines, and [\s>] keeps a longer tag name such as
            # <PC-Compounds> from matching
            s/^.*?(?=<PC-Compound[\s>])//s;
            process_compound( $_ );
        }
    }
The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.
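For illustration, here is one way such a sub could look with XML::Twig. This is only a sketch under assumptions: the body of process_compound is hypothetical (the name is the only part taken from the snippet above), it assumes the string holds exactly one well-formed PC-Compound element, and it returns the first CID followed by every binary value it finds.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical process_compound(): parse one <PC-Compound>...</PC-Compound>
# string and return the CID followed by all binary values inside it.
sub process_compound {
    my ($xml) = @_;

    my $twig = XML::Twig->new;
    $twig->parse($xml);    # parse() takes the XML as a string
    my $root = $twig->root;

    # descendants() searches at any depth, so the nesting between the
    # PC-Compound element and these leaf elements does not matter here.
    my @cids = map { $_->text } $root->descendants('PC-CompoundType_id_cid');
    my @bins = map { $_->text } $root->descendants('PC-InfoData_value_binary');

    # Trim surrounding whitespace, which may be present in the input.
    s/^\s+|\s+$//g for @cids, @bins;

    return ( $cids[0], @bins );
}
```

Called as `my ( $cid, @bins ) = process_compound($_);` inside the read loop, this keeps each CID together with its own binary values, so nothing has to be aligned afterwards.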
Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?
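If the answer turns out to be "keep all of them", one option is to store a list of values per CID instead of two parallel arrays. A minimal sketch using only core Perl; the %bins_for hash, the add_compound helper and the sample values are all made up for illustration:

```perl
use strict;
use warnings;

my %bins_for;    # hypothetical store: CID => array ref of binary values

# Record one compound; any number of binary values may follow the CID.
sub add_compound {
    my ( $cid, @bins ) = @_;
    push @{ $bins_for{$cid} }, @bins;
    return scalar @{ $bins_for{$cid} };    # number of values held for this CID
}

add_compound( 1001, 'AAAA' );            # made-up sample data
add_compound( 1002, 'BBBB', 'CCCC' );    # one CID, two binary values

# Print one tab-separated line per CID, in numeric CID order.
for my $cid ( sort { $a <=> $b } keys %bins_for ) {
    print join( "\t", $cid, @{ $bins_for{$cid} } ), "\n";
}
```

With this layout a compound that has several binary fields simply gets a longer list, and the CID-to-value association cannot drift out of step.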
In reply to Re: 1GB XML mining with XML:twig (newbies question)
by graff
in thread 1GB XML mining with XML:twig (newbies question)
by karpatov