in reply to 1GB XML mining with XML:twig (newbies question)
> The trouble is that extracting the records of interest (PC-CompoundType_id_cid, PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing
>
>     ...
>     my $root= $twig->root;
>     my @cpds= $root->children_text($field1);
>     my @bins= $root->children_text($field2);

Let me guess: if a PC-Compound block contains a "Type_id_cid" thingie, but no "InfoData_value_binary" thingie -- or vice-versa -- then your two arrays (@cpds and @bins) are not alignable. Is that it?
So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (esp. considering the size of the input file) would be to handle this matter at the stage of reading the data from the file.
Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):
    my $filename = "whatever...";
    {
        open( my $fh, "<", $filename ) or die "$filename: $!";
        local $/ = "</PC-Compound>";    # each read ends at a closing tag
        while (<$fh>) {    # read one entire PC-Compound block
            # skip blocks where either element is missing or empty
            next unless ( m{\w\s*</PC-CompoundType_id_cid}
                      and m{\w\s*</PC-InfoData_value_binary} );
            # remove anything that precedes the block; /s lets "." cross
            # newlines, and [\s>] avoids matching a longer tag name
            s/^.*?(?=<PC-Compound[\s>])//s;
            process_compound( $_ );
        }
    }

(update: added "(?=" and ")" in the s/// statement -- need to use positive look-ahead there.)
The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.
Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?
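If the answer turns out to be "keep them all", here is one way it might look. This is only a sketch under assumptions: the sub name "process_compound" comes from the snippet above, but the return shape (the cid plus an array ref of binary values) and the sample block are made up for illustration. The regex extraction is a stand-in for a real XML parser, and it assumes the two elements carry no attributes and hold plain text only.

```perl
use strict;
use warnings;

# Hypothetical process_compound: returns the cid and ALL binary values
# found in one PC-Compound block, so a block with several
# PC-InfoData_value_binary elements is not silently truncated.
# Regex extraction stands in for a real parser here; it assumes the
# elements have no attributes and contain plain text only.
sub process_compound {
    my ($block) = @_;

    # first (assumed only) cid in the block; undef if absent
    my ($cid) = $block
        =~ m{<PC-CompoundType_id_cid>\s*(.*?)\s*</PC-CompoundType_id_cid>}s;

    # every binary value in the block, in document order
    my @bins = $block
        =~ m{<PC-InfoData_value_binary>\s*(.*?)\s*</PC-InfoData_value_binary>}sg;

    return ( $cid, \@bins );
}

# Made-up sample block: one cid, two binary values.
my $sample = <<'XML';
<PC-Compound>
  <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>AAAA</PC-InfoData_value_binary>
  <PC-InfoData_value_binary>BBBB</PC-InfoData_value_binary>
</PC-Compound>
XML

my ( $cid, $bins ) = process_compound($sample);
print "$cid\t$_\n" for @$bins;    # one line per (cid, binary) pair
```

Keeping the values in a list per cid, rather than in two parallel arrays, means a one-to-many block cannot knock the pairing out of step.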