in reply to 1GB XML mining with XML:twig (newbies question)

The trouble is that extracting the records of interest(PC-CompoundType_id_cid,PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing
    ...
    my $root = $twig->root;
    my @cpds = $root->children_text($field1);
    my @bins = $root->children_text($field2);
Let me guess: if a PC-Compound block contains a "Type_id_cid" thingie, but no "InfoData_value_binary" thingie -- or vice-versa -- then your two arrays (@cpds and @bins) are not alignable. Is that it?

So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (esp. considering the size of the input file) would be to handle this matter at the stage of reading the data from the file.

Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):

    my $filename = "whatever...";
    {
        open( my $fh, "<", $filename ) or die "$filename: $!";
        local $/ = "</PC-Compound>";    # each read ends at a closing tag
        while (<$fh>) {    # read one entire PC-Compound block
            next unless ( m{\w\s*</PC-CompoundType_id_cid}
                      and m{\w\s*</PC-InfoData_value_binary} );
            # remove anything that precedes the block; /s lets "." cross
            # newlines, and [\s>] keeps a longer tag name from matching
            s/^.*?(?=<PC-Compound[\s>])//s;
            process_compound( $_ );
        }
    }
(update: added "(?=" and ")" in the s/// statement -- need to use positive look-ahead there.)

The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.
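As a minimal sketch of what such a sub might look like, here is one version that hands the block to XML::Twig. The return convention (the id followed by a list of values) and the sample data are assumptions made for illustration, not something taken from the original question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical process_compound(): parse one PC-Compound block and return
# the compound id followed by every binary value found in that block.
sub process_compound {
    my ($xml) = @_;

    my $twig = XML::Twig->new;
    $twig->parse($xml);    # dies if the block is not well-formed XML
    my $root = $twig->root;

    # first_descendant/descendants search the whole block, so it does not
    # matter how deeply the two elements are nested.
    my $cid_elt = $root->first_descendant('PC-CompoundType_id_cid');
    my $cid     = defined $cid_elt ? $cid_elt->text : undef;
    my @bins    = map { $_->text }
                  $root->descendants('PC-InfoData_value_binary');

    return ( $cid, @bins );
}

# Made-up sample block, only to show the call.
my $sample = <<'XML';
<PC-Compound>
  <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>00AB</PC-InfoData_value_binary>
</PC-Compound>
XML

my ( $cid, @bins ) = process_compound($sample);
print "$cid: @bins\n";
```

Because each block is parsed on its own, only one small tree is in memory at a time, which is the point of splitting the input before parsing.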

Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?
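One possible answer, sketched below, is to keep every value and store them as a list under the compound id. This is only an illustration: the sample data is made up, the %bins_for hash is a hypothetical name, and the regexes assume that neither element carries attributes and that the values consist of word characters.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real file.
my $data = <<'XML';
<PC-Compounds>
<PC-Compound>
  <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>AA</PC-InfoData_value_binary>
  <PC-InfoData_value_binary>BB</PC-InfoData_value_binary>
</PC-Compound>
<PC-Compound>
  <PC-CompoundType_id_cid>2</PC-CompoundType_id_cid>
</PC-Compound>
<PC-Compound>
  <PC-CompoundType_id_cid>3</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>CC</PC-InfoData_value_binary>
</PC-Compound>
</PC-Compounds>
XML

my %bins_for;    # cid => array ref holding all binary values of that block

# Read from an in-memory filehandle so the sketch runs without a data file.
open( my $fh, "<", \$data ) or die "in-memory file: $!";
{
    local $/ = "</PC-Compound>";    # one PC-Compound block per read
    while ( my $block = <$fh> ) {
        # A match in list context returns the captures; with /g it
        # returns the capture from every match in the block.
        my ($cid) = $block
            =~ m{<PC-CompoundType_id_cid>\s*(\w+)\s*</PC-CompoundType_id_cid>};
        my @bins = $block
            =~ m{<PC-InfoData_value_binary>\s*(\w+)\s*</PC-InfoData_value_binary>}g;

        next unless defined $cid and @bins;    # skip incomplete blocks
        push @{ $bins_for{$cid} }, @bins;
    }
}
close $fh;

for my $cid ( sort { $a <=> $b } keys %bins_for ) {
    printf "%s: %d value(s): %s\n",
        $cid, scalar @{ $bins_for{$cid} }, join( ", ", @{ $bins_for{$cid} } );
}
```

Keeping a list per id means the two data sets can never drift out of alignment, and the decision about which of several values to use can be made later.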

Re^2: 1GB XML mining with XML:twig (newbies question)
by Anonymous Monk on Feb 16, 2008 at 19:08 UTC
    Thanks for both the replies. I already solved the problem with the means offered by XML::Twig. It is possible to read just a portion of the data (one PC-Compound), parse it, and discard it at the end. In principle this is similar to your suggestions:
    my $twig = XML::Twig->new(
        twig_handlers => { 'PC-Compound' => \&subrutineforparsing },    # the key must be quoted because of the hyphen
    );
    $twig->parsefile($inputfile);
    As for several PC-InfoData_value_binary (aliases), I load them into an array and then use a regular expression to get just the alias from the NSC db. karpatov
      Hmm. My solution worked, but it was desperately slow and it ran into out-of-memory errors. So I decided to use your strategy (regex first, and only then the XML parser) and it is great. Tx. karpatov
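One possible cause of the memory errors, assuming the handler never released the elements it had already processed: XML::Twig keeps building the whole tree unless the handler calls purge (or flush). A minimal sketch of a handler that purges after each block follows. The handler name, the %bins_for hash and the sample data are hypothetical, and the sketch parses an inline string where the real script would call parsefile($inputfile).

```perl
use strict;
use warnings;
use XML::Twig;

my %bins_for;    # cid => array ref of binary values (hypothetical store)

# Called once for each completely parsed PC-Compound element.
sub handle_compound {
    my ( $twig, $cpd ) = @_;

    my $cid_elt = $cpd->first_descendant('PC-CompoundType_id_cid');
    my @bins    = map { $_->text }
                  $cpd->descendants('PC-InfoData_value_binary');

    # Keep only blocks that contain both kinds of element.
    $bins_for{ $cid_elt->text } = \@bins if defined $cid_elt and @bins;

    $twig->purge;    # free everything that has been fully parsed so far
}

# Made-up sample input.
my $sample = <<'XML';
<PC-Compounds>
  <PC-Compound>
    <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
    <PC-InfoData_value_binary>AA</PC-InfoData_value_binary>
    <PC-InfoData_value_binary>BB</PC-InfoData_value_binary>
  </PC-Compound>
  <PC-Compound>
    <PC-CompoundType_id_cid>2</PC-CompoundType_id_cid>
  </PC-Compound>
</PC-Compounds>
XML

my $twig = XML::Twig->new(
    twig_handlers => { 'PC-Compound' => \&handle_compound },
);
$twig->parse($sample);    # for a real file: $twig->parsefile($inputfile)

print "$_: @{ $bins_for{$_} }\n" for sort keys %bins_for;
```

Even with purge in place, the results kept in the hash still grow with the input, so for a 1GB file it may be better to write each record out inside the handler rather than collect them all.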