karpatov has asked for the wisdom of the Perl Monks concerning the following question:

Dear perlMonks,
I have the following problem:
I have an XML export (1GB) of a database of chemical compounds, and I would like to extract just a few parameters for each one. I tried to orient myself using the examples for XML::Twig, and the only result of my attempts is the code below (XML is completely new to me, and Perl almost completely). The trouble is that extracting the records of interest (PC-CompoundType_id_cid, PC-InfoData_value_binary) the way I do can go wrong when one of those items is missing for some compound. Thanks for your help. karpatov
#!/bin/perl -w
use strict;
use XML::Twig;

my $leader_name;
my $leader_score=0;
#my $field0= 'PC-Compound';
my $field1= 'PC-CompoundType_id_cid';
my $field2= 'PC-InfoData_value_binary';
my $twig= new XML::Twig( twig_roots => {$field1 => 1, $field2 => 1} );
$twig->parsefile( "D:/NCI-Open/vice shlavickou.xml");
my $root= $twig->root;
my @cpds= $root->children_text($field1);
my @bins= $root->children_text($field2);
print "@cpds\n";
print "@bins\n";
#$twig->print;

Simplified XML:
<PC-Compound>
  <PC-Compound_id>
    <PC-CompoundType>
      <PC-CompoundType_id>
        <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
      </PC-CompoundType_id>
    </PC-CompoundType>
  </PC-Compound_id>
  <PC-Compound_atoms>
  </PC-Compound_atoms>
  <PC-Compound_props>
    <PC-InfoData>
    </PC-InfoData>
    <PC-InfoData>
      <PC-InfoData_urn>
        <PC-Urn>
        </PC-Urn>
      </PC-InfoData_urn>
      <PC-InfoData_value>
        <PC-InfoData_value_binary>00000371E0723800000000000000000000000000000000000000000000000000000000000000000000001E00000000000814E180060208030004000800009008000000000000000000010800000200140080000700000520001000002400000000000000000000000000000000000000000000000000</PC-InfoData_value_binary>
      </PC-InfoData_value>
    </PC-InfoData>
    <PC-InfoData>
    </PC-InfoData>
  </PC-Compound_props>
</PC-Compound>

Replies are listed 'Best First'.
Re: 1GB XML mining with XML:twig (newbies question)
by pc88mxer (Vicar) on Feb 16, 2008 at 01:28 UTC
    Please elaborate on what kind of troubles you run into. Running out of memory comes to mind -- are there any other problems?

    This seems like a good job for line parsing. From the example fragment you have posted it seems like the XML file is very regular in its structure. If that is the case, I would stream in the file reading one <PC-Compound> element at a time like this:

    my @compound;
    while (<IN>) {
        if (m/^\s*<PC-Compound>/) {
            @compound = ($_);
        } elsif (m/^\s*<\/PC-Compound>/) {
            push(@compound, $_);
            process_compound();
            @compound = ();
        } else {
            push(@compound, $_) if (@compound);
        }
    }

    When process_compound() is called, the array @compound will hold the lines for one <PC-Compound> record, which you can process with XML::Twig or some other XML module. (Instead of pushing lines onto an array, you could also append to a string buffer if that's more convenient.)
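A minimal, self-contained sketch of the line-based approach above. The sample XML, the in-memory filehandle, and the body of process_compound() are assumptions made for illustration; the reply leaves that sub undefined, so here it merely stores each record as one string. It also assumes, as the reply does, that each opening and closing <PC-Compound> tag sits on a line of its own.

```perl
use strict;
use warnings;

# Hypothetical sample input; a real run would open the 1GB file instead.
my $xml = <<'XML';
<PC-Compounds>
<PC-Compound>
  <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>00AF</PC-InfoData_value_binary>
</PC-Compound>
<PC-Compound>
  <PC-CompoundType_id_cid>2</PC-CompoundType_id_cid>
</PC-Compound>
</PC-Compounds>
XML

my @records;    # one string per complete <PC-Compound> element

# Hypothetical stand-in for process_compound(): just keep the record text.
sub process_compound {
    my ($lines) = @_;
    push @records, join '', @$lines;
}

# Read from the string as if it were a file.
open my $in, '<', \$xml or die "cannot open in-memory file: $!";
my @compound;
while ( my $line = <$in> ) {
    if ( $line =~ m{^\s*<PC-Compound>} ) {        # start of a record
        @compound = ($line);
    }
    elsif ( $line =~ m{^\s*</PC-Compound>} ) {    # end of a record
        push @compound, $line;
        process_compound( \@compound );
        @compound = ();
    }
    elsif (@compound) {                           # inside a record
        push @compound, $line;
    }
}
close $in;

print scalar(@records), " records collected\n";
```

Passing the lines as an array reference, rather than relying on a file-level @compound, is a design choice of this sketch and not part of the original suggestion.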

    Another option is to use something like XSLT to extract the records of interest, but that's a whole other technology.

Re: 1GB XML mining with XML:twig (newbies question)
by graff (Chancellor) on Feb 16, 2008 at 16:08 UTC
    The trouble is that extracting the records of interest(PC-CompoundType_id_cid,PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing
    ...
    my $root= $twig->root;
    my @cpds= $root->children_text($field1);
    my @bins= $root->children_text($field2);
    Let me guess: if a PC-Compound block contains a "Type_id_cid" thingie, but no "InfoData_value_binary" thingie -- or vice-versa -- then your two arrays (@cpds and @bins) are not alignable. Is that it?

    So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (esp. considering the size of the input file), would be to handle this matter at the stage of reading the data from the file.

    Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):

    my $filename = "whatever...";
    {
        open( my $fh, "<", $filename ) or die "$filename: $!";
        local $/ = "</PC-Compound>";
        while (<$fh>) {    # read one entire PC-Compound block
            next unless ( m{\w\s*</PC-CompoundType_id_cid}
                      and m{\w\s*</PC-InfoData_value_binary} );
            # remove anything that precedes the block; /s lets "." cross
            # newlines, and \b keeps the <PC-Compounds> wrapper from matching
            s/^.*?(?=<PC-Compound\b)//s;
            process_compound( $_ );
        }
    }
    (update: added "(?=" and ")" in the s/// statement -- need to use positive look-ahead there.)

    The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.
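As a rough sketch of the record-separator approach, the following reads one block at a time and, instead of handing the block to an XML parser, pulls the two fields out with regular expressions. The sample data, the %binary_for hash, and this body of process_compound() are assumptions for illustration only. It also assumes the ID is a plain integer and each binary value is an unbroken run of hex digits, and it keeps every binary value found in a block, which is one possible answer to the question about multiple fields.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the real file.
my $xml = <<'XML';
<?xml version="1.0"?>
<PC-Compounds>
<PC-Compound>
  <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>00AF</PC-InfoData_value_binary>
  <PC-InfoData_value_binary>0B10</PC-InfoData_value_binary>
</PC-Compound>
<PC-Compound>
  <PC-CompoundType_id_cid>2</PC-CompoundType_id_cid>
</PC-Compound>
<PC-Compound>
  <PC-CompoundType_id_cid>3</PC-CompoundType_id_cid>
  <PC-InfoData_value_binary>FFFF</PC-InfoData_value_binary>
</PC-Compound>
</PC-Compounds>
XML

my %binary_for;    # cid => array ref of binary strings

# Hypothetical process_compound(): extract the fields with regexes.
sub process_compound {
    my ($block) = @_;
    my ($cid) = $block =~ m{<PC-CompoundType_id_cid>\s*(\d+)\s*</PC-CompoundType_id_cid>};
    my @bins  = $block =~ m{<PC-InfoData_value_binary>\s*([0-9A-Fa-f]+)\s*</PC-InfoData_value_binary>}g;
    return unless defined $cid and @bins;    # skip incomplete compounds
    $binary_for{$cid} = \@bins;
}

{
    open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
    local $/ = '</PC-Compound>';             # read up to each closing tag
    while ( my $block = <$fh> ) {
        next unless $block =~ m{</PC-Compound>\z};    # trailing fragment
        $block =~ s/^.*?(?=<PC-Compound\b)//s;        # drop leading junk
        process_compound($block);
    }
}

for my $cid ( sort { $a <=> $b } keys %binary_for ) {
    print "$cid: @{ $binary_for{$cid} }\n";
}
```

Because only one block is held in memory at a time, memory use should stay roughly proportional to the largest single compound rather than to the file size.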

    Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?

      Thanks for both replies. I have already solved the problem using what XML::Twig offers. It is possible to read just a portion of the data (one PC-Compound), parse it, and discard it at the end; in principle this is similar to your suggestions:
      my $twig= new XML::Twig( twig_handlers => { 'PC-Compound' => \&subrutineforparsing } );
      $twig->parsefile($inputfile);
      As for several PC-InfoData_value_binary entries (aliases), I load them into an array and then use a regular expression to get just the alias from the NSC db. karpatov
        Hmm. My solution worked, but it was desperately slow and ran into out-of-memory errors. So I decided to use your strategy (regex first, and only then the XML parser), and it is great. Thanks. karpatov
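For comparison, here is a hedged sketch of the twig_handlers approach with an explicit purge. The handler body in the post above is not shown, so whether a missing purge caused the out-of-memory errors is only a guess. The sample XML and the %binary_for hash are assumptions for illustration. The sketch requires the XML::Twig module and makes no claim about speed relative to the regex strategy.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input standing in for the real file.
my $xml = <<'XML';
<PC-Compounds>
  <PC-Compound>
    <PC-Compound_id><PC-CompoundType_id_cid>1</PC-CompoundType_id_cid></PC-Compound_id>
    <PC-Compound_props>
      <PC-InfoData><PC-InfoData_value_binary>00AF</PC-InfoData_value_binary></PC-InfoData>
    </PC-Compound_props>
  </PC-Compound>
  <PC-Compound>
    <PC-Compound_id><PC-CompoundType_id_cid>2</PC-CompoundType_id_cid></PC-Compound_id>
  </PC-Compound>
</PC-Compounds>
XML

my %binary_for;    # cid => array ref of binary strings

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a </PC-Compound> closing tag has been parsed.
        'PC-Compound' => sub {
            my ( $t, $compound ) = @_;
            my $cid_elt = $compound->first_descendant('PC-CompoundType_id_cid');
            my @bins    = map { $_->text }
                          $compound->descendants('PC-InfoData_value_binary');
            # Keep only compounds that have both an ID and binary data.
            $binary_for{ $cid_elt->text } = \@bins
                if defined $cid_elt and @bins;
            $t->purge;    # release everything parsed so far
        },
    },
);
$twig->parse($xml);    # for a file on disk, parsefile($path) instead

print "$_: @{ $binary_for{$_} }\n" for sort { $a <=> $b } keys %binary_for;
```

Keeping the ID and its values together inside one handler call also sidesteps the misaligned-arrays problem raised earlier in the thread.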
Re: 1GB XML mining with XML:twig (newbies question)
by Jenda (Abbot) on Feb 18, 2008 at 15:52 UTC
    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        start_rules => [
            'PC-InfoData_urn,PC-Compound_atoms' => 'skip',
        ],
        rules => [
            _default => 'as is',
            'PC-CompoundType_id_cid,PC-InfoData_value_binary' => 'content',
            'PC-InfoData' => sub {
                return unless $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
                return '@InfoData' => $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
            },
            'PC-Compound' => sub {
                my $id = $_[1]->{'PC-Compound_id'}{'PC-CompoundType'}{'PC-CompoundType_id'}{'PC-CompoundType_id_cid'}
                    or return;    # no ID found
                my $data = $_[1]->{'PC-Compound_props'}{'InfoData'}
                    or return;    # no data
                return $id => $data;
            },
            'PC-Compounds' => 'pass',
        ],
        stripspaces => 7,
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);

    __DATA__
    <PC-Compounds>
      <PC-Compound>
        <PC-Compound_id>
          <PC-CompoundType>
            <PC-CompoundType_id>
              <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
            </PC-CompoundType_id>
          </PC-CompoundType>
        </PC-Compound_id>
        <PC-Compound_atoms>
        </PC-Compound_atoms>
        <PC-Compound_props>
          <PC-InfoData>
          </PC-InfoData>
          <PC-InfoData>
            <PC-InfoData_urn>
              <PC-Urn>
              </PC-Urn>
            </PC-InfoData_urn>
            <PC-InfoData_value>
              <PC-InfoData_value_binary>00000371E0723800000000000000000000000000000000000000000000000000000000000000000000001E00000000000814E180060208030004000800009008000000000000000000010800000200140080000700000520001000002400000000000000000000000000000000000000000000000000</PC-InfoData_value_binary>
            </PC-InfoData_value>
          </PC-InfoData>
          <PC-InfoData>
          </PC-InfoData>
        </PC-Compound_props>
      </PC-Compound>
    </PC-Compounds>

    If the XML really looks like this, then this code will give you a reference to a hash of arrays: the keys of the hash are the values of <PC-CompoundType_id_cid>, and the values of the hash are arrays of the values of <PC-InfoData_value_binary>. It also handles <PC-Compound>s that lack the ID or the <PC-InfoData_value_binary>.
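To illustrate how such a hash of arrays might be consumed afterwards, here is a small sketch. The $data structure below is written by hand with made-up IDs and values, on the assumption that it has the shape described above; it is not actual output of the XML::Rules code.

```perl
use strict;
use warnings;

# Hand-written stand-in: compound ID => array ref of binary strings.
my $data = {
    1 => [ '00AF', '0B10' ],
    3 => [ 'FFFF' ],
};

my $total = 0;    # number of binary values across all compounds
for my $cid ( sort { $a <=> $b } keys %$data ) {
    my $values = $data->{$cid};
    $total += @$values;
    printf "cid %s: %d value(s), first is %s\n",
        $cid, scalar @$values, $values->[0];
}
print "total values: $total\n";
```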

    If there's more data in the file you may add some more tags into the skip-list in the start_rules and maybe even add special rules for some of the tags between <PC-Compound> and <PC-CompoundType_id_cid> or <PC-InfoData_value_binary> to get rid of the child tags and attributes you are not interested in.

    The rules specify what data from each branch is to be kept and what data is to be forgotten. 'content' means we want only the textual content of a tag (no attributes, no subtags), 'as is' means 'remember all data', and 'pass' means to remove that tag and add all its data into its parent tag (similar to transforming <R><sub><a>aaa</a><b>bbb</b></sub><c>ccc</c></R> to <R><a>aaa</a><b>bbb</b><c>ccc</c></R>). The rule for <PC-InfoData> forgets all tags that do not contain the data we are interested in and otherwise adds the data into an array named InfoData within the <PC-Compound_props>'s data. Finally, the rule for <PC-Compound> takes the ID from several tags below, takes the data and, if both are present, adds them to its parent tag's data using the ID as the key (normally an attribute or subtag name) and the data as the value.

    And the great thing is that before you start parsing the next <PC-Compound>, the only things from the previous one still in memory are the ID and the binary value you are interested in.