1GB XML mining with XML:twig (newbies question)

karpatov has asked for the wisdom of the Perl Monks concerning the following question:

Dear perlMonks,
I have the following problem:
I have a 1 GB XML dump of a database of chemical compounds, and I would like to extract just a few parameters for every single compound. I tried to find my way through the examples for XML::Twig, and the only result of my attempts is the code below (XML is completely new to me, and Perl almost completely). The trouble is that extracting the records of interest(PC-CompoundType_id_cid,PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing: the two lists are then no longer aligned.
Thanks for your help. karpatov

#!/bin/perl -w
use strict;
use XML::Twig;

my $leader_name;
my $leader_score=0;

#my $field0= 'PC-Compound';
my $field1= 'PC-CompoundType_id_cid';
my $field2= 'PC-InfoData_value_binary';

my $twig= new XML::Twig( twig_roots    => {$field1 => 1, $field2 => 1} );
$twig->parsefile( "D:/NCI-Open/vice shlavickou.xml");
my $root= $twig->root;
my @cpds= $root->children_text($field1);
my @bins= $root->children_text($field2);
print "@cpds\n";
print "@bins\n";
#$twig->print;

Simplified XML:

<PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
    <PC-Compound_atoms>
    </PC-Compound_atoms>
    
    
   
    <PC-Compound_props>
      <PC-InfoData>
      </PC-InfoData>

      <PC-InfoData>
        <PC-InfoData_urn>
          <PC-Urn>
          </PC-Urn>
        </PC-InfoData_urn>
        <PC-InfoData_value>
          <PC-InfoData_value_binary>00000371E0723800000000000000000000000000000000000000000000000000000000000000001E00000000000814E180060208030004000800009008000000000000000000010800000200140080000700000520001000002400000000000000000000000000000000000000000000000000</PC-InfoData_value_binary>
        </PC-InfoData_value>
      </PC-InfoData>
      
      <PC-InfoData>
      </PC-InfoData>
      
    </PC-Compound_props>
  </PC-Compound>


Replies are listed 'Best First'.

Re: 1GB XML mining with XML:twig (newbies question)
by pc88mxer (Vicar) on Feb 16, 2008 at 01:28 UTC

This seems like a good job for line parsing. From the example fragment you have posted it seems like the XML file is very regular in its structure. If that is the case, I would stream in the file reading one <PC-Compound> element at a time like this:

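# Assumes the name of the XML file is the first command-line argument.
open(IN, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";

my @compound;
while (<IN>) {
    if (m/^\s*<PC-Compound>/) {              # start of a record
        @compound = ($_);
    } elsif (m/^\s*<\/PC-Compound>/) {       # end of a record
        push(@compound, $_);
        process_compound();                  # works on the global @compound
        @compound = ();
    } else {
        push(@compound, $_) if (@compound);  # keep lines only inside a record
    }
}
close(IN);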

When process_compound() is called, the array @compound will hold the lines of one <PC-Compound> record, which you can process with XML::Twig or some other XML module. (Instead of pushing lines onto an array, you could also append to a string buffer if that's more convenient.)
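For illustration, process_compound() could look roughly like the sketch below. It is only a guess at what is wanted: the sub takes the record's lines as arguments instead of reading the global @compound (my choice, to keep it self-contained), and it returns the CID followed by every binary value, or nothing when either piece is missing or the record does not parse.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical process_compound(): parse one small record with XML::Twig.
# Returns (cid, binary values...) or an empty list if something is missing.
sub process_compound {
    my @lines = @_;

    my $twig = XML::Twig->new;
    # safe_parse() wraps the parse in an eval, so a broken record is
    # skipped instead of killing the whole run.
    $twig->safe_parse( join '', @lines ) or return;

    my $root    = $twig->root;
    my $cid_elt = $root->first_descendant('PC-CompoundType_id_cid');
    my @bins    = map { $_->text }
                  $root->descendants('PC-InfoData_value_binary');

    return unless $cid_elt && @bins;
    return ( $cid_elt->text, @bins );
}
```

In the loop above the call would then become something like my ($cid, @bins) = process_compound(@compound); since only one record is parsed at a time, memory use stays small no matter how big the file is.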

Another option is to use something like XSLT to extract the records of interest, but that's a whole other technology.


Re: 1GB XML mining with XML:twig (newbies question)
by graff (Chancellor) on Feb 16, 2008 at 16:08 UTC

The trouble is that extracting the records of interest(PC-CompoundType_id_cid,PC-InfoData_value_binary) the way I do can lead to troubles when the item for some compound is missing
...
my $root= $twig->root;
my @cpds= $root->children_text($field1);
my @bins= $root->children_text($field2);

So, do you want to limit your processing to just those blocks that have both elements, and skip the others? There's probably a clever way to do that using just the resources and methods provided by XML::Twig, but my first instinct (especially considering the size of the input file) would be to handle this matter at the stage of reading the data from the file.

Here's something similar to what was proposed in the first reply, to process only the relevant blocks of data (remember, I'm just trying to guess at what you are really trying to do -- apologies if I guessed wrong):

my $filename = "whatever...";
{
    open( my $fh, "<", $filename ) or die "$filename: $!";
    local $/ = "</PC-Compound>";

    while (<$fh>) {   # read one entire PC-Compound block
        next unless ( m{\w\s*</PC-CompoundType_id_cid} and
                      m{\w\s*</PC-InfoData_value_binary} );

        # remove anything that precedes the block; /s lets "." cross
        # newlines, and [\s>] keeps <PC-Compounds> from matching
        s/^.*?(?=<PC-Compound[\s>])//s;
        process_compound( $_ );
    }
}

The "process_compound" sub can use your favorite XML parsing module on the string that is passed to it.

Looking more closely at your sample of XML input data, it seems like there could be cases where a single "PC-Compound" block (having one "PC-CompoundType_id_cid") could have two or more "PC-InfoData_value_binary" fields. What are you supposed to do if that happens?
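One possible answer, sketched below, is to keep every binary value and store them as a list under the block's CID. This is only an assumption about what is wanted. To stay dependency-free the sketch pulls the two fields out with plain regexes rather than an XML module, which is reasonable only if the file is as regular as the sample; the names extract_compound and collect_compounds are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return (cid, [binary values]) for one block,
# or an empty list when the CID or the binary data is missing.
sub extract_compound {
    my ($block) = @_;
    my ($cid) = $block =~
        m{<PC-CompoundType_id_cid>\s*([^<]+?)\s*</PC-CompoundType_id_cid>};
    return unless defined $cid;
    my @bins = $block =~
        m{<PC-InfoData_value_binary>\s*([^<]+?)\s*</PC-InfoData_value_binary>}g;
    return unless @bins;
    return ( $cid, \@bins );
}

# Read one </PC-Compound>-terminated block at a time from a filehandle
# and build a hash: cid => [ all binary values seen for that cid ].
sub collect_compounds {
    my ($fh) = @_;
    local $/ = "</PC-Compound>";
    my %bins_for;
    while ( my $block = <$fh> ) {
        my ( $cid, $bins ) = extract_compound($block) or next;
        push @{ $bins_for{$cid} }, @$bins;
    }
    return \%bins_for;
}
```

Called as collect_compounds($fh) with the filehandle opened as in the code above, this skips incomplete blocks and never holds more than one block of raw XML in memory.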


Re^2: 1GB XML mining with XML:twig (newbies question)

by Anonymous Monk on Feb 16, 2008 at 19:08 UTC

my $twig = XML::Twig->new(
    twig_handlers => {
        # the key must be quoted: a bare PC-Compound is parsed as a subtraction
        'PC-Compound' => \&subrutineforparsing,
    },
);
$twig->parsefile($inputfile);
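A handler along the following lines might do the job. It is a sketch under assumptions: the names handle_compound, parse_compounds and %bins_for are invented for the example, and storing the values in a hash keyed by CID is just one plausible goal. The important part is the purge call, which throws away what has already been parsed so that memory does not grow with the file.

```perl
use strict;
use warnings;
use XML::Twig;

my %bins_for;    # cid => [ binary values ]

# Hypothetical handler: called once for each complete <PC-Compound>.
sub handle_compound {
    my ( $twig, $compound ) = @_;
    my $cid_elt = $compound->first_descendant('PC-CompoundType_id_cid');
    my @bins    = map { $_->text }
                  $compound->descendants('PC-InfoData_value_binary');

    # Keep the record only when both the ID and some data are present.
    push @{ $bins_for{ $cid_elt->text } }, @bins if $cid_elt && @bins;

    $twig->purge;    # free everything parsed so far
}

# Parses an XML string; for a file, call parsefile($name) instead of parse().
sub parse_compounds {
    my ($source) = @_;
    XML::Twig->new( twig_handlers => { 'PC-Compound' => \&handle_compound } )
             ->parse($source);
    return \%bins_for;
}
```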


Re^3: 1GB XML mining with XML:twig (newbies question)

by karpatov (Beadle) on Feb 18, 2008 at 16:00 UTC

Hmm. My solution worked, but it was desperately slow and ran into out-of-memory errors. So I decided to use your strategy (regex first, and only then the XML parser), and it works great. Thanks. karpatov


Re: 1GB XML mining with XML:twig (newbies question)
by Jenda (Abbot) on Feb 18, 2008 at 15:52 UTC

use strict;
use XML::Rules;

my $parser = XML::Rules->new(
    start_rules => [
        'PC-InfoData_urn,PC-Compound_atoms' => 'skip',
    ],
    rules => [
        _default => 'as is',
        'PC-CompoundType_id_cid,PC-InfoData_value_binary' => 'content',
        'PC-InfoData' => sub {
            return unless $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
            return '@InfoData' => $_[1]->{'PC-InfoData_value'}{'PC-InfoData_value_binary'};
        },
        'PC-Compound' => sub {
            my $id = $_[1]->{'PC-Compound_id'}{'PC-CompoundType'}{'PC-CompoundType_id'}{'PC-CompoundType_id_cid'}
                or return; # no ID found
            my $data = $_[1]->{'PC-Compound_props'}{'InfoData'}
                or return; # no data
            return $id => $data;
        },
        'PC-Compounds' => 'pass',
    ],
    stripspaces => 7,
);

my $data = $parser->parse(\*DATA);

use Data::Dumper;
print Dumper($data);


__DATA__
<PC-Compounds>
<PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType>
        <PC-CompoundType_id>
          <PC-CompoundType_id_cid>1</PC-CompoundType_id_cid>
        </PC-CompoundType_id>
      </PC-CompoundType>
    </PC-Compound_id>
    <PC-Compound_atoms>
    </PC-Compound_atoms>
    <PC-Compound_props>
      <PC-InfoData>
      </PC-InfoData>

      <PC-InfoData>
        <PC-InfoData_urn>
          <PC-Urn>
          </PC-Urn>
        </PC-InfoData_urn>
        <PC-InfoData_value>
          <PC-InfoData_value_binary>00000371E0723800000000000000000000000000000000000000000000000000000000000000001E00000000000814E180060208030004000800009008000000000000000000010800000200140080000700000520001000002400000000000000000000000000000000000000000000000000</PC-InfoData_value_binary>
        </PC-InfoData_value>
      </PC-InfoData>

      <PC-InfoData>
      </PC-InfoData>
    </PC-Compound_props>
  </PC-Compound>
</PC-Compounds>

If the XML really looks like this, then this code will give you a reference to a hash of arrays, the keys of the hash will be the values of the <PC-CompoundType_id_cid> and the values of the hash will be arrays of the values of <PC-InfoData_value_binary>. And it will handle the cases of <PC-Compound>s without the ID or the <PC-InfoData_value_binary>.
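Assuming the result really has that shape (a hash reference whose values are array references), walking it could look like the sketch below. The sub name report_lines and the sample data in the last line are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn { cid => [ values ] } into "cid<TAB>value"
# lines, one per binary value, with the CIDs in numeric order.
sub report_lines {
    my ($data) = @_;
    my @lines;
    for my $cid ( sort { $a <=> $b } keys %$data ) {
        push @lines, "$cid\t$_" for @{ $data->{$cid} };
    }
    return @lines;
}

# Stand-in data; in the real script pass the $data returned by the parser.
print "$_\n" for report_lines( { 1 => ['00000371E07238'], 7 => [ 'AB', 'CD' ] } );
```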

If there's more data in the file you may add some more tags into the skip-list in the start_rules and maybe even add special rules for some of the tags between <PC-Compound> and <PC-CompoundType_id_cid> or <PC-InfoData_value_binary> to get rid of the child tags and attributes you are not interested in.

The rules specify what data from each branch is kept and what is forgotten. 'content' means we want only the textual content of a tag (no attributes, no subtags), 'as is' means 'remember all data', and 'pass' means to remove that tag and add all its data to its parent tag (similar to transforming <R><a>aaa</a><x>bbb<c>ccc</c></x></R> to <R><a>aaa</a>bbb<c>ccc</c></R>, where <x> stands for the tag with the 'pass' rule). The rule for <PC-InfoData> forgets all tags that do not contain the data we are interested in and otherwise adds the data to an array named InfoData within the <PC-Compound_props>'s data. Finally, the rule for <PC-Compound> takes the ID from several tags below, takes the data and, if both are present, adds them to its parent tag's data using the ID as the key (normally an attribute or subtag name) and the data as the value.

And the great thing is that before you start parsing the next <PC-Compound>, the only things from the previous one still in memory are the ID and the binary value you are interested in.

HTH, Jenda
Support Denmark!
Defend the free world!
