in reply to XML processing taking too much time

You can find PPMs for XML::Twig in Randy Kobes' repository: http://cpan.uwinnipeg.ca/module/XML::Twig.

That said, I don't know how you load a 4GB file in XML::DOM. How much memory do you have on that machine? 40GB? With XML::Twig you can process parts of the XML (twigs as opposed to the whole tree ;--) so you should be able to keep memory usage lower, maybe much lower, depending on what you need to do, which should speed up processing. But if you can load the entire tree in memory and you can install libxml2, then you can also try XML::LibXML: porting the XML::DOM code would be much easier in this case, and XML::LibXML is much faster than XML::DOM.

Re^2: XML processing taking too much time
by koti688 (Sexton) on Mar 26, 2009 at 09:11 UTC
    Hmm, yes. I have only 2GB of memory. :(

    My XML contains multiple blocks of data. One block looks like this:

    <SigData>
    <KVPair>
    <Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
    <Value>c73df5228c35c419f884ba9571310cd7</Value>
    </KVPair>
    </SigData>


    I need to load the <Key> and <Value> elements of the tree into arrays like these:

    my @keys = getValuesFromPath($sigData ,"/SigData/KVPair/Key");
    my @values = getValuesFromPath($sigData ,"/SigData/KVPair/Value");

    So you want me to use XML::LibXML along with XML::Twig?

      I was just surprised that you could use XML::DOM at all on files of that size. And it looks like you can't actually: a 1GB XML file would take at least 8GB in memory using XML::DOM. So it might be interesting to know how you did it. What I meant was that if you had been able to do it, by throwing large amounts of memory at the problem, then XML::LibXML would have been an option.

      With XML::Twig you can very easily extract the k/v pairs:

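      use XML::Twig;

      my (@keys, @values);
      my $t = XML::Twig->new(
          twig_roots => {
              # field() reads direct children, and Key/Value sit under KVPair
              KVPair => sub {
                  push @keys,   $_->field('Key');
                  push @values, $_->field('Value');
                  $_->purge;    # free what has been parsed so far
              },
          },
      )->parsefile("my_big_fat_xml_file.xml");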

      Of course the @keys and @values arrays are going to be huge too, so you might still want to add a few GB of RAM to your machine, but at least the XML structure will never take up more than a few bytes.

      Other possible options are XML::Rules (I expect jenda to show up and give you an example as soon as he wakes up), and maybe the new XML::Reader, which seems quite appropriate. XML::LibXML's pull mode might also be appropriate, but I have never used it so I can't comment on it.
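
      For reference, here is a rough sketch of what the pull-mode approach could look like with XML::LibXML::Reader. It is untested against the real file; the inlined sample XML and the short k1/v1 values are made up for illustration, and for a real file you would pass location => "my_big_fat_xml_file.xml" instead of string => $xml.

      ```perl
      use strict;
      use warnings;
      use XML::LibXML::Reader;

      # Made-up sample input, same shape as the blocks shown above.
      my $xml = '<root>'
              . '<SigData><KVPair><Key>k1</Key><Value>v1</Value></KVPair></SigData>'
              . '<SigData><KVPair><Key>k2</Key><Value>v2</Value></KVPair></SigData>'
              . '</root>';

      my $reader = XML::LibXML::Reader->new(string => $xml)
          or die "cannot create reader\n";

      my (@keys, @values);

      # nextElement() skips forward to the next <KVPair> start tag.
      while ($reader->nextElement('KVPair')) {
          # Build a small DOM subtree for this one element only.
          my $pair = $reader->copyCurrentNode(1);
          push @keys,   $pair->findvalue('Key');
          push @values, $pair->findvalue('Value');
      }

      print "$keys[$_] => $values[$_]\n" for 0 .. $#keys;
      ```

      Only one <KVPair> subtree is materialised at a time, so the XML itself should stay small in memory; the two arrays still grow with the file.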

        :-))

        If you are sure each <KVPair> contains both <Key> and <Value> and is always in <SigData> you can use something as simple as this:

        use XML::Rules;

        my (@keys, @values);
        my $parser = XML::Rules->new(
            rules => {
                _default => '',
                Key      => sub { push @keys,   $_[1]->{_content} },
                Value    => sub { push @values, $_[1]->{_content} },
            },
        );
        $parser->parse(\*DATA);

        use Data::Dumper;
        print Dumper(\@keys);
        print Dumper(\@values);

        __DATA__
        <root>
          <SigData>
            <KVPair>
              <Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
              <Value>c73df5228c35c419f884ba9571310cd7</Value>
            </KVPair>
            <bogus>sdf sdhf nsdfg sdfgh nserg sfgdfgh</bogus>
          </SigData>
          <SigData>
            <KVPair>
              <Key>EB08F9990AE6545F9EA625412C71F24F7BF007ED</Key>
              <Value>C73DF5228C35C419F884BA9571310CD7</Value>
            </KVPair>
          </SigData>
        </root>

        If there is more in the XML you may skip some tags and their children by adding

        start_rules => { 'the,list,of,such,tags' => 'skip' },
        into the XML::Rules constructor.

        If you do not want to use the globals, you may do something like:

        my $parser = XML::Rules->new(
            stripspaces => 3,
            rules => {
                _default => '',
                Key      => 'content',
                Value    => 'content',
                KVPair   => 'pass',
                SigData  => sub {
                    return '@keys'   => $_[1]->{Key},
                           '@values' => $_[1]->{Value};
                },
                root     => 'pass',
            },
        );
        my $data = $parser->parse(\*DATA);

        use Data::Dumper;
        print Dumper($data);
        (assuming there is exactly one <KVPair> in each <SigData>! You'd have to add a test if it were optional.)

        Actually are you sure you want to build two interrelated arrays? Wouldn't it make more sense to create a single hash? Or maybe process the pair as soon as you read it instead of keeping them all in memory?

        The first would be

        my $parser = XML::Rules->new(
            stripspaces => 3,
            rules => {
                _default => '',
                Key      => 'content',
                Value    => 'content',
                KVPair   => sub { return $_[1]->{Key} => $_[1]->{Value} },
                SigData  => 'pass',
                root     => 'pass',
            },
        );
        my $data = $parser->parse(\*DATA);
        the other just means that you change the anonymous subroutine specified in the rule for <KVPair> or <SigData> to do the processing and to return nothing. That way you only need memory proportional to the size of the individual keys and values.
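
        A minimal sketch of that second option, processing each pair as it is read. The process_pair() subroutine, the $count variable and the inlined sample XML are hypothetical placeholders; put whatever you actually need to do with each pair in there.

        ```perl
        use strict;
        use warnings;
        use XML::Rules;

        # Made-up sample input, same shape as the real data.
        my $xml = '<root>'
                . '<SigData><KVPair><Key>k1</Key><Value>v1</Value></KVPair></SigData>'
                . '<SigData><KVPair><Key>k2</Key><Value>v2</Value></KVPair></SigData>'
                . '</root>';

        my $count = 0;

        # Hypothetical per-pair handler; replace the body with the real work.
        sub process_pair {
            my ($key, $value) = @_;
            print "$key => $value\n";
            $count++;
        }

        my $parser = XML::Rules->new(
            stripspaces => 3,
            rules => {
                _default => '',
                Key      => 'content',
                Value    => 'content',
                KVPair   => sub {
                    my ($tag, $attrs) = @_;
                    process_pair($attrs->{Key}, $attrs->{Value});
                    return;    # return nothing, so nothing accumulates in the parent
                },
            },
        );

        $parser->parse($xml);
        ```

        Because the <KVPair> rule returns nothing and everything else falls under _default => '', no data structure is built up while parsing.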

        Thanks a lot, Mirod. I will try the way you suggested. It seems I need to change my whole structure. I will let you know what happens.

        Thanks Again Koti.