Memory Efficient XML Parser

perlgoon has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Memory Efficient XML Parser
by moritz (Cardinal) on Dec 11, 2007 at 18:04 UTC

XML::Twig

Re: Memory Efficient XML Parser
by plobsing (Friar) on Dec 11, 2007 at 20:05 UTC

XML::SAX	a little low level, but highly customizable
XML::STX	A comfortable choice for those familiar with XSLT
XML::Twig	Previously mentioned, very perlish
XML::Records	Your first consideration. From what I can tell, its a simplified version of XML::Twig

[reply]

Re^2: Memory Efficient XML Parser

by eserte (Deacon) on Dec 11, 2007 at 21:09 UTC

XML::LibXML

[reply]

Re: Memory Efficient XML Parser
by KurtSchwind (Chaplain) on Dec 12, 2007 at 02:15 UTC

XML parsing is always a bit of a pig. Even so, I really question this situation. You must have some seriously complex XML.

100M files? When you said big I thought you were going to say a GIG or two. 100M? I've hit nearly 100M for my iTunes library xml file. Do you mind if I ask how much system memory do you have? Are you sure this is an XML issue? Can you describe the issues you are experiencing in a bit more detail?

--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

[reply]

Re^2: Memory Efficient XML Parser

by perlgoon (Initiate) on Dec 12, 2007 at 21:02 UTC

for my $subpkg (@$pkg)
{
    $sql .= "$delim($subpkg->{field1},$subpkg->{field2},$subpkg->{fiel
+d3})";
    $delim = ",";
}
$query = $db->do($sql);
[download]

[reply]
[d/l]

Re^3: Memory Efficient XML Parser

by Jenda (Abbot) on Dec 13, 2007 at 12:21 UTC

Wait a second, could you show us a bit more of the code? It looks as if you were first extracting all data in the XML, building a huge string and then tried to shove the whole string into the database. I'm not surprised the script and the database need a lot of memory and CPU time to cope with that!

It would be much better to parse one row, insert it to the database using a prepare()d statement handle, forget its data, parse the next one ... and if you want to optimize it and don't mind that it's a tiny little bit more complex, open the database connection with AutoCommit=>0 and commit only after every 1000 rows (you may need to do some benchmarking to find the right number here).

Jenda
Support Denmark!
Defend the free world!

[reply]

Re^4: Memory Efficient XML Parser

by perlgoon (Initiate) on Dec 18, 2007 at 16:53 UTC

Re^5: Memory Efficient XML Parser

by Jenda (Abbot) on Dec 18, 2007 at 18:26 UTC

Re^2: Memory Efficient XML Parser

by Jenda (Abbot) on Dec 13, 2007 at 13:57 UTC

There is no reason why XML parsing has to be a "pig" ... or to use a better defined term ... a memory hog. It only is if you first parse the whole XML and create a huge data structure or a huge maze of objects. While at times this is what you have to do or what's most convenient to do, it's not the only solution. And often it's even not the easiest solution. It's quite possible and often quite convenient to process XML in chunks using something like XML::Twig or XML::Record or specify what parts of the XML are you actually interested in and which ones can be ignored, buils a specialized datastructure as you parse the data and (if convenient) handle the chunks with XML::Rules.

Neither will continue eating up memory as the XML grows.

Jenda
Support Denmark!
Defend the free world!

[reply]

Re^3: Memory Efficient XML Parser

by eserte (Deacon) on Dec 13, 2007 at 21:23 UTC

use constant RECS => 1000000;
{
    open my $fh, ">/tmp/bla.xml" or die;
    select $fh;
    print "<addresses>\n";
    for (1..RECS) {
    print <<EOF;
  <address>
     <name>John Smith</name>
     <city>London</city>
  </address>
EOF
    }
    print "</addresses>\n";
}

{
    require Storable;
    my @addresses;
    for (1..RECS) {
    push @addresses, { name => "John Smith", city => "London" };
    }
    Storable::nstore(\@addresses, "/tmp/bla.st");
}
[download]

$ ( set time = ( 0 "%U+%S %E %MK" ) ; time perl -MStorable -e 'retriev
+e "/tmp/bla.st"' )
1.980+0.384 0:02.41 193974K
$ ( set time = ( 0 "%U+%S %E %MK" ) ; time perl -MXML::LibXML -e 'XML:
+:LibXML->new->parse_file("/tmp/bla.xml")->documentElement' )
6.037+1.876 0:08.15 643952K
[download]

[reply]
[d/l]
[select]

Re^4: Memory Efficient XML Parser

by Jenda (Abbot) on Dec 14, 2007 at 00:34 UTC

Re^5: Memory Efficient XML Parser

by eserte (Deacon) on Dec 14, 2007 at 20:57 UTC