Re: 1GB XML mining with XML:twig (newbies question)

Please elaborate on what kind of trouble you are running into. Running out of memory comes to mind -- are there any other problems?

This seems like a good job for line parsing. From the example fragment you have posted it seems like the XML file is very regular in its structure. If that is the case, I would stream in the file reading one <PC-Compound> element at a time like this:

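my @compound;
while (<IN>) {
    if (m/^\s*<PC-Compound>/) {
        # opening tag: start collecting a new record
        @compound = ($_);
    } elsif (m/^\s*<\/PC-Compound>/) {
        # closing tag: the record is complete, so process it and reset
        push(@compound, $_);
        process_compound();
        @compound = ();
    } else {
        # keep the line only if we are inside a record
        push(@compound, $_) if (@compound);
    }
}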

When process_compound() is called, the array @compound will hold the lines of one <PC-Compound> record, which you can process with XML::Twig or some other XML module. (Instead of pushing lines onto an array, you could append to a string buffer if that's more convenient.)
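Here is a minimal sketch of how the pieces might fit together, using the string-buffer variant and XML::Twig to parse each record. I am guessing at your data: the sample document and the element name PC-CompoundType_id_cid are assumptions, not taken from your file, and this process_compound() takes the record as an argument and returns one value purely for illustration. Adapt the extraction to whatever fields you actually need.

```perl
use strict;
use warnings;
use XML::Twig;

# Parse one <PC-Compound> record (passed in as a string) and return one field.
# The element name below is an assumption about the data.
sub process_compound {
    my ($xml) = @_;
    my $twig = XML::Twig->new();
    $twig->parse($xml);    # dies if the record is not well-formed
    my $elt = $twig->root->first_descendant('PC-CompoundType_id_cid');
    return $elt ? $elt->text : undef;
}

# Hypothetical sample standing in for the real 1GB file.
my $sample = <<'XML';
<PC-Compounds>
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType_id_cid>2244</PC-CompoundType_id_cid>
    </PC-Compound_id>
  </PC-Compound>
  <PC-Compound>
    <PC-Compound_id>
      <PC-CompoundType_id_cid>2519</PC-CompoundType_id_cid>
    </PC-Compound_id>
  </PC-Compound>
</PC-Compounds>
XML

# Read from an in-memory filehandle here; use a real file name in practice.
open(my $in, '<', \$sample) or die "Cannot open sample: $!";

my @cids;
my $buffer    = '';
my $in_record = 0;
while (my $line = <$in>) {
    if ($line =~ m/^\s*<PC-Compound>/) {
        # opening tag: start a fresh buffer
        $buffer    = $line;
        $in_record = 1;
    } elsif ($in_record && $line =~ m/^\s*<\/PC-Compound>/) {
        # closing tag: hand the complete record to the parser
        $buffer .= $line;
        push @cids, process_compound($buffer);
        $buffer    = '';
        $in_record = 0;
    } elsif ($in_record) {
        $buffer .= $line;
    }
}
close($in);

print join(',', @cids), "\n";
```

Only one record is held in memory at a time, so the size of the whole file should not matter much. Note that this relies on the opening and closing <PC-Compound> tags each sitting on their own line.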

Another option is to use something like XSLT to extract the records of interest, but that's a whole other technology.
