r1_fiend has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm really hoping someone here can help me out with this. Essentially, I'm writing a script to search through a huge (400 MB) XML document for nodes where a certain child element's text matches a given string. I first tried loading each full record into memory, checking the appropriate child node's text, and either printing or purging it depending on whether it matched. However, this ends up consuming a huge amount of memory, as there are over 400,000 records, and the purge method still keeps a reference to the root's main element in memory. It also takes forever, and my computer slows down drastically.

Here's a condensed version of that approach...(FullSearch.pl)

==============================================

use strict;
use XML::Twig;

my $file  = "bin/Example.xml";
my $node  = $ARGV[0];
my $value = $ARGV[1];

my $twig = XML::Twig->new( twig_handlers => { Record => \&Record } );
$twig->parsefile($file);
print "Finished Search!\n";

sub Record {
    my ( $twig, $record ) = @_;
    my @matcharray = $record->get_xpath($node);
    # only the last match is checked, as in the original slice
    if ( @matcharray and $matcharray[-1]->text eq $value ) {
        # PRINT ALL NODES FROM THIS RECORD HERE...
    }
    $twig->purge;
}
==============================================

Should that really take up a ton of system resources?

So I thought: what if I used a start_tag_handler to keep track of which main_element record I'm on, and then used twig_roots to compare the search value against only the specific child node I'm interested in? If there is a match, I store the record number; otherwise, I purge. This works a whole lot faster and consumes minimal memory.

Here's that approach...(MarkMatches.pl)

==============================================

use strict;
use XML::Twig;

my $file  = "bin/Example.xml";
my $node  = $ARGV[0];
my $value = $ARGV[1];
my $count = 0;

open( INDEX, ">bin/Index.txt" ) or die "Could not open index file: $!";

my $twig = XML::Twig->new(
    start_tag_handlers => { Record => \&Count },
    twig_roots         => { $node  => \&Record },
);
$twig->parsefile($file);
$twig->purge;

sub Count {
    my ( $twig, $element ) = @_;
    $count++;
    $twig->purge;
}

sub Record {
    my ( $twig, $record ) = @_;
    print INDEX $count . "\n" if $record->text eq $value;
    $twig->purge;
}
==============================================

Right now, I am saving the list of matching record numbers to a separate text file, then pulling that file into an array in a second script. This script again uses a start_tag_handler to track the current record count, and either lets the record be processed by the twig_handlers (printing it if its count is in the array) or ignores it altogether (using no memory).

Here's that script...(ReturnRecords.pl)

==============================================

use strict;
use XML::Twig;

my $file      = "bin/Example.xml";
my $data_file = "bin/Index.txt";
my $count     = 0;
my $i;

open( DAT, $data_file ) or die "Could not open file!";
my @id = <DAT>;
close(DAT);

my $twig = XML::Twig->new(
    start_tag_handlers => { Record => \&Check },
    twig_handlers      => { Record => \&Process },
);
$twig->parsefile($file);
$twig->purge;

sub Check {
    my ( $twig, $record ) = @_;
    $i = 0;
    $count++;
    # stop parsing once we are past the last matching record number
    if ( $count > $id[-1] ) { $twig->finish_now; }
    my $thisId = $count . "\n";    # index lines still have their newlines
    foreach my $id (@id) {
        if ( $thisId eq $id ) { $i = 1; last; }
    }
    $record->ignore if !$i;
}

sub Process {
    my ( $twig, $record ) = @_;
    # PRINT ALL NODES FROM RECORD HERE...
    $twig->purge;
}
==============================================

Is there a way I can combine these two scripts somehow? Could I maybe start rescanning the document within the first script? If I get a match with twig_roots, the entire twig consists of just the direct path to that one node. Is there a way to fully parse nodes where twig_roots matches? I can go backwards from the matching twig_root to the parent, but the only descendants available are the ones that made up the twig to begin with.
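
For example, here's the kind of single pass I'm imagining (an untested sketch; the idea is that the handler for the searched child fires before the handler for its enclosing Record):

==============================================

use strict;
use XML::Twig;

my $node    = $ARGV[0];    # assumed to be a path relative to Record, e.g. "Copyright/Year"
my $value   = $ARGV[1];
my $matched = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        # fires when the searched child element closes
        "Record/$node" => sub { $matched = 1 if $_[1]->text eq $value },
        # fires when the whole Record closes, fully built
        Record => sub {
            my ( $twig, $record ) = @_;
            $record->print if $matched;
            $matched = 0;
            $twig->purge;    # keep only one record in memory at a time
        },
    },
);
$twig->parsefile("bin/Example.xml");

==============================================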

Also, all of the main record elements have an ID associated with them (see below). Is there a way to jump to a matching ID without scanning each one, even without using start_tag_handlers and comparing the ID?

Here's how the XML file is set up...(Example.xml)

==============================================

<Root>
  <Record id="1">
    <Title>Title 1<\Title>
    <Year>2007<\Year>
    <Author>W. T. Wright<\Author>
    <Copyright>
      <Year>2006<\Year>
      <Number>A84LEU<\Number>
    <\Copyright>
    …
    <Info>Blah Blah Blah<\Info>
  <\Record>
  …
  <Record id="429000">
    <Title>Title 999<\Title>
    <Year>2004<\Year>
    <Author>A. R. Smith<\Author>
    <Copyright>
      <Year>2003<\Year>
      <Number>D93YAK<\Number>
    <\Copyright>
    …
    <Info>Halb Halb Halb<\Info>
  <\Record>
<\Root>
==============================================

Sorry this is so long. I've been working on optimizing this for a while, and just can't seem to get it to be as efficient as I need it to be. Right now it takes about 10-15 minutes to scan the document each time, or 20-30 minutes doing the "full search" (but the PC freezes, so that's not even an option).

Thanks!

Replies are listed 'Best First'.
Re: XML::Twig questions
by Jenda (Abbot) on Aug 30, 2008 at 15:08 UTC

    I know you asked for an XML::Twig solution, but it would not be me if I did not suggest an XML::Rules one instead. It would be something like this:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default  => 'content',
            Copyright => 'no content',
            # ...
            Record => sub {
                return unless $_[1]->{Title} eq 'Something';
                print "The stuff in the %{$_[1]} hash containing the attributes and subtag data.\n";
                return;
            },
        },
    );
    $parser->parse($file);

    The rules will allow you to ignore the subtags you do not need, so they will not even take up memory. Plus, you only ever have one <Record> in memory anyway. If the Record tag is more complex, you may want to use XML::Rules->inferRulesFromExample() to get the rules.
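
    If I remember the helper's interface right, it is something like this (the file name is from your example):

    use XML::Rules;
    use Data::Dumper;

    # infer a rules skeleton from a sample document, then tweak it by hand
    my $rules = XML::Rules::inferRulesFromExample('bin/Example.xml');
    print Dumper($rules);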

Re: XML::Twig questions
by dHarry (Abbot) on Aug 30, 2008 at 15:18 UTC

    1. Please use <code> and <readmore> tags. Your code is interpreted as markup and has some weird stuff in it.

    2. I first tried loading each full record into memory, checking the appropriate child node's text, and either printing or purging it depending on whether it matched.

    This seems unnecessary to me, see 4.

    3. Is there a way I can combine these two scripts somehow?

    I would say yes, but then again it's not really clear to me what you're trying to achieve. There are several ways of using XML::Twig, and you seem to mix things up a little bit. What exactly are you trying to do (from a functional point of view)?

    4. Is there a way to jump to a matching ID without scanning each one, even without using start_tag_handlers and comparing the ID?

    You can "jump" using xpath expressions. The parser has to go through the file anyway of course but you probably don’t have to built all the Twigs in memory.

    For example:
    ... twig_handlers => { 'Record[@id="429000"]' => \&Record } );
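
    A more complete sketch of that (untested; the id is just the one from your example, and twig_roots keeps everything outside the matching Record out of memory):

    use strict;
    use XML::Twig;

    my $twig = XML::Twig->new(
        # build (and hand to the handler) only the Record with this id
        twig_roots => { 'Record[@id="429000"]' => sub { $_[1]->print } },
    );
    $twig->parsefile('bin/Example.xml');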

    5. Right now it takes about 10-15 minutes to scan the document each time, or 20-30 minutes doing the "full search" (but the PC freezes, so that's not even an option).

    The 10-15 minutes doesn't sound too bad to me. XML::Twig is implemented the OO way, and you generate lots of method calls. I have used XML::Twig on XML files up to 700 MB, and the times only go up from there. If speed is really a big issue for you, you can try to optimize. See Speedup for an approximately 30% gain.

    Maybe brother mirod can shed some light on it. He wrote the stuff and knows it inside out.

    Hope this helps

Re: XML::Twig questions
by mirod (Canon) on Aug 30, 2008 at 17:29 UTC

    Your question is not really clear to me (maybe it's because it's the weekend). Without the arguments you use to call the script, I don't see exactly what it is you are trying to achieve here.

    In case this is what you need (and after reading Jenda's obligatory XML::Rules plug ;--), maybe have a look at the ignore method (or the ignore_elts option for new, which needs some more docs, I reckon).
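
    Out of my head, something along these lines (check the docs for the exact values ignore_elts accepts, I may be misremembering):

    my $twig = XML::Twig->new(
        ignore_elts   => { Info => 'discard' },      # subtrees you never search are not built at all
        twig_handlers => { Record => \&process },    # \&process is whatever you do with a record
    );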

    If that doesn't help, maybe, as previously suggested, a higher level description of what you are trying to do would help.

Re: XML::Twig questions
by cutlass2006 (Pilgrim) on Aug 30, 2008 at 19:13 UTC

    Anything written directly with XML::Parser should be faster than XML::Twig.

    Outside of Perl, other routes would include the SAXON XSLT/XQuery processor, which is highly optimized for this kind of thing, or an XML database like eXist.

Re: XML::Twig questions
by Tanktalus (Canon) on Aug 31, 2008 at 04:14 UTC

    That's not REALLY your XML, is it? (Did you just make stuff up instead of copying and pasting?) Hint: XML usually uses "/" instead of "\". In Windows, the two are interchangeable as path separators. Not so in Unix or XML.

    That said, you have a hierarchical database. It's big. And you want to load, parse, and query it in a subprocess. That sounds like a recipe for slowness.

    Instead, I would do the following. First, I would start with the naive XML::Twig implementation: load the whole sucker into RAM and have it available for queries. Then I'd set it up as a daemon, probably with Net::Server. The subprocess that you're currently using would then just connect to the daemon and send the query; the daemon would look it up in the in-memory cache and return the value (see Storable for sending data from one process to another, especially if both ends are on the same machine, which means they should be using the same level of Perl). This theory is based on the assumption that it's the loading and parsing of the XML that takes the longest. Then I'd see if the performance was acceptable. If not, plan B. (Though, if it's just swapping problems, add more RAM.)
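
    Something like this minimal sketch, assuming Net::Server and a line-per-query protocol I just made up (the port, the Title-only search, and the file name are all illustrative; I'm skipping the Storable part):

    package TwigServer;
    use strict;
    use base 'Net::Server';
    use XML::Twig;

    my $twig;    # parsed once at startup, then kept in RAM

    sub process_request {
        my $self = shift;
        while ( my $query = <STDIN> ) {    # STDIN/STDOUT are tied to the client socket
            chomp $query;
            # scan the already-built tree; the expensive parse happened once
            for my $rec ( $twig->root->children('Record') ) {
                print $rec->sprint, "\n"
                    if $rec->first_child_text('Title') eq $query;
            }
            print "DONE\n";
        }
    }

    $twig = XML::Twig->new;
    $twig->parsefile('bin/Example.xml');    # the slow part, done once
    TwigServer->run( port => 9000 );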

    The next option is to hand the entire piece of work over to a more generic hierarchical database. If no hierarchical database is available, you may be able to use a separate program to parse the XML and load a relational database, though I hear that DB2 has a new "pureXML" ability which allows it to shred XML right into the database and give you an SQL interface (other vendors may have something similar, I don't know). This would be more expensive (unless pureXML is available with their Express-C option, I don't know that, either), but it's likely to work fairly quickly. And probably a lower RAM requirement than my first option above. The other expensive part is switching your mindset over to an SQL-like method of querying instead of trying to do it all in one process. If this also doesn't have acceptable performance (either relationally or hierarchically), you probably have requirements that are going to be hard to meet in your current hardware setup.

Re: XML::Twig questions
by Perlbotics (Archbishop) on Aug 30, 2008 at 21:59 UTC

    The example suggests you are dealing with some kind of library information. If that example is real, I assume this kind of information is rather static?

    Maybe you are better off importing the XML stuff into a "real" database and later on adding/removing only the changes? I am also assuming that this is not a one-time activity for a given XML document. Even if your XML generator/tool can only produce full dumps (400 MB), adding/removing the changes between two report periods should be faster than importing the whole thing again...?

    It might cost you in total > 800 MB additional disk space (w/o compression). Well, lots of assumptions so far...

Re: XML::Twig questions
by r1_fiend (Initiate) on Aug 31, 2008 at 02:51 UTC
    Let me try to clarify the overall view of what I'm trying to do. I have a GUI interface written in AHK (www.autohotkey.com), where users can select a category (node) to search by, and enter a search value. They hit go, and the Perl script is called with two arguments, the node to search and the value to look for. The cmd console window is suppressed, and the content that would be written to STDOUT is piped to an edit control in the GUI. The user is able to stop the script if a match is found by effectively killing the perl.exe Windows process.

    Is XML::Twig even the best approach for this problem? I want this to run quickly and not use a ton of system resources.

      In that case importing the data into a database (most likely DBD::SQLite would be enough), adding a few indexes and searching there would be both quickest and easiest. And if you do need the data in XML format you can export the records you find that way fairly easily.

      Otherwise you end up parsing and reparsing the file over and over again, which will be slow.
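
      A rough sketch of that import, assuming a flat table of a few searchable fields (the table layout and file names are just guessed from the example XML above):

      use strict;
      use XML::Twig;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=bin/records.db', '', '',
                              { RaiseError => 1 } );
      $dbh->do('CREATE TABLE IF NOT EXISTS record
                (id INTEGER PRIMARY KEY, title TEXT, year TEXT, author TEXT)');
      my $sth = $dbh->prepare('INSERT INTO record VALUES (?,?,?,?)');

      $dbh->begin_work;    # a single transaction makes the bulk insert much faster
      XML::Twig->new(
          twig_handlers => {
              Record => sub {
                  my ( $twig, $rec ) = @_;
                  $sth->execute( $rec->att('id'),
                                 map { $rec->first_child_text($_) }
                                     qw(Title Year Author) );
                  $twig->purge;    # one record in memory at a time
              },
          },
      )->parsefile('bin/Example.xml');
      $dbh->commit;

      # later searches are then plain indexed SQL, e.g.
      #   SELECT id FROM record WHERE title = ?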