my xml xpath is too slow.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, my code runs, but it is extremely slow. Any suggestions to improve it?


use strict;
use XML::Parser;
use XML::XPath;
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords('en');

my $file = $ARGV[0];
my $xp = XML::XPath->new(filename=>$file);

for (my $n = 1; $n <= 600; $n++) {

    my $textnodeset = $xp->find('//pair[@id = '.$n.']/tAnnotation/tree/node/word/attribute[@name="token"]');
    my @texts;
    if (my @textnodelist = $textnodeset->get_nodelist) {
        @texts = map($_->string_value, @textnodelist);
    }

    my %seent;
    my @uniqt = grep !$seent{$_}++, @texts;

    my $hyponodeset = $xp->find('//pair[@id = '.$n.']/hAnnotation/tree/node/word/attribute[@name="token"]');
    my @hypos;
    if (my @hyponodelist = $hyponodeset->get_nodelist) {
        @hypos = map($_->string_value, @hyponodelist);
    }

    my %seenh;
    my @uniqh = grep !$seenh{$_}++, @hypos;

    my @termst = grep { ! $stopwords->{ $_ } } @uniqt;
    my @termsh = grep { ! $stopwords->{ $_ } } @uniqh;

    for my $i (0 .. $#termst) {
        for my $j (0 .. $#termsh) {
            print "$termst[$i] $termsh[$j]\n";
        }
    }
}

My input is an XML file, and my output is one pair of extracted items per line. Any ideas for improving it? Thanks.


Replies are listed 'Best First'.
Re: my xml xpath is too slow. by mirod (Canon) on Aug 26, 2009 at 09:22 UTC
The first suggestion that comes to my mind is to ditch XML::XPath and replace it with XML::LibXML. That will speed up your code, and you will be using a module that is actually maintained. Then of course the `//pair[@id = '.$n.']` is a big red flag, especially as you seem to be doing this twice for each `$n` (it's hard to tell as your code is not indented). Either give the complete path down to `pair`, or do the search on `//pair` only once, cache the results in a hash `id => node` and use this in the rest of your code. Does this make sense?
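A minimal sketch of the caching idea described above, using XML::LibXML. The sample document is hypothetical: its layout is only inferred from the XPath expressions in the question, and the names `%pair_by_id` and `tokens` are made up for illustration. Stop-word filtering and de-duplication are left out to keep the sketch short.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample; the real file's layout is assumed from the
# XPath expressions in the question.
my $xml = <<'XML';
<e-c>
  <pair id="1">
    <tAnnotation><tree>
      <node><word><attribute name="token">cats</attribute></word></node>
      <node><word><attribute name="token">sleep</attribute></word></node>
    </tree></tAnnotation>
    <hAnnotation><tree>
      <node><word><attribute name="token">animals</attribute></word></node>
    </tree></hAnnotation>
  </pair>
</e-c>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Search for //pair once and cache the nodes in a hash: id => node.
my %pair_by_id;
$pair_by_id{ $_->getAttribute('id') } = $_ for $doc->findnodes('//pair');

# Return the token strings under one annotation of a cached <pair>,
# using an XPath expression relative to that node.
sub tokens {
    my ($pair, $annotation) = @_;
    return map { $_->textContent }
        $pair->findnodes(qq{./$annotation/tree/node/word/attribute[\@name="token"]});
}

for my $id (sort { $a <=> $b } keys %pair_by_id) {
    my $pair = $pair_by_id{$id};
    for my $t (tokens($pair, 'tAnnotation')) {
        for my $h (tokens($pair, 'hAnnotation')) {
            print "$t $h\n";
        }
    }
}
```

With this layout the document is scanned for `pair` elements once, and each later lookup starts from the cached node instead of the document root. To read a file rather than a string, `load_xml(location => $file)` can be used instead.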
Re^2: my xml xpath is too slow. by Anonymous Monk on Aug 26, 2009 at 09:38 UTC
Thanks for the reply. I have never used LibXML; do you think it would work faster? I have to say that using the full path for pair (/e-c/pair instead of //pair) did not help the speed. Moreover, since the data comes in pairs, I have to extract the two halves in separate steps, as in my code.
Re^3: my xml xpath is too slow. by mirod (Canon) on Aug 26, 2009 at 09:47 UTC
XML::LibXML would certainly run much faster; XML::XPath is quite slow, and the code would be very similar. I am surprised that removing the // did not speed up the code, though.
Re: my xml xpath is too slow. by grizzley (Chaplain) on Aug 26, 2009 at 14:27 UTC
Wouldn't it be easier to write it (and read it) this way?:

use strict;
use XML::Parser;
use XML::XPath;
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords('en');
my $file = $ARGV[0];
my $xp = XML::XPath->new(filename=>$file);

for my $n(1..600) {
    my %seent;
    my $textnodeset = $xp->find('//pair[@id = '.$n.']/tAnnotation/tree/node/word/attribute[@name="token"]');
    my @texts = map $_->string_value, $textnodeset->get_nodelist;
    my @termst = grep{!$seent{$_}++ && !$stopwords->{$_}} @texts;

    my %seenh;
    my $hyponodeset = $xp->find('//pair[@id = '.$n.']/hAnnotation/tree/node/word/attribute[@name="token"]');
    my @hypos = map $_->string_value, $hyponodeset->get_nodelist;
    my @termsh = grep{!$seenh{$_}++ && !$stopwords->{$_}} @hypos;

    for my $i (@termst) {
        for my $j (@termsh) {
            print "$i $j\n"
        }
    }
}

You could search google for 'xpath unique', so that xpath returns only unique values. But I am not sure if it will speed things up or slow them down...
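As a small check on the rewrite above, the sketch below compares the two-pass approach from the question (de-duplicate, then drop stop words) with the single `grep` used in the rewrite. The stop-word table and the word list are hypothetical stand-ins; the table mimics the hash reference that `getStopWords('en')` is used as in the thread. The sub names `two_pass` and `one_pass` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stop-word table standing in for getStopWords('en'),
# which the thread's code uses as a hash reference keyed by word.
my $stopwords = { the => 1, a => 1, on => 1 };

# Two passes, as in the original question: de-duplicate, then filter.
sub two_pass {
    my @words = @_;
    my %seen;
    my @uniq = grep { !$seen{$_}++ } @words;
    return grep { !$stopwords->{$_} } @uniq;
}

# One pass, as in the rewrite: both conditions in a single grep.
sub one_pass {
    my @words = @_;
    my %seen;
    return grep { !$seen{$_}++ && !$stopwords->{$_} } @words;
}

my @tokens = qw(the cat sat on the mat the cat);
print join(' ', two_pass(@tokens)), "\n";
print join(' ', one_pass(@tokens)), "\n";
```

Both subs keep the first occurrence of every non-stop word, in input order. This tidies the code but is unlikely to change the running time much, since the replies point at the repeated `//pair` searches as the likely cost.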
Re: my xml xpath is too slow. by Jenda (Abbot) on Sep 03, 2009 at 21:44 UTC
Hard to say without seeing the XML, but I would use XML::Rules to trim and simplify the data before entering the loop.

Jenda
Enoch was right! Enjoy the last years of Rome.