Heh, just a week later and I kicked myself hard once and again to finally read the PDF paper in earnest, and, of other sources, only a less cruel text linked in a footnote in said PDF. The blog entry, linked from PWC #145 task #2, didn't open for me, which I didn't pursue. I.e. I didn't consult anyone's code before writing this implementation (speed, clarity, interface have, of course, room for improvement):
package Eertree;
use strict;
use warnings;
use Exporter 'import';
our @EXPORT_OK = qw/ ee /;
sub ee {
my $str = shift;
my $FULL = shift; # if true: return "nodes" arrayref "as is", th
+en DIY:
# extract and expand to palindrome complete
+list
# (it's trivial, isn't it?), work with each
# palindrome's position, do statistics, etc.
# if false: just return unique palindromes as
+a list
# (note: it's faster because there's less in
+fo
# collected)
my @nodes = (
{ len => -1, lnk => 0 },
{ len => 0, lnk => 0 }
);
my $lpspi = 0; # "longest proper suffix-palindrome index"
# (of substring processed so far)
for my $i ( 0 .. length( $str ) - 1 ) {
my $c = substr $str, $i, 1;
my $current = $lpspi;
my $n = $nodes[ $current ];
while ( $c ne substr $str, $i - $n-> { len } - 1, 1 ) {
$current = $n-> { lnk };
$n = $nodes[ $current ]
}
my $suffix; # trailing palindrome index
# after adding current character
if ( $current == 0 ) { # lone character "palindrome"
# (note: this and following branches have some code in com
+mon i.e.
# aren't dry, I didn't test whether consolidation would im
+pair
# speed, but I think a couple of vague conditionals, inste
+ad,
# could impair clarity
if ( exists $n-> { edges }{ $c }) { # already seen?
$suffix = $n-> { edges }{ $c };
push @{ $nodes[ $suffix ]{ offsets }}, $i if $FULL
}
else {
my $p = { len => 1, lnk => 1, offsets => [ $i ]};
push @nodes, $p;
$n-> { edges }{ $c } = $suffix = $#nodes
}
}
else {
if ( exists $n-> { edges }{ $c }) { # already seen?
$suffix = $n-> { edges }{ $c };
_descend( $suffix, $i, \@nodes ) if $FULL
}
else {
my $L = 2 + $n-> { len };
my $p = {
len => $L,
offsets => [ $i - $L + 1 ],
};
push @nodes, $p;
$n-> { edges }{ $c } = $suffix = $#nodes;
do {
$current = $n-> { lnk };
$n = $nodes[ $current ];
} while ( $c ne substr $str, $i - $n-> { len } - 1, 1
+);
$p-> { lnk } = $n-> { edges }{ $c };
_descend( $p-> { lnk }, $i, \@nodes ) if $FULL
}
}
# make sure it's "proper"
$lpspi = ( $i + 1 ) == $nodes[ $suffix ]{ len }
? $nodes[ $suffix ]{ lnk }
: $suffix;
}
return \@nodes if $FULL;
splice @nodes, 0, 2;
return map substr( $str, $_-> { offsets }[ 0 ], $_-> { len }), @no
+des
}
sub _descend {
my ( $x, $i, $nodes ) = @_;
while ( $x > 1 ) {
my $n_ = $nodes-> [ $x ];
push @{ $n_-> { offsets }}, $i - $n_-> { len } + 1;
$x = $n_-> { lnk }
}
}
1;
Then I checked PWC #146 i.e. next one, for recap; there were _very_ few blog links (it was New Year's vacation time, unfortunately), and only _one_ says its Perl is Python's port of _real_ eertree. This one I ran and found just too slow for short inputs; then it gave "deep recursion" (what???) warnings for still too short but slightly longer input. I terminated the process it took too long. Then I checked only 2 more random GH directory PWC #145 entries; didn't find eertree there neither. Found a funny excuse _not_ to do an eertree :-). I'd be _really_ grateful if someone points me to a GH PWC #145 Perl's eertree code.
Which leaves String::Eertree only, I think. With a harness as in (grand)*parent node:
my $s = join "", map chr(ord('a') + int rand 26), 1 .. 200;
cmpthese(-3, {
eertree => sub { 'String::Eertree'->new(string => $s)->uniq_palind
+romes },
rosetta_new => sub { rosetta_fixed($s) },
ee => sub { ee($s) },
});
Rate rosetta_new eertree ee
rosetta_new 271/s -- -58% -93%
eertree 651/s 140% -- -84%
ee 3964/s 1362% 509% --
Which, I guess, is due to "Perl's subs/methods calls are too expensive" (_lots_ of calls in S::E internally) and/or "Moo"? or something else? But wait, comparison above is unfair ("give all advantage etc.") because 'ee' skips some work if only unique palindromes are required. Let's rather do this:
cmpthese(-3, {
eertree => sub { 'String::Eertree'->new(string => $s) },
ee => sub { ee($s, 1) },
});
Rate eertree ee
eertree 648/s -- -80%
ee 3199/s 394% --
I.e. contestants consume their strings, then build internal structures as required, without producing unique or non-unique lists (note: 'ee' also provides offsets for each (non-unique) pali which S::E doesn't. Not too much work, but still...) Yeah, that's more fair.
In fact random chars, very short (200 _is_ short) strings don't look interesting. I tried what follows with (what looked at the time as) more entertaining input -- a string up to 400+k chars with "real" pali randomly interspersed within. Maybe it's as uninteresting as previous. The purpose was to also check the linearity i.e. O(n)-ness which, as I squinted in a skewed way (guess in which favour) didn't look right to me at first. False alarm, both appear O(n) ("X" - thousands of repetitions of known pali (randomly placed) in the input; Y -- seconds; A -- ee, B -- S::E). But perhaps an input can be constructed which results in catastrophic slowdown in A and/or B. I have no proof i.e. can't write scientific PDF papers.
+
2.5 +-------------------------------------------------+
| + + + + + + + + + |
| |
| |
| B |
| |
| B |
2 |-+ +-|
| B |
| B |
| |
| B |
| |
| B |
1.5 |-+ B +-|
| |
| B |
| |
| B |
| |
| B |
| B |
1 |-+ +-|
| B |
| |
| B |
| B A |
| A A |
| B A A |
0.5 |-+ A A +-|
| B A A |
| B A |
| A A |
| B A A |
| B A A |
| A A + + + + + + + + |
0 +-------------------------------------------------+
0 2 4 6 8 10 12 14 16 18 20
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';
use String::Eertree;
use lib '.';
use Eertree 'ee';
use Benchmark qw/ timeit :hireswallclock /;
use Chart::Gnuplot;
use File::Spec::Functions 'catfile';
use File::Basename 'dirname';
my $GP = catfile dirname( $^X ), '../../c/bin/gnuplot.exe';
die unless -x $GP; # assume Strawberry Perl
use constant REPEATS => 5;
STDOUT-> autoflush( 1 );
sub get_sample {
my $n = shift;
my $str = 'amanaplanacanalpanama';
my $s = '';
for ( 1 .. $n * 1000 ) {
substr $s, rand( length $s ), 0, $str
}
return $s
}
my ( @xdata, @ydata1, @ydata2, @datasets );
for my $n ( 1 .. 20) {
print "$n ";
my $s = get_sample( $n );
push @xdata, $n;
{
my $t = timeit REPEATS, sub { ee( $s, 1 )};
my $y = $t-> [ 1 ] / $t-> [ -1 ];
push @ydata1, $y;
}
{
my $t = timeit REPEATS, sub { String::Eertree-> new( string =>
+ $s )};
my $y = $t-> [ 1 ] / $t-> [ -1 ];
push @ydata2, $y;
}
}
print "\n\n";
push @datasets, Chart::Gnuplot::DataSet-> new(
xdata => \@xdata,
ydata => \@ydata1,
style => 'points',
),
Chart::Gnuplot::DataSet-> new(
xdata => \@xdata,
ydata => \@ydata2,
style => 'points',
);
my $chart = Chart::Gnuplot-> new(
gnuplot => $GP,
terminal => 'dumb size 60, 40',
);
$chart-> plot2d( @datasets );
Actually, thank you very much, @choroba, for introducing me to the subject. Very entertaining.
|