Re^6: Possessive sub-pattern with non-greedy content + recursion: WHY does this work??

Heh, just a week later and I kicked myself hard once and again to finally read the PDF paper in earnest, and, of other sources, only a less cruel text linked in a footnote in said PDF. The blog entry, linked from PWC #145 task #2, didn't open for me, which I didn't pursue. I.e. I didn't consult anyone's code before writing this implementation (speed, clarity, interface have, of course, room for improvement):

package Eertree;

use strict;
use warnings;

use Exporter 'import';
our @EXPORT_OK = qw/ ee /;

sub ee {
    my $str = shift;
    my $FULL = shift;   # if true: return "nodes" arrayref "as is", th
+en DIY: 
                        #   extract and expand to palindrome complete 
+list 
                        #   (it's trivial, isn't it?), work with each 
                        #   palindrome's position, do statistics, etc.
                        # if false: just return unique palindromes as 
+a list 
                        #   (note: it's faster because there's less in
+fo 
                        #   collected)
                        
    my @nodes = ( 
        { len => -1,    lnk => 0 },
        { len =>  0,    lnk => 0 }
    );
    my $lpspi = 0;      # "longest proper suffix-palindrome index"
                        # (of substring processed so far)
    
    for my $i ( 0 .. length( $str ) - 1 ) {
        my $c = substr $str, $i, 1;
        
        my $current = $lpspi;
        my $n = $nodes[ $current ];
        while ( $c ne substr $str, $i - $n-> { len } - 1, 1 ) {
            $current = $n-> { lnk };
            $n = $nodes[ $current ]
        }
        
        my $suffix;     # trailing palindrome index 
                        # after adding current character
        
        if ( $current == 0 ) {          # lone character "palindrome"
            
            # (note: this and following branches have some code in com
+mon i.e.
            # aren't dry, I didn't test whether consolidation would im
+pair 
            # speed, but I think a couple of vague conditionals, inste
+ad, 
            # could impair clarity
            
            if ( exists $n-> { edges }{ $c }) {     # already seen?
                $suffix = $n-> { edges }{ $c };
                push @{ $nodes[ $suffix ]{ offsets }}, $i if $FULL
            }
            else {
                my $p = { len => 1, lnk => 1, offsets => [ $i ]};
                push @nodes, $p;
                $n-> { edges }{ $c } = $suffix = $#nodes
            }
        }
        else {
            if ( exists $n-> { edges }{ $c }) {     # already seen?
                $suffix = $n-> { edges }{ $c };
                _descend( $suffix, $i, \@nodes ) if $FULL
            }
            else {
                my $L = 2 + $n-> { len };
                my $p = {
                    len => $L,
                    offsets => [ $i - $L + 1 ],
                };
                push @nodes, $p;
                $n-> { edges }{ $c } = $suffix = $#nodes;
                
                do {
                    $current = $n-> { lnk };
                    $n = $nodes[ $current ];
                } while ( $c ne substr $str, $i - $n-> { len } - 1, 1 
+);
                $p-> { lnk } = $n-> { edges }{ $c };
                
                _descend( $p-> { lnk }, $i, \@nodes ) if $FULL
            }
        }
        
        # make sure it's "proper"
        $lpspi = ( $i + 1 ) == $nodes[ $suffix ]{ len }
            ? $nodes[ $suffix ]{ lnk }
            : $suffix;
    }
    
    return \@nodes if $FULL;
    splice @nodes, 0, 2;
    return map substr( $str, $_-> { offsets }[ 0 ], $_-> { len }), @no
+des
}

sub _descend {
    my ( $x, $i, $nodes ) = @_;
    
    while ( $x > 1 ) {
        my $n_ = $nodes-> [ $x ];
        push @{ $n_-> { offsets }}, $i - $n_-> { len } + 1;
        $x = $n_-> { lnk }
    }
}

1;
[download]

Then I checked PWC #146 i.e. next one, for recap; there were _very_ few blog links (it was New Year's vacation time, unfortunately), and only _one_ says its Perl is Python's port of _real_ eertree. This one I ran and found just too slow for short inputs; then it gave "deep recursion" (what???) warnings for still too short but slightly longer input. I terminated the process it took too long. Then I checked only 2 more random GH directory PWC #145 entries; didn't find eertree there neither. Found a funny excuse _not_ to do an eertree :-). I'd be _really_ grateful if someone points me to a GH PWC #145 Perl's eertree code.

Which leaves String::Eertree only, I think. With a harness as in (grand)*parent node:

my $s = join "", map chr(ord('a') + int rand 26), 1 .. 200;
cmpthese(-3, {
    eertree => sub { 'String::Eertree'->new(string => $s)->uniq_palind
+romes },
    rosetta_new => sub { rosetta_fixed($s) },
    ee => sub { ee($s) },
});

              Rate rosetta_new     eertree          ee
rosetta_new  271/s          --        -58%        -93%
eertree      651/s        140%          --        -84%
ee          3964/s       1362%        509%          --
[download]

Which, I guess, is due to "Perl's subs/methods calls are too expensive" (_lots_ of calls in S::E internally) and/or "Moo"? or something else? But wait, comparison above is unfair ("give all advantage etc.") because 'ee' skips some work if only unique palindromes are required. Let's rather do this:

cmpthese(-3, {
    eertree => sub { 'String::Eertree'->new(string => $s) },
    ee => sub { ee($s, 1) },
    
});

          Rate eertree      ee
eertree  648/s      --    -80%
ee      3199/s    394%      --
[download]

I.e. contestants consume their strings, then build internal structures as required, without producing unique or non-unique lists (note: 'ee' also provides offsets for each (non-unique) pali which S::E doesn't. Not too much work, but still...) Yeah, that's more fair.

In fact random chars, very short (200 _is_ short) strings don't look interesting. I tried what follows with (what looked at the time as) more entertaining input -- a string up to 400+k chars with "real" pali randomly interspersed within. Maybe it's as uninteresting as previous. The purpose was to also check the linearity i.e. O(n)-ness which, as I squinted in a skewed way (guess in which favour) didn't look right to me at first. False alarm, both appear O(n) ("X" - thousands of repetitions of known pali (randomly placed) in the input; Y -- seconds; A -- ee, B -- S::E). But perhaps an input can be constructed which results in catastrophic slowdown in A and/or B. I have no proof i.e. can't write scientific PDF papers.

+
  2.5 +-------------------------------------------------+
      |    +    +    +    +    +    +    +    +    +    |
      |                                                 |
      |                                                 |
      |                                               B |
      |                                                 |
      |                                            B    |
    2 |-+                                             +-|
      |                                          B      |
      |                                       B         |
      |                                                 |
      |                                     B           |
      |                                                 |
      |                                  B              |
  1.5 |-+                              B              +-|
      |                                                 |
      |                             B                   |
      |                                                 |
      |                           B                     |
      |                                                 |
      |                        B                        |
      |                      B                          |
    1 |-+                                             +-|
      |                   B                             |
      |                                                 |
      |                 B                               |
      |              B                                A |
      |                                          A A    |
      |            B                        A A         |
  0.5 |-+                              A A            +-|
      |         B                 A A                   |
      |       B                A                        |
      |                   A  A                          |
      |    B         A  A                               |
      |  B      A  A                                    |
      |    A  A +    +    +    +    +    +    +    +    |
    0 +-------------------------------------------------+
      0    2    4    6    8    10   12   14   16   18   20
      
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';
use String::Eertree;
use lib '.';
use Eertree 'ee';

use Benchmark qw/ timeit :hireswallclock /;
use Chart::Gnuplot;
use File::Spec::Functions 'catfile';
use File::Basename 'dirname';

my $GP = catfile dirname( $^X ), '../../c/bin/gnuplot.exe';
die unless -x $GP;              # assume Strawberry Perl

use constant REPEATS => 5;

STDOUT-> autoflush( 1 );

sub get_sample {
    my $n = shift;
    
    my $str = 'amanaplanacanalpanama';
    my $s = '';
    for ( 1 .. $n * 1000 ) {
        substr $s, rand( length $s ), 0, $str
    }
    
    return $s
}

my ( @xdata, @ydata1, @ydata2, @datasets );

for my $n ( 1 .. 20) {
    print "$n ";
    my $s = get_sample( $n );
    
    push @xdata, $n;
    {
        my $t = timeit REPEATS, sub { ee( $s, 1 )};
        my $y = $t-> [ 1 ] / $t-> [ -1 ];
        push @ydata1, $y;
    }
    {
        my $t = timeit REPEATS, sub { String::Eertree-> new( string =>
+ $s )};
        my $y = $t-> [ 1 ] / $t-> [ -1 ];
        push @ydata2, $y;
    }
}
print "\n\n";

push @datasets, Chart::Gnuplot::DataSet-> new(
    xdata => \@xdata,
    ydata => \@ydata1,
    style => 'points',
), 
Chart::Gnuplot::DataSet-> new(
    xdata => \@xdata,
    ydata => \@ydata2,
    style => 'points',
);

my $chart = Chart::Gnuplot-> new( 
    gnuplot  => $GP,
    terminal => 'dumb size 60, 40',
);
$chart-> plot2d( @datasets );
[download]

Actually, thank you very much, @choroba, for introducing me to the subject. Very entertaining.

Comment on Re^6: Possessive sub-pattern with non-greedy content + recursion: WHY does this work?? Select or Download Code

Replies are listed 'Best First'.
Re^7: Possessive sub-pattern with non-greedy content + recursion: WHY does this work?? by Anonymous Monk on Aug 14, 2025 at 21:27 UTC
I'd be _really_ grateful if someone points me to a GH PWC #145 Perl's eertree code. Please don't, I completely forgot how to navigate PWC recap, and only checked "Blogs with Creative Title". There aren't too many not-empty GH sub-directories anyway. There, a few regex-based solutions, and quite a few Rosetta-Perl-snippet-like, but correct/fast (in O(N2) sense) -- a plenty to choose from to benchmark against instead of really really bad Rosetta Perl sample :-). Of eertree solutions: Mr. roger-bell-west's solution (gonna mention nicknames w/o people knowing... how else to refer to code snippets?) gives only 'a ama m ana n p l c' for 'amanaplanacanalpanama' input. They say it's Rosetta Python's port, but then it's broken port, had to disqualify, sorry. Mr. polettix' code generates "Use of uninitialized value in array element" warnings for non-trivial inputs, and 'ertreetre' "palindrome" for 'eereertreetree' input. Had to disqualify, sorry. Others seem to be correct, I gave them 1+ MB string input (50K randomly interspersed 'amanaplanacanalpanama's, similar to parent node), then in seconds: `colin-crain 4.41618514060974 e-choroba 8.4720139503479 james-smith 5.69543313980103 wlmb 0.999744176864624` [download] All but choroba's are able to generate unique PD's only; so it's not the same league/work they do. Judging by "ee" vs. "ee(1)", it's 1:2 ratio of work/time. As imperfect test as it is, but it reveals the fastest of them all (subroutines which modify lexicals in enclosing scope(s)? Assignment operator w/o space on each side? Go figure) -- i.e. until "ee" arrival 3.5 years later: `ee 0.81856107711792 ee (1) 1.62736701965332` [download] "ee" is a call to return unique PD's, "ee(1)" is "find them all". BTW, looking at Mr.james-smith and Mr.wlmb code/comments ("growing a PD on each side of a 'center'"?) -- it's either they do not quite "eertree" as described in PDF, or it's somewhat modified algorithm described elsewhere I haven't seen.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^7: Possessive sub-pattern with non-greedy content + recursion: WHY does this work??
by Anonymous Monk on Aug 14, 2025 at 21:27 UTC

I'd be _really_ grateful if someone points me to a GH PWC #145 Perl's eertree code.

Please don't, I completely forgot how to navigate PWC recap, and only checked "Blogs with Creative Title". There aren't too many not-empty GH sub-directories anyway. There, a few regex-based solutions, and quite a few Rosetta-Perl-snippet-like, but correct/fast (in O(N2) sense) -- a plenty to choose from to benchmark against instead of really really bad Rosetta Perl sample :-).

Of eertree solutions: Mr. roger-bell-west's solution (gonna mention nicknames w/o people knowing... how else to refer to code snippets?) gives only 'a ama m ana n p l c' for 'amanaplanacanalpanama' input. They say it's Rosetta Python's port, but then it's broken port, had to disqualify, sorry.

Mr. polettix' code generates "Use of uninitialized value in array element" warnings for non-trivial inputs, and 'ertreetre' "palindrome" for 'eereertreetree' input. Had to disqualify, sorry.

Others seem to be correct, I gave them 1+ MB string input (50K randomly interspersed 'amanaplanacanalpanama's, similar to parent node), then in seconds:

colin-crain     4.41618514060974
e-choroba       8.4720139503479
james-smith     5.69543313980103
wlmb            0.999744176864624
[download]

All but choroba's are able to generate unique PD's only; so it's not the same league/work they do. Judging by "ee" vs. "ee(1)", it's 1:2 ratio of work/time. As imperfect test as it is, but it reveals the fastest of them all (subroutines which modify lexicals in enclosing scope(s)? Assignment operator w/o space on each side? Go figure) -- i.e. until "ee" arrival 3.5 years later:

ee              0.81856107711792
ee (1)          1.62736701965332
[download]

"ee" is a call to return unique PD's, "ee(1)" is "find them all".

BTW, looking at Mr.james-smith and Mr.wlmb code/comments ("growing a PD on each side of a 'center'"?) -- it's either they do not quite "eertree" as described in PDF, or it's somewhat modified algorithm described elsewhere I haven't seen.

[reply]
[d/l]
[select]