mikeraz's scratchpad

Public Scratchpad

GUIs for Perl? QtGui Gtk2

Palindrome codelet moved here Re: regex at word boundary

Testing against QM's RE solution. Also an Update on the effect of using Regexp::Common, see bottom.

mikeraz@tire:~/pal$ file tdata
tdata: ISO-8859 mail text, with very long lines
mikeraz@tire:~/pal$ wc tdata
  216527   539118 13345250 tdata
mikeraz@tire:~/pal$ time ./mr_pal.pl tdata > mr.out ; time ./qm_pal.pl tdata > qm.out

real    0m9.914s
user    0m9.799s
sys     0m0.073s

real    1m37.445s
user    1m32.658s
sys     0m0.135s

mikeraz@tire:~/pal$ perl -v

This is perl, v5.8.7 built for i486-linux-gnu-thread-multi

tdata is a mbox file with 591 messages. Repeated runs return essentially the same results. Note: these tests were performed earlier on a Solaris Sparc 5 with Perl v5.6.1 and the QM solution did not work at all, the RE used is not valid in that environment.

I'm surprised here by the extreme difference in the run time of the two routines. Here is the acutal code snippets I'm using in the test.

mr_pal.pl

#!/usr/bin/perl -w
use strict;
use POSIX;

# find palindromes in text file

my ($le, @lines, @F, $test, $pal, %pals, $start_char, $i, $word);

# cross line boundries but not paragraph boundries
# #'d out to match matching of comparison code
#$le = $/;
# $/ = "\n\n";
# @lines = <>;
# $/ = $le;

while (<>) {
    s/[[:punct:]]//g;

    (@F) = split;
    while (int @F) {
        # select array slices where last letter of last word in
        # slice equals first letter of first word
        $start_char = lc substr $F[0], 0, 1;
        foreach $i (1 .. $#F) {
            if( (lc substr $F[$i], -1) eq  $start_char) {

            # test for palindrome
            $test = lc join "", @F[0..$i];
                if($test eq reverse $test) {
                    $pal = join " ", @F[0..$i];
                    print "$. $pal\n";
                }
            }
        }

        # grab single word palindromes
        $word = shift @F;
        if(length $word > 2 && lc $word eq lc reverse $word) {
                print "$. $word\n";
        }
    }
}
[download]

qm_pal.pl

#!/usr/bin/perl

# find all palindrome phrases on each line
#     only phrases of two or more alpha
#     a phrase starts and ends on a word boundary
#
# nested palindrome phrases will also be found
# (including single words)

use strict;
use warnings;
our $N;

my $re =
  qr/(                         # start $1
      \b                       # left word edge
      ([a-z].*[a-z])           # at least 2 alpha
      \b                       # right word edge
      (??{                     # start code
          local $N = lc $^N;   # save capture
          $N =~ tr!a-zA-Z!!dc; # remove non-alpha
                               # fail if not pal
          '(?!)' if (lc($N) ne reverse lc($N));
         })                    # end code
     )                         # end $1
    /ix;

while (<>)
{
  my $found;
  while ( /$re/g )
  {
    # print "line $.:\n" unless $found;
    pos = pos() - length($1);  # find nested pals
    # print "(", pos, ") \"$1\"\n";
    print "$. $1\n";
    $found = 1;                # \n between groups
    pos = pos() + 1;           # advance one
  }
  print "\n" if $found;
}

exit;

__DATA__
god dog
alpha beta gamma stop pots wonka wonka wonka
bookkeeper raisinhead
what is a bookkeeper pop repeekkoob tomorrow?
boob tube A man, a plan, a canal, Panama! kook peep aha
aba abba abbba aabbbaa abbba abba aba
nested testset detsen nested
i prefer pi ip referp
[download]

Update: to not reinvent the wheel I installed Regexp::Common and modified my code to use it:

mikeraz@tire:~/pal$ diff -u mr_pal.pl re_pal.pl
--- mr_pal.pl   2005-12-12 08:24:57.000000000 -0800
+++ re_pal.pl   2005-12-13 06:47:01.080509464 -0800
@@ -1,6 +1,10 @@
 #!/usr/bin/perl -w
 use strict;
 use POSIX;
+use Regexp::Common qw /lingua/;
+
+# /^$RE{lingua}{palindrome}$/    and  print "is a palindrome\n";
+

 # find palindromes in text file

@@ -25,7 +29,7 @@

             # test for palindrome
             $test = lc join "", @F0..$i;
-                if($test eq reverse $test) {
+                if($test =~ /^$RE{lingua}{palindrome}$/) {
                     $pal = join " ", @F0..$i;
                    print "$. $pal\n";
                 }

and it's runtime is also slow, as in:

mikeraz@tire:~/pal$ time ./re_pal.pl tdata > out.re

real    1m3.960s
user    1m2.199s
sys     0m0.133s

Perhaps I should unroll my logic a bit so the lingua test acts on the whole line, rather than my decomposing it first,
So it seems that it is optimal to find a palindrome candidate, strip out punctuation and spacing, and test by comparing the string against a reverse of itself.

What are the holes in this analysis?