Public Scratchpad | Download, Select Code To D/L |
Palindrome codelet moved here Re: regex at word boundary
Testing against QM's RE solution. Also an Update on the effect of using Regexp::Common, see bottom.
mikeraz@tire:~/pal$ file tdata tdata: ISO-8859 mail text, with very long lines mikeraz@tire:~/pal$ wc tdata 216527 539118 13345250 tdata mikeraz@tire:~/pal$ time ./mr_pal.pl tdata > mr.out ; time ./qm_pal.pl tdata > qm.out real 0m9.914s user 0m9.799s sys 0m0.073s real 1m37.445s user 1m32.658s sys 0m0.135s mikeraz@tire:~/pal$ perl -v This is perl, v5.8.7 built for i486-linux-gnu-thread-multitdata is a mbox file with 591 messages. Repeated runs return essentially the same results. Note: these tests were performed earlier on a Solaris Sparc 5 with Perl v5.6.1 and the QM solution did not work at all, the RE used is not valid in that environment.
I'm surprised here by the extreme difference in the run time of the two routines. Here is the acutal code snippets I'm using in the test.
mr_pal.pl
qm_pal.pl#!/usr/bin/perl -w use strict; use POSIX; # find palindromes in text file my ($le, @lines, @F, $test, $pal, %pals, $start_char, $i, $word); # cross line boundries but not paragraph boundries # #'d out to match matching of comparison code #$le = $/; # $/ = "\n\n"; # @lines = <>; # $/ = $le; while (<>) { s/[[:punct:]]//g; (@F) = split; while (int @F) { # select array slices where last letter of last word in # slice equals first letter of first word $start_char = lc substr $F[0], 0, 1; foreach $i (1 .. $#F) { if( (lc substr $F[$i], -1) eq $start_char) { # test for palindrome $test = lc join "", @F[0..$i]; if($test eq reverse $test) { $pal = join " ", @F[0..$i]; print "$. $pal\n"; } } } # grab single word palindromes $word = shift @F; if(length $word > 2 && lc $word eq lc reverse $word) { print "$. $word\n"; } } }
#!/usr/bin/perl # find all palindrome phrases on each line # only phrases of two or more alpha # a phrase starts and ends on a word boundary # # nested palindrome phrases will also be found # (including single words) use strict; use warnings; our $N; my $re = qr/( # start $1 \b # left word edge ([a-z].*[a-z]) # at least 2 alpha \b # right word edge (??{ # start code local $N = lc $^N; # save capture $N =~ tr!a-zA-Z!!dc; # remove non-alpha # fail if not pal '(?!)' if (lc($N) ne reverse lc($N)); }) # end code ) # end $1 /ix; while (<>) { my $found; while ( /$re/g ) { # print "line $.:\n" unless $found; pos = pos() - length($1); # find nested pals # print "(", pos, ") \"$1\"\n"; print "$. $1\n"; $found = 1; # \n between groups pos = pos() + 1; # advance one } print "\n" if $found; } exit; __DATA__ god dog alpha beta gamma stop pots wonka wonka wonka bookkeeper raisinhead what is a bookkeeper pop repeekkoob tomorrow? boob tube A man, a plan, a canal, Panama! kook peep aha aba abba abbba aabbbaa abbba abba aba nested testset detsen nested i prefer pi ip referp
Update: to not reinvent the wheel I installed Regexp::Common and modified my code to use it:
mikeraz@tire:~/pal$ diff -u mr_pal.pl re_pal.pl --- mr_pal.pl 2005-12-12 08:24:57.000000000 -0800 +++ re_pal.pl 2005-12-13 06:47:01.080509464 -0800 @@ -1,6 +1,10 @@ #!/usr/bin/perl -w use strict; use POSIX; +use Regexp::Common qw /lingua/; + +# /^$RE{lingua}{palindrome}$/ and print "is a palindrome\n"; + # find palindromes in text file @@ -25,7 +29,7 @@ # test for palindrome $test = lc join "", @F0..$i; - if($test eq reverse $test) { + if($test =~ /^$RE{lingua}{palindrome}$/) { $pal = join " ", @F0..$i; print "$. $pal\n"; }and it's runtime is also slow, as in:
mikeraz@tire:~/pal$ time ./re_pal.pl tdata > out.re real 1m3.960s user 1m2.199s sys 0m0.133sPerhaps I should unroll my logic a bit so the lingua test acts on the whole line, rather than my decomposing it first,
So it seems that it is optimal to find a palindrome candidate, strip out punctuation and spacing, and test by comparing the string against a reverse of itself.
What are the holes in this analysis?