Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Golf (Inspired): Repeated Phrases

by Masem (Monsignor)
on May 10, 2001 at 15:46 UTC ( [id://79364] : perlmeditation . print w/replies, xml ) Need Help??

The background here is this story, which was on slashdot as well as NPR and major wire services yesterday; for those without the time to read it, a professor caught up to 120 students (out of around 500) cheating on term assignments by comparing their electronically-submitted essays for 6 or more word phrases that were repeated in multiple papers. Those found cheating despite the school's honor code will be either denied their diplomas or have their diplomas revoked if they're already graduated.

While the story is somewhat chiling, I also wonder exactly how the professor approached the programming part of this. I'm very much doubting he used perl... :-)

Given two English text strings, $a and $b, and two integers $m and $n, 0 < $m <= $n. Both $a and $b have been stripped of punctuation and converted to lower case, leaving all characters as either ('a'..'z') or the space ' '.

Find the perl golf solution (fewest # of characters in code) that returns a list of phrases with at least $m but no more than $n words that are in both $a and $b.

update changed "$m < $n" to "$m <= $n"; shouldn't affect the golf solution, but makes sense if you want to find repeated phrases of only one size. Eg, if $m=$n=1, you could find all single words in common with both strings.

Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain

2001-05-10 Edit by Corion: Fixed title

Replies are listed 'Best First'.
Re (tilly) 1: Golf (Inspired): Repeated Phrases
by tilly (Archbishop) on May 10, 2001 at 17:34 UTC
    I find this question unclear. Suppose that $m is 2 and $n is 3. Given that the phrase "this is strange" appears, do you want to see ("this is strange") returned or ("this is strange", "this is", "is strange")?

    From both a programming and a spec problem I like only looking for phrases of a given size (which was what was done in the story). From that you can derive your solution through:

    sub p{ ($m,$n,@_)=@_;map{f($_,@_)}$m..$n }
    An incidental note. The mathematician in me reads your phrase "a list of" and wants to offer the following cheat as a "solution":
    sub p{()}
    (It is a list, and everything in it is a phrase of the right size that appears in both lists!)

    And, of course, I can show that all golf can be improved with the utterly ridiculous:

    sub p{}
      For sake of the golf problem, your example would return all 3 cases if m,n is 2,3. Extra credit for not including and/or removing phrases of x words that are part of phrases with y (y > x) words in the returned list.

      And the list must be of non-null words, so no trivial solutions :-)

      Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Golf (Inspired): Repeated Phrases
by chipmunk (Parson) on May 10, 2001 at 19:51 UTC
    Here's a solution to start things off. I assume that runs of spaces in the strings have been squashed (tr/ //s;). For this solution, a phrase will appear in the results more than once if it occurs in the first string more than once and the second string at least once.
    sub repeated { my@p;$_=pop()-1 for$m,$n; $_[1]=~($~=$1.$2)&&push@p,$~while$_[0]=~/(\w+)(?=(( \w+){$n,$m}))/g; @p }
    95 characters.

    Update: Shortened with the help of \b:

    sub repeated { my@p;$m=pop;$n=pop; $_[1]=~$1&&push@p,$&while$_[0]=~/\b(?=((\w+\b ?){$n,$m}))/g; @p }
    81 characters.

    Update: Gotta remember to reset the array when using push! Thanks MeowChow! Added 5 characters to each of the above.

    Update: MeowChow was actually pointing out a different problem... my code doesn't properly find phrases with less than the max number of words... I'm going to redo this completely and start a new sub thread when I get a working solution. (Darn you to heck, MeowChow! :D )

      I don't think this works quite right. Compare the output from the following two runs:
      @l = repeated('this dog a round', 'this dog a pound', 1, 3); @l = repeated('this dog a round', 'this dog a pound', 1, 4);
      UPDATE: Here's an 85-byte version that returns all the matches:
      sub repeated { ($s,$_,$m,$n)=@_;my@p;for$i($m..$n){$s=~$1&&push@p,$&while/\b(?=((\w+\ +b ?){$i}))/g}@p }
                     s aamecha.s a..a\u$&owag.print
Re: Golf (Inspired): Repeated Phrases
by chipmunk (Parson) on May 10, 2001 at 21:01 UTC
    Okay.... I was quite close with my original solution, but failed to fully test it. MeowChow pointed out the error in my approach; {$n,$m} will only match once, whereas I need it to match the whole range from $n to $m.

    So, I just needed to add another loop, and voila:

    sub repeated { ($_,$t,$n,$m,@p)=@_; for$i($n..$m){$t=~$1&&push@p,$&while/\b(?=((\w+\b ?){$i}))/g}@p }
    So that's 83 characters long, with help from MeowChow's solution.
Re: Golf (Inspired): Repeated Phrases
by Corion (Patriarch) on May 10, 2001 at 20:05 UTC

    My solution is not really a competitor, since I'm not that familiar with Perl Golf customs, so bear with me.

    use strict; sub f { my ($a,$b,$m,$n) = @_; my (@r) = (); my $l = " $a! $b"; while($l =~ s/(.*?)(( \w+){$m,$n} )(.*!.*\2.*)/ $1 =$4/) { push @r, $2; }; @r };

    Why I'm posting this at all is, that a previous version exhibited unanticipated (by me, that is) behaviour :

    # Previous version : my $l = " $a! $b"; while($l =~ s/(.*)(( \w+){$m,$n} )(.*!.*\2.*)/ $1 =$4/) { push @r, $2; }; @r

    That version only found the $m-word matches, because the greedy (.*) at the beginning forced the RE engine to work its way backwards through the string. As soon as I sacrificed another character, this problem went away. I haven't given the golfing much thought (at work, I can only think about concepts, not byte-fiddle with REs without raising suspicions ... ).