in reply to Comparing Lines within a Word List

#!/usr/bin/env perl use strict; use warnings; use Data::Dumper; # Filter out all words that can't be candidates one way or the other my @candidates=grep { m{[RS]} } <DATA>; chomp(@candidates); # Sort them in order of length my @orderedCandidates=sort{ length $a <=> length $b } @candidates; # Push a guard on the end push @orderedCandidates,""; warn Data::Dumper->Dump([\@orderedCandidates],[qw(*orderedCandidates)] +),' '; my %hash; my $length=length($orderedCandidates[0]); for my $word (@orderedCandidates) { if (length($word) == $length) { # Same length ... add it to the ha +sh $hash{$word}=undef; } else { # Hash is complete for my $Rword (grep { m{R} } keys %hash) { # Word has a R my $pos=0; while (($pos=index($Rword,'R',$pos)) >= 0) { # Found an R # Make its S equivalent my $Sword=$Rword; substr($Sword,$pos,1)='S'; print "$Rword - $Sword\n" if (exists $hash{$Sword}); $pos++ } } # Done with this hash --- so start over %hash=(); $hash{$word}=undef; $length=length($word); }; } __DATA__ ONE THREE FOUR FOUS RRRRRRRR RSRRRRRS RRSRRRRR
which yields
FOUR - FOUS RRRRRRRR - RRSRRRRR
Just for fun I modified this replacing R by each of the letters of the alphabet and S by another (using english-words.95 (~ 220,000 words). The R <=> S exchange is the most common with 381 words and the J <=> Q is the least common with only 5 words. Elapsed time for the run 344.65 seconds.

Replies are listed 'Best First'.
Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 27, 2016 at 19:57 UTC
    Thank you! Your explanatory comments regarding the overall approach and the details are exactly what I need to get my bearings. I'll now dive in there and make it all make sense to me. One question which I already have: I assume I should not just copy and paste my own data set, which is a massive file, into the code here to replace the example data set. What's the best way to handle that part? Many many thanks.

      Put your filename as first argument on command line, then change <DATA> to <>

        Do you mean that I should use the cd command in Terminal to change the directory to my word list file? Or is there some other function which my filename would be the argument of?
Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 29, 2016 at 21:04 UTC
    Hello and thank you again for posting this, clueless newbie. Unlike you I am in fact a clueless newbie, and I can't seem to run your code correctly, with my own word list for the data set. Might you have a minute to help, either here or via private message?
      Here you go ... more comments:
      #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Time::HiRes qw(time); die <<"__DOC__" perl $0 <dictionary file name> [<optional character> [<option swap +>]] so perl $0 <dictionary file name> R S will solve the problem as stated in the OP and perl $0 <dictionary file name> will try all of the pairs. (note that they are symmetrical ie r-s and +s-r will be the same) __DOC__ unless (@ARGV); # Get all the words into an array open(my $DATA,'<',$ARGV[0]) or die "Couldn't open '$ARGV[0]' for reading! $!"; my @all=<$DATA>; close($DATA) or die "Couldn't close '$ARGV[0]' after reading! $!"; chomp(@all); # @all now holds all of the words from the dictionary specified by $AR +GV[0] my $start=time(); my %counts; for my $R ($ARGV[1] || ('a'..'y')) { # Take $ARGV[1] or the letters fr +om a to y (no point to doing z) one at a time my $count=0; for my $S ($ARGV[2] || (chr(ord($R)+1)..'z')) { # Take $ARGV[2] or + the letters following $R one at a time my $re=qr{[$R$S]}; # Filter out everything that isn't relevant # (if it doesn't have $R or $s it can't be a word to be altere +d or a word after alteration) my @candidates=grep{ m{$re} } @all; # Order them by length (no use comparing a 4 letter word with +a 5 letter word) @candidates=sort{ length $a <=> length $b } @candidates; # Put a guard at the end of the array - to trigger the "comple +tion" push @candidates,''; my %hash; # Something to count up the number of matches my $count; # Initialize $length by setting it to length of the first word + (failure to do this is of no consequnce!) my $length=length $candidates[0]; for my $word (@candidates) { unless (length($word) == $length) { # Current word is of d +ifferent length - so we need to process everything already in the has +h for my $Rword (grep { m{r} } keys %hash) { # Word has +a $R my $pos=0; while (($pos=index($Rword,'r',$pos)) >= 0) { # Fou +nd a $R in $Rword # Make its S equivalent my $Sword=$Rword; substr($Sword,$pos,1)='s'; # Increment $count if $Sword appears in the ha +sh $count++ if (exists $hash{$Sword}); # Need to look for the next $R so $pos must be + incremented $pos++ } } # Done with this hash --- so recycle it (hey I'm green +! --- or really I date back to machines that only had 10,000 digits!) %hash=(); # A new length $length=length($word); }; $hash{$word}=undef; }; #warn "$R-$S: ",$count; # Save our count $counts{"$R-$S"}=$count; }; }; # All done - see how long all this took my $end=time(); printf("%.2f\n", $end-$start); # Dump the counts hash --- but we want the keys in sorted order $Data::Dumper::Sortkeys=1; print Data::Dumper->Dump([\%counts],[qw(*counts)]); # All done exit; __DATA__