Simple regex wordlist question

escherist has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Simple regex wordlist question by Zaxo (Archbishop) on Sep 11, 2007 at 05:24 UTC
That's an anagram problem, which has a canonical answer. First absorb your dictionary in a hash: `my %words; { open my $fh, '<', '/path/to/dict.txt' or die $!; while <$fh> { chomp; my $key = join '', sort split ''; push @{$words{$key}}, $_; } }` [download] That collects your data, now just rearrange your input word to make a key and print the associated words: `my $word = <>; # or however you get it chomp; $word = join '', sort split '', $word; print "@{$words{$word}}\n";` [download] Alphabetizing and mushing together the characters of a word provides a key for anagrams.. If this is used a lot or the dictionary is big, you may want to keep the "hash" in a database. After Compline, Zaxo	[reply] [d/l] [select]
Re: Simple regex wordlist question by ikegami (Patriarch) on Sep 11, 2007 at 04:46 UTC
What you want is "The currect dict word contains the first letter, the currect dict word contains the second letter, ... and the currect dict word contains the last letter." Regexp are not good at doing "AND". It's usually better to do multiple matches, and AND the results of the matches. Update: No, I suppose that's not quite right. You want "All letters of the currect dict word is in the specified word." A regexp character class can handle that quite well. `use strict; use warnings; my $input = <STDIN>; my ($re) = map qr/^[$_]$/, join '', map quotemeta, $input =~ /(.)/g; my $dict = 'wordlist.txt'; open(my $dict_fh, '<', $dict) or die("Unable to open dictionary file: $!\n"); while (<$dict_fh>) { print if /$re/; }` [download] Update*: Note that the OP didn't mention each letter could only be used once, and he didn't mention that all the letters needed to be used, so I didn't add those restrictions.	[reply] [d/l]
Re^2: Simple regex wordlist question by goibhniu (Hermit) on Sep 11, 2007 at 16:00 UTC
I think he is working on the puzzle where you think of as many words that can be spelled using letters in the given word. But what about quantity? I've already given ++ to Zaxo and CountZero for their anagram solutions, but I'm not sure now that's the point An example would help, but if I'm right, he wants, given 'PerlMonks' a list like: Perl (of course) role Poke sole solemn But would not like 'reel', as there are not two 'e's in the input. Is this what johngg has? perhaps vcTheGuru is closer to right with Math::Combinatorics, except maybe using the combine function looping from 1..($#input+1). I humbly seek wisdom.	[reply]
Re^3: Simple regex wordlist question by goibhniu (Hermit) on Sep 11, 2007 at 21:00 UTC
The following needs a great deal of polish. I took ikegami's basic regexp as a superset of the words you're looking for. Then I horribly misused it without all the safety that ikegami was right to include (qr// syntax, etc.). I also ended up turning your semantic inside out in that the dictionary file is now on the command line and the starting word is hard coded; sorry about that. The guts of this solution involve counting letters in a regexp: Read more... (4 kB) I humbly seek wisdom.	[reply] [d/l] [select]
Re: Simple regex wordlist question by CountZero (Bishop) on Sep 11, 2007 at 05:38 UTC
Sometimes you have to take a step back to see the big picture. My solution would be to pre-process the dictionary list and make it into a "key = value" list. The key is the sorted order of the value: so "Apache" becomes: "aacehp = Apache". Splitting and sorting the words in your dictionary file is as easy as `sort split '', lc $word` where $word contains a word in your file. The problem then becomes as easy as splitting and sorting your input in the same way and checking your pre-processed list for a match: `if ($guess eq $try_this){...}`. If the list itself is sorted a binary search would be blindingly fast even on huge dictionaries. Granted the pre-processing takes some time but it is a one time investment only and can of course be saved on persistent storage for a next time use. Update: Zaxo beat me to the punch. I blame our new kitten Hobbes which badly needed some petting. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: Simple regex wordlist question by atemon (Chaplain) on Sep 11, 2007 at 05:04 UTC
Your code is checking each line in the file with the input string. For that another way to do is, get all permutations from the input string, then compare each line in the file with all the permutations you got. To get all permutations, you can use Math::Combinatorics. Get all permutations into an array and you can use grep to compare. I have updated your code and is working fine for me. The code I updated is `#!/usr/bin/perl use strict; use warnings; use Math::Combinatorics; #to get all combinations my $input = <STDIN>; my @inputarray = split('',$input); my $file = 'wordlist.txt'; open(INFO, $file); my @lines = <INFO>; close(INFO); my @permutations= map { join "", @$_ } permute(@inputarray );# Get all + permutations of input string my %dup; @dup{@permutations}= (); my @tacobell= grep { exists $dup{$_} } @lines; # Check if any of the c +ombinations exists in file print join ("\n",@tacobell); exit();` [download] Update: Fixed typo. Cheers ! --VC There are three sides to any argument..... your side, my side and the right side.	[reply] [d/l]
Re^2: Simple regex wordlist question by b4swine (Pilgrim) on Sep 11, 2007 at 05:19 UTC
This could get awfully slow. 6! is only 720, but if he went to 7! that would be 5040. The OP's solution is faster, and ikegami's solution is probably even more so.	[reply]
Re: Simple regex wordlist question by johngg (Canon) on Sep 11, 2007 at 11:01 UTC
You could do it with regular expression look-aheads. You have to make sure that you cater for input that has duplicate letters, e.g. butter or banana. You can mandate the length of result words in the regex without having to cut down your dictionary file. use strict; use warnings; my $dictFile = q{2of12.txt}; open my $dictFH, q{<}, $dictFile or die qq{open: $dictFile: $!\n}; chomp(my @words = <$dictFH>); close $dictFH or die qq{close: $dictFile: $!\n}; my $input = shift; my $charCt = length $input; my %chars; $chars{$_} ++ for split m{}, $input; my $charsWanted = join q{}, map { qq{(?=@{ [ qq{.*$_} x $chars{$_} ] })} } keys %chars; my $wordLength = qq{(?=.{@{ [ length $input ] }}\$)}; my $rxDict = qr{(?x)^$charsWanted$wordLength}; print map { qq{$_\n} } grep { m{$rxDict} } @words; [download] This seems to work quite fast, the reading of the dictionary file taking most of the time. Cheers, JohnGG	[reply] [d/l]
Re: Simple regex wordlist question by escherist (Initiate) on Sep 11, 2007 at 04:48 UTC
Anything more elegant than: `foreach $tacobell (@lines) { if ($tacobell =~ /$inputarray[0]/) { if ($tacobell =~ /$inputarray[1]/) { if ($tacobell =~ /$inputarray[2]/) { if ($tacobell =~ /$inputarray[3]/) { if ($tacobell =~ /$inputarray[4]/) { if ($tacobell =~ /$inputarray[5]/) { print $tacobell; } } } } } } }` [download] I know I'm no programming expert :/	[reply] [d/l]
Re: Simple regex wordlist question by klekker (Pilgrim) on Sep 11, 2007 at 06:27 UTC
Since you have alread build an array, I'd give grep a try: `my @matching_words = grep(/$your_regular_expression/, @lines);` [download] and `$your_regular_expression` is something like `/[anshef]{6}/i` (where 'anshef' is an example for your six valid characters for the word). k	[reply] [d/l] [select]
Re: Simple regex wordlist question by hangon (Deacon) on Sep 11, 2007 at 06:44 UTC
Here's how I would do it, with just a modification of your regex. `use strict; use warnings; my $input = <STDIN>; chomp $input; my $file = 'wordlist.txt'; open(INFO, $file); my @lines = <INFO>; close(INFO); for my $word (@lines){ if ($word =~ /^[$input]+$/){ print "$word\n"; } }` [download] Udate: I may have misinterpreted the OP. This obviously will not work if all six input characters must be used.	[reply] [d/l]
Re: Simple regex wordlist question by Anonymous Monk on Sep 11, 2007 at 18:44 UTC
Thank you guys for all of the help in getting this fixed up. I started playing with the Math::Combinatorics method before I saw any of the other posts then my gf made me go to bed so I haven't had a chance to play with them too much :/ This is what I have right now though. Something still isn't quite right with the matching from the dict file (I changed it to match 3-6 letter words from a full dictionary instead fyi). The @permutationsX arrays do in fact hold the different permutations of 6choose(3-6) but there's some problem matching the words to those in the dictionary file. #!/perl use strict; use warnings; use Math::Combinatorics; my $input = <STDIN>; chomp($input); my @inputarray = split('',$input); my $file = 'wordlist.txt'; open(INFO, $file); my @lines = <INFO>; close(INFO); chomp(@lines); print "\n\n\n\nGenerating Wordlist from \'$input\'...\n"; print "----------\n"; #find all 3 letter words and print out my @permutations3 = map { join "", @$_ } combine(3,@inputarray); my %dup3; @dup3{@permutations3} = (); my @tacobell3 = grep { exists $dup3{$_} } @lines; foreach (@tacobell3) { print "$_\n"; } #find all 4 letter words and print out my @permutations4 = map { join "", @$_ } combine(4,@inputarray); my %dup4; @dup4{@permutations4} = (); my @tacobell4 = grep { exists $dup4{$_} } @lines; foreach (@tacobell4) { print "$_\n"; } #find all 5 letter words and print out my @permutations5 = map { join "", @$_ } combine(5,@inputarray); my %dup5; @dup5{@permutations5} = (); my @tacobell5 = grep { exists $dup5{$_} } @lines; foreach (@tacobell5) { print "$_\n"; } #find all 6 letter words and print out my @permutations6 = map { join "", @$_ } permute(@inputarray); my %dup6; @dup6{@permutations6} = (); my @tacobell6 = grep { exists $dup6{$_} } @lines; foreach (@tacobell6) { print "$_\n"; } print "----------\n\n\n"; $_ = <STDIN>; exit(); [download] The point Goib brought up is correct. Given the letters tuegps, I would want to output 'getups' 'getup' 'guest' 'gust' 'tug' etc. The code I have is SO CLOSE! It's just having an issue matching the words :/ Thanks again for all the help! -Alan	[reply] [d/l]