BrowserUk,
Seems to me that you've fallen into a pattern of looking at things in terms of combinations, permutations & powersets et al.

You are absolutely correct. I did look at the problem from many different sides but, with the subconscious objective of precomputing everything, they all turned out to be different variations on the same theme.

You could build a trie, but they are not efficient built in terms of perl's available data structures.

Assuming that the word list file will remain fairly static and assuming it transforms into a data structure (trie or trie like) small enough that can stay memory resident, this seems like a reasonable approach. Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

#!/usr/bin/perl use strict; use warnings; use Getopt::Std; use Storable; use Time::HiRes qw/gettimeofday tv_interval/; my %opt; get_args(\%opt); if (defined $opt{w}) { my %data; open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for re +ading: $!"; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; $_ = lc($_); eval join('', 'push @{$data', (map {"{$_}"} sort split //, $_) +, "{words}}, '$_';"); } store(\%data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n" +; } my $data = retrieve($opt{d}); die "Unable to retrieve from '$opt{d}'\n" if ! defined $data; my $str = join('', map {('a' .. 'z')[rand 26]} 1 .. $opt{n}); print "Working from $str\n"; # Start time my $beg = [gettimeofday]; $str = join('', sort split //, $str); my @work; for (0 .. length($str) - 3) { my $tree = $data->{substr($str, $_, 1)}; push @work, [$tree, substr($str, $_ + 1)] if $tree; } my %seen; while (@work) { my $item = pop @work; my ($data, $str) = @$item; for (@{$data->{words} || []}) { print "$_\n" if ! $seen{$_}++; } my $last_pos = length($str) - 1; for (0 .. $last_pos) { my $tree = $data->{substr($str, $_, 1)}; next if ! $tree; my $new_str = $_ < $last_pos ? substr($str, $_ + 1) : ''; push @work, [$tree, $new_str]; } } my $end = [gettimeofday]; print "Found ", scalar keys %seen, " words in ", tv_interval($beg, $en +d), " seconds\n"; sub get_args { my ($opt) = @_; my $Usage = qq{Usage: $0 <-d <datafile>> [-n <chars> -w <wordfile> + -h] -h : This help message -d : The (d)atastructure file Note: This will be where the wordlist file is stored aft +er conversion -w : The (w)ordlist file Note: Specify this option to build a new datastructure -n : The (n)umber of random characters to form words from Default: 7 } . "\n"; getopts('hd:w:n:', $opt) or die $Usage; die $Usage if $opt->{h} || ! defined $opt->{d}; $opt->{n} = 7 if ! defined $opt->{n} || $opt->{n} =~ /\D/ || $opt- +>{n} < 3; }

The alternative is to use a set of bitstrings to to index the words containing each of the letters.

I have had a note to go back and figure out what you were doing here for over a year now. Today, I sat down to do just that. Would you mind reviewing what I have and correcting anything I got wrong? Note: I rewrote it in my own style as a mechanism for understanding it.

#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] || 'TWL06.txt'; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my @word; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; push @word, lc($_); } my %index; for ('a' .. 'z') { # Bitstring long enough to represent all words - all bits set to 0 $index{$_} = chr(0) x int((@word + 8) / 8); } # For each unique char in every word # Set the bit corresponding to the index of the word to 1 for my $idx (0 .. $#word) { my %seen; for my $chr (sort split //, $word[$idx]) { vec($index{$chr}, $idx, 1) = 1 if ! $seen{$chr}++; } } print "Please enter an input string: "; chomp(my $input = <STDIN>); my @include = split //, $input; my @exclude = grep {! (1 + index($input, $_))} 'a' .. 'z'; # list of l +etters not in input string my $mask = chr(0) x int((@word + 8) / 8); # Turn on bits for all words that have at least 1 letter in common wit +h input string $mask |= $_ for @index{@include}; # Turn off bits for any word that contains at least 1 letter not in th +e input string for (@exclude) { # Words not containing excluded letter (though they may contain le +tters not in input string) my $remain = ~ $index{$_}; # Words that do not have excluded letter but do have letter in com +mon with input string $mask &= $remain; } for my $idx (0 .. $#word) { next if ! vec($mask, $idx, 1); # Not even a candidate next if ! finalCheck($word[$idx], $input); print "$word[$idx]\n"; } sub finalCheck { my ($candidate, $allowed) = @_; for (split //, $candidate) { $allowed =~ s/$_// or return; # return if the letter does not +remain in the candidate list } return 1; }

Assuming I understood it correctly, there isn't a lot of room for optimizations. Instead of recreating the zeroed bitstring 27 times, just do it once. The finalCheck() could be inlined (or converted to Inline::C). It may be faster to skip candidate words that are longer than the input string. You could also use Storable the same way I did to reduce the constant time construction of the data structure. I feel silly that I didn't spend some time a year ago to try and properly understand this as it is quite beautiful.

Cheers - L~R


In reply to Re^2: Turning A Problem Upside Down by Limbic~Region
in thread Turning A Problem Upside Down by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.