in reply to Re: Turning A Problem Upside Down
in thread Turning A Problem Upside Down
You are absolutely correct. I did look at the problem from many different sides but, with the subconscious objective of precomputing everything, they all turned out to be different variations on the same theme.
You could build a trie, but they are not efficient built in terms of perl's available data structures.
Assuming that the word list file will remain fairly static and assuming it transforms into a data structure (trie or trie like) small enough that can stay memory resident, this seems like a reasonable approach. Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:
#!/usr/bin/perl use strict; use warnings; use Getopt::Std; use Storable; use Time::HiRes qw/gettimeofday tv_interval/; my %opt; get_args(\%opt); if (defined $opt{w}) { my %data; open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for re +ading: $!"; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; $_ = lc($_); eval join('', 'push @{$data', (map {"{$_}"} sort split //, $_) +, "{words}}, '$_';"); } store(\%data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n" +; } my $data = retrieve($opt{d}); die "Unable to retrieve from '$opt{d}'\n" if ! defined $data; my $str = join('', map {('a' .. 'z')[rand 26]} 1 .. $opt{n}); print "Working from $str\n"; # Start time my $beg = [gettimeofday]; $str = join('', sort split //, $str); my @work; for (0 .. length($str) - 3) { my $tree = $data->{substr($str, $_, 1)}; push @work, [$tree, substr($str, $_ + 1)] if $tree; } my %seen; while (@work) { my $item = pop @work; my ($data, $str) = @$item; for (@{$data->{words} || []}) { print "$_\n" if ! $seen{$_}++; } my $last_pos = length($str) - 1; for (0 .. $last_pos) { my $tree = $data->{substr($str, $_, 1)}; next if ! $tree; my $new_str = $_ < $last_pos ? substr($str, $_ + 1) : ''; push @work, [$tree, $new_str]; } } my $end = [gettimeofday]; print "Found ", scalar keys %seen, " words in ", tv_interval($beg, $en +d), " seconds\n"; sub get_args { my ($opt) = @_; my $Usage = qq{Usage: $0 <-d <datafile>> [-n <chars> -w <wordfile> + -h] -h : This help message -d : The (d)atastructure file Note: This will be where the wordlist file is stored aft +er conversion -w : The (w)ordlist file Note: Specify this option to build a new datastructure -n : The (n)umber of random characters to form words from Default: 7 } . "\n"; getopts('hd:w:n:', $opt) or die $Usage; die $Usage if $opt->{h} || ! defined $opt->{d}; $opt->{n} = 7 if ! defined $opt->{n} || $opt->{n} =~ /\D/ || $opt- +>{n} < 3; }
The alternative is to use a set of bitstrings to to index the words containing each of the letters.
I have had a note to go back and figure out what you were doing here for over a year now. Today, I sat down to do just that. Would you mind reviewing what I have and correcting anything I got wrong? Note: I rewrote it in my own style as a mechanism for understanding it.
#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] || 'TWL06.txt'; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my @word; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; push @word, lc($_); } my %index; for ('a' .. 'z') { # Bitstring long enough to represent all words - all bits set to 0 $index{$_} = chr(0) x int((@word + 8) / 8); } # For each unique char in every word # Set the bit corresponding to the index of the word to 1 for my $idx (0 .. $#word) { my %seen; for my $chr (sort split //, $word[$idx]) { vec($index{$chr}, $idx, 1) = 1 if ! $seen{$chr}++; } } print "Please enter an input string: "; chomp(my $input = <STDIN>); my @include = split //, $input; my @exclude = grep {! (1 + index($input, $_))} 'a' .. 'z'; # list of l +etters not in input string my $mask = chr(0) x int((@word + 8) / 8); # Turn on bits for all words that have at least 1 letter in common wit +h input string $mask |= $_ for @index{@include}; # Turn off bits for any word that contains at least 1 letter not in th +e input string for (@exclude) { # Words not containing excluded letter (though they may contain le +tters not in input string) my $remain = ~ $index{$_}; # Words that do not have excluded letter but do have letter in com +mon with input string $mask &= $remain; } for my $idx (0 .. $#word) { next if ! vec($mask, $idx, 1); # Not even a candidate next if ! finalCheck($word[$idx], $input); print "$word[$idx]\n"; } sub finalCheck { my ($candidate, $allowed) = @_; for (split //, $candidate) { $allowed =~ s/$_// or return; # return if the letter does not +remain in the candidate list } return 1; }
Assuming I understood it correctly, there isn't a lot of room for optimizations. Instead of recreating the zeroed bitstring 27 times, just do it once. The finalCheck() could be inlined (or converted to Inline::C). It may be faster to skip candidate words that are longer than the input string. You could also use Storable the same way I did to reduce the constant time construction of the data structure. I feel silly that I didn't spend some time a year ago to try and properly understand this as it is quite beautiful.
Cheers - L~R
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^3: Turning A Problem Upside Down
by BrowserUk (Patriarch) on Oct 17, 2010 at 01:40 UTC | |
by Limbic~Region (Chancellor) on Oct 17, 2010 at 04:29 UTC |