in reply to Re: Turning A Problem Upside Down
in thread Turning A Problem Upside Down

BrowserUk,
Seems to me that you've fallen into a pattern of looking at things in terms of combinations, permutations & powersets et al.

You are absolutely correct. I did look at the problem from many different sides but, with the subconscious objective of precomputing everything, they all turned out to be different variations on the same theme.

You could build a trie, but they are not efficiently built in terms of perl's available data structures.

Assuming that the word list file will remain fairly static, and assuming it transforms into a data structure (trie or trie-like) small enough that it can stay memory resident, this seems like a reasonable approach. Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

#!/usr/bin/perl
use strict;
use warnings;

use Getopt::Std;
use Storable;
use Time::HiRes qw/gettimeofday tv_interval/;

my %opt;
get_args(\%opt);

if (defined $opt{w}) {
    my %data;
    open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for reading: $!";
    while (<$fh>) {
        chomp;
        next if length($_) < 3 || /[^a-zA-Z]/;
        $_ = lc($_);
        # e.g. 'art' becomes: push @{$data{a}{r}{t}{words}}, 'art';
        eval join('', 'push @{$data', (map {"{$_}"} sort split //, $_), "{words}}, '$_';");
    }
    store(\%data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n";
}

my $data = retrieve($opt{d});
die "Unable to retrieve from '$opt{d}'\n" if ! defined $data;

my $str = join('', map {('a' .. 'z')[rand 26]} 1 .. $opt{n});
print "Working from $str\n";

# Start time
my $beg = [gettimeofday];

$str = join('', sort split //, $str);
my @work;
for (0 .. length($str) - 3) {
    my $tree = $data->{substr($str, $_, 1)};
    push @work, [$tree, substr($str, $_ + 1)] if $tree;
}

my %seen;
while (@work) {
    my $item = pop @work;
    my ($data, $str) = @$item;
    for (@{$data->{words} || []}) {
        print "$_\n" if ! $seen{$_}++;
    }
    my $last_pos = length($str) - 1;
    for (0 .. $last_pos) {
        my $tree = $data->{substr($str, $_, 1)};
        next if ! $tree;
        my $new_str = $_ < $last_pos ? substr($str, $_ + 1) : '';
        push @work, [$tree, $new_str];
    }
}

my $end = [gettimeofday];
print "Found ", scalar keys %seen, " words in ", tv_interval($beg, $end), " seconds\n";

sub get_args {
    my ($opt) = @_;
    my $Usage = qq{Usage: $0 <-d <datafile>> [-n <chars> -w <wordfile> -h]
    -h : This help message
    -d : The (d)atastructure file
         Note: This will be where the wordlist file is stored after conversion
    -w : The (w)ordlist file
         Note: Specify this option to build a new datastructure
    -n : The (n)umber of random characters to form words from
         Default: 7
} . "\n";
    getopts('hd:w:n:', $opt) or die $Usage;
    die $Usage if $opt->{h} || ! defined $opt->{d};
    $opt->{n} = 7 if ! defined $opt->{n} || $opt->{n} =~ /\D/ || $opt->{n} < 3;
}

The alternative is to use a set of bitstrings to index the words containing each of the letters.

I have had a note to go back and figure out what you were doing here for over a year now. Today, I sat down to do just that. Would you mind reviewing what I have and correcting anything I got wrong? Note: I rewrote it in my own style as a mechanism for understanding it.

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0] || 'TWL06.txt';
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

my @word;
while (<$fh>) {
    chomp;
    next if length($_) < 3 || /[^a-zA-Z]/;
    push @word, lc($_);
}

my %index;
for ('a' .. 'z') {
    # Bitstring long enough to represent all words - all bits set to 0
    $index{$_} = chr(0) x int((@word + 8) / 8);
}

# For each unique char in every word
# Set the bit corresponding to the index of the word to 1
for my $idx (0 .. $#word) {
    my %seen;
    for my $chr (sort split //, $word[$idx]) {
        vec($index{$chr}, $idx, 1) = 1 if ! $seen{$chr}++;
    }
}

print "Please enter an input string: ";
chomp(my $input = <STDIN>);
my @include = split //, $input;
my @exclude = grep {! (1 + index($input, $_))} 'a' .. 'z'; # letters not in input string

my $mask = chr(0) x int((@word + 8) / 8);

# Turn on bits for all words that have at least 1 letter in common with input string
$mask |= $_ for @index{@include};

# Turn off bits for any word that contains at least 1 letter not in the input string
for (@exclude) {
    # Words not containing the excluded letter (though they may contain letters not in input string)
    my $remain = ~ $index{$_};
    # Words that do not have the excluded letter but do have a letter in common with input string
    $mask &= $remain;
}

for my $idx (0 .. $#word) {
    next if ! vec($mask, $idx, 1); # Not even a candidate
    next if ! finalCheck($word[$idx], $input);
    print "$word[$idx]\n";
}

sub finalCheck {
    my ($candidate, $allowed) = @_;
    for (split //, $candidate) {
        $allowed =~ s/$_// or return; # fail if the letter is no longer available in the allowed list
    }
    return 1;
}

Assuming I understood it correctly, there isn't a lot of room for optimizations. Instead of recreating the zeroed bitstring 27 times, just do it once and copy it. finalCheck() could be inlined (or converted to Inline::C). It may be faster to skip candidate words that are longer than the input string. You could also use Storable, the same way I did, to avoid paying the data structure's construction cost on every run. I feel silly that I didn't spend some time a year ago to try and properly understand this, as it is quite beautiful.
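For instance, a minimal sketch of the first and third tweaks (just the changed parts, untested; the variable names are the ones from my script above):

my $zeroed = chr(0) x int((@word + 8) / 8);  # built once...
$index{$_} = $zeroed for 'a' .. 'z';         # ...copied per letter
my $mask = $zeroed;                          # ...and once for the mask
# (string assignment copies, so each entry still gets its own buffer)

# Candidates longer than the input can never pass finalCheck(),
# so rule them out before the more expensive test:
my $max_len = length($input);
for my $idx (0 .. $#word) {
    next if ! vec($mask, $idx, 1);            # not even a candidate
    next if length($word[$idx]) > $max_len;   # too long to fit
    next if ! finalCheck($word[$idx], $input);
    print "$word[$idx]\n";
}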

Cheers - L~R

Re^3: Turning A Problem Upside Down
by BrowserUk (Patriarch) on Oct 17, 2010 at 01:40 UTC
    Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

    You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes. I tested this by replacing your trie builder with:

    if (defined $opt{w}) {
        my @data;
        open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for reading: $!";
        while (<$fh>) {
            chomp;
            next if length($_) < 3 || /[^a-zA-Z]/;
            $_ = lc($_);
            # e.g. 'art' becomes: push @{$data[0][17][19]}, 'art';
            my $code = join('', 'push @{$data', (map { $_ -= 97; "[$_]" } sort { $a <=> $b } unpack 'C*', $_), "}, '$_';");
            eval $code;
        }
        store(\@data, $opt{d}) or die "Can't store '\@data' in '$opt{d}'\n";
    }

    And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up; still, arrays should be quicker than hashes.

    Note: I rewrote it in my own style as a mechanism for understanding it.

    I do it myself all the time. (I hate what you did with it! But that's a (probably pointless) argument for another time. :)

    Assuming I understood it correctly

    You're spot on.

    there isn't a lot of room for optimizations.

    It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

    And if I feed it the top 7 characters ordered by frequency in the dictionary: esiarnt--which should be close to worst case--it only takes 0.7 of a second to find the 243 compliant words, so there's not much reason to optimise there either.

    it is quite beautiful

    I think that Perl's bitwise logical operators are one of its best-kept secrets.

    They lend themselves to performing a huge variety of set-type operations, very quickly, in a single operation. Effectively 'in parallel'. They can perform a million SELECT-style operations on sets of 8000 items in 1.3 seconds:

    $a = ~( $b = chr(0) x 1e3 );
    $t = time;
    my $x = $a ^ $b for 1 .. 1e6;
    printf "Took %.6f seconds\n", time() - $t;;

    Took 1.297000 seconds

    Or a thousand such operations upon sets of 8 million items in 1.7 seconds:

    $a = ~( $b = chr(0) x 1e6 );
    $t = time;
    my $x = $a ^ $b for 1 .. 1e3;
    printf "Took %.6f seconds\n", time() - $t;;

    Took 1.645000 seconds
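    To make the set flavour concrete, here is a tiny self-contained illustration (my own untested sketch, not from the scripts above): union, intersection, and difference each cost one string op, however large the sets:

    use strict;
    use warnings;

    my $x = chr(0) x 4;                  # a set with room for 32 items
    my $y = chr(0) x 4;
    vec($x, $_, 1) = 1 for 3, 7, 12;     # $x = { 3, 7, 12 }
    vec($y, $_, 1) = 1 for 7, 20;        # $y = { 7, 20 }

    my $union = $x | $y;                 # { 3, 7, 12, 20 }
    my $inter = $x & $y;                 # { 7 }
    my $diff  = $x & ~$y;                # { 3, 12 }

    print join(',', grep { vec($inter, $_, 1) } 0 .. 31), "\n";  # 7
    print join(',', grep { vec($diff,  $_, 1) } 0 .. 31), "\n";  # 3,12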

    And storage usage doesn't get much better than 1-bit per item. I've never actually looked to see how tightly coded the underlying opcodes are. I've never found the need to.

    Maybe I should one day though, as these tend to be the dark corners of the codebase that never get revisited. It's quite possible that they are currently written to operate byte-by-byte, which is generally far slower than dealing with data in register-sized chunks. With the advent of 64-bit registers, it is quite possible that they could be sped up significantly.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes.

      Actually, using an array takes up more space in memory (roughly the same on disk). This is because at a branch where the only valid node is 'z', a hash will insert one item where the array will allocate 26 slots (25 of which are undef). Also, the array code would have to be modified to:

      my $code = join('', 'push @{$data', (map { $_ -= 97; "[$_]" } sort { $a <=> $b } unpack 'C*', $_), "[26]}, '$_';");
      This is because in an array the slots are positional and represent characters, so a special slot (much like the key 'words' in the hash) needs to be created to store the words.
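      To make that concrete, here is where 'art' would land in each layout (a small illustration of my own, using the $_ -= 97 mapping from your code):

      # Hash trie: children live under letter keys; words live under a
      # multi-char key that can never collide with a letter:
      push @{ $data{a}{r}{t}{words} }, 'art';

      # Array trie: slots 0..25 are the children 'a'..'z', so the word
      # list needs a reserved slot of its own, e.g. index 26:
      push @{ $data[0][17][19][26] }, 'art';   # a=0, r=17, t=19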

      And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up,

      As I indicated via /msg, on disk they are about the same but the hash is about 77-78MB in memory. I am not sure if it was my code that got you lost or the translation from hash to array. In the event that it was my code, let me briefly explain. First, all the words in the dictionary are converted:

      # split and sort
      art => push @{$data{a}{r}{t}{words}}, 'art';
      rat => push @{$data{a}{r}{t}{words}}, 'rat';
      tar => push @{$data{a}{r}{t}{words}}, 'tar';
      This allows us to walk the tree and know, from wherever we are, which letters can possibly form a new word. Next, the initial work stack is populated from the input string (also sorted):
      aeinrst => (
          [$data->{a}, 'einrst'],
          [$data->{e}, 'inrst'],
          [$data->{i}, 'nrst'],
          [$data->{n}, 'rst'],
          [$data->{r}, 'st'],
      )
      We can stop there because words must be a minimum of 3 chars. Now each time we take an item off the stack, we check to see if the branch we are on has any words and print them out. We then iterate over the currently unused letters and push any potential branches from our current position:
      my $item = pop @work;
      my ($tree, $str) = @$item;    # e.g. $data->{a}, 'einrst'

      push @work, [$data->{a}{e}, 'inrst'] if $data->{a}{e};
      push @work, [$data->{a}{i}, 'nrst']  if $data->{a}{i};
      push @work, [$data->{a}{n}, 'rst']   if $data->{a}{n};
      push @work, [$data->{a}{r}, 'st']    if $data->{a}{r};
      push @work, [$data->{a}{s}, 't']     if $data->{a}{s};
      push @work, [$data->{a}{t}, '']      if $data->{a}{t};
      As you can see, I push items onto the stack that can't possibly lead to a valid word (too short). I thought about tracking length and short-circuiting like I do on initial population, but it was just too fast as-is to bother with. There are other opportunities for short-circuiting due to duplicates in the input string, but they too didn't seem to be worth the effort.
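      For the record, a rough, untested sketch of that length short-circuit: each work item would also carry the number of letters consumed so far (the initial population would push [$tree, substr($str, $_ + 1), 1]), and the inner loop could then prune branches that can no longer reach the 3-character minimum:

      my ($tree, $str, $depth) = @$item;   # $depth = letters used so far
      my $last_pos = length($str) - 1;
      for (0 .. $last_pos) {
          my $branch = $tree->{substr($str, $_, 1)};
          next if ! $branch;
          my $new_str = $_ < $last_pos ? substr($str, $_ + 1) : '';
          # used $depth + 1 letters so far with length($new_str) still
          # available - prune if a 3-letter word is out of reach
          next if $depth + 1 + length($new_str) < 3;
          push @work, [$branch, $new_str, $depth + 1];
      }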

      I hate what you did with it!

      I was overly verbose in order to make sure that, if I didn't understand something, I could break it up into small pieces and comment each one. I doubt I would have been quite as terse as yours had I written it naturally, but I don't like my version much either.

      It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

      Sure, but your comment from a year ago (Not sure how my crude implementation stacks up against yours, but it should compare favourably) made me look to see if there was anything obvious. I have a knack for wasting time learning things the hard way. In this case, I was just thinking out loud.

      Cheers - L~R