in reply to Re^2: Turning A Problem Upside Down
in thread Turning A Problem Upside Down

Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes. I tested this by replacing your trie builder with:

if (defined $opt{w}) {
    my @data;
    open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for reading: $!";
    while (<$fh>) {
        chomp;
        next if length($_) < 3 || /[^a-zA-Z]/;
        $_ = lc($_);
        my $code = join('',
            'push @{$data',
            (map { $_ -= 97; "[$_]" } sort { $a <=> $b } unpack 'C*', $_),
            "}, '$_';"
        );
        eval $code;
    }
    store(\@data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n";
}
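For example, feeding that loop the word 'art' (byte values 97, 114, 116) generates and evals a statement along the lines of:

    push @{$data[0][17][19]}, 'art';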

And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up; still, arrays should be quicker than hashes.

Note: I rewrote it in my own style as a mechanism for understanding it.

I do it myself all the time. (I hate what you did with it! But that's a (probably pointless) argument for another time. :)

Assuming I understood it correctly

You're spot on.

there isn't a lot of room for optimizations.

It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

And if I feed it the top 7 characters ordered by frequency in the dictionary, 'esiarnt' (which should be close to the worst case), it only takes 0.7 of a second to find the 243 compliant words, so there's not much reason to optimise there either.
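For reference, that frequency ordering can be derived from the word list with something like the following sketch (assuming the same TWL06.txt file and the same 3-letter, alpha-only filter as the builder above):

    use strict;
    use warnings;

    # Count letter occurrences across the eligible words (same filter as the trie builder).
    my %freq;
    open my $fh, '<', 'TWL06.txt' or die "Unable to open TWL06.txt: $!";
    while (<$fh>) {
        chomp;
        next if length($_) < 3 || /[^a-zA-Z]/;
        $freq{$_}++ for split //, lc $_;
    }
    close $fh;

    # The seven most frequent letters, most common first.
    my @top = ( sort { $freq{$b} <=> $freq{$a} } keys %freq )[ 0 .. 6 ];
    print join( '', @top ), "\n";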

it is quite beautiful

I think that Perl's bitwise logical operators are among its best-kept secrets.

They lend themselves to performing a huge variety of set-type operations very quickly, in a single operation, effectively 'in parallel'. They can perform a million SELECT-style operations on sets of 8000 items in 1.3 seconds:

$a = ~( $b = chr(0) x 1e3 );
$t = time;
my $x = $a ^ $b for 1 .. 1e6;
printf "Took %.6f seconds\n", time() - $t;;

Took 1.297000 seconds

Or a thousand such operations upon sets of 8 million items in 1.7 seconds:

$a = ~( $b = chr(0) x 1e6 );
$t = time;
my $x = $a ^ $b for 1 .. 1e3;
printf "Took %.6f seconds\n", time() - $t;;

Took 1.645000 seconds

And storage usage doesn't get much better than 1-bit per item. I've never actually looked to see how tightly coded the underlying opcodes are. I've never found the need to.
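For anyone who hasn't played with them: combined with vec(), they give you sets where membership costs a single bit and where intersection, union and difference are each a single string op. A minimal sketch (my own illustration, nothing to do with the code above):

    use strict;
    use warnings;

    # Two sets over the universe 0 .. 7999, stored one bit per item.
    my ( $odds, $small ) = ( '', '' );
    vec( $odds,  $_, 1 ) = 1 for grep { $_ % 2 } 0 .. 7999;   # the odd numbers
    vec( $small, $_, 1 ) = 1 for 0 .. 99;                     # 0 .. 99

    my $both   = $odds & $small;     # intersection
    my $either = $odds | $small;     # union
    my $only   = $small & ~$odds;    # difference: in $small but not in $odds

    # unpack '%32b*' sums the set bits, i.e. gives the cardinality of each result.
    printf "intersection: %d  union: %d  difference: %d\n",
        map { unpack '%32b*', $_ } $both, $either, $only;

Which should report 50, 4050 and 50 respectively for these two sets.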

Maybe I should one day though, as these tend to be the dark corners of the codebase that never get revisited. It's quite possible that they are currently written to operate byte-by-byte, which is generally far slower than dealing with data in register-sized chunks. With the advent of 64-bit registers, it is quite possible that they could be sped up significantly.



Re^4: Turning A Problem Upside Down
by Limbic~Region (Chancellor) on Oct 17, 2010 at 04:29 UTC
    BrowserUk,
    You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes.

    Actually, using an array takes up more space in memory (roughly the same on disk). This is because at a branch where the only valid node is 'z', a hash will insert one item where the array will create 26 slots (25 of which are undef). Also, the array code would have to be modified to:

    my $code = join('',
        'push @{$data',
        (map { $_ -= 97; "[$_]" } sort { $a <=> $b } unpack 'C*', $_),
        "[26]}, '$_';"
    );
    This is because in an array the slots are positional and represent characters, so a special slot (much like the key 'words' in the hash) needs to be created to store the words.
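    So, for illustration (my sketch, not code from either of our programs), descending the array version and collecting its words would look something like:

        # $node starts at the trie root (\@data); $letter is assumed to be lowercase a-z.
        sub walk_letter {
            my ($node, $letter) = @_;
            return $node->[ ord($letter) - 97 ];    # undef if there is no such branch
        }

        # Slot 26 holds the words that end at this node, mirroring the 'words' key.
        sub words_at {
            my ($node) = @_;
            return $node->[26] ? @{ $node->[26] } : ();
        }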

    And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up,

    As I indicated via /msg, on disk they are about the same but the hash is about 77-78MB in memory. I am not sure if it was my code that got you lost or the translation from hash to array. In the event that it was my code, let me briefly explain. First, all the words in the dictionary are converted:

    # split and sort
    art => push @{$data{a}{r}{t}{words}}, 'art';
    rat => push @{$data{a}{r}{t}{words}}, 'rat';
    tar => push @{$data{a}{r}{t}{words}}, 'tar';
    This allows us to walk the tree and know, from wherever we are, what letters can possibly form a new word. Next, the initial work stack is populated from the input string (also sorted):
    aeinrst => (
        $data->{a}, einrst
        $data->{e}, inrst
        $data->{i}, nrst
        $data->{n}, rst
        $data->{r}, st
    )
    We can stop there because words must be a minimum of 3 chars. Now each time we take an item off the stack, we check to see if the branch we are on has any words and print them out. We then iterate over the currently unused letters and place any potential branches from our current position:
    my $item = pop @work;
    my ($tree, $str) = @$item;    # $data->{a}, einrst

    push @work, [$data->{a}{e}, 'inrst'] if $data->{a}{e};
    push @work, [$data->{a}{i}, 'nrst' ] if $data->{a}{i};
    push @work, [$data->{a}{n}, 'rst'  ] if $data->{a}{n};
    push @work, [$data->{a}{r}, 'st'   ] if $data->{a}{r};
    push @work, [$data->{a}{s}, 't'    ] if $data->{a}{s};
    push @work, [$data->{a}{t}, ''     ] if $data->{a}{t};
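    Spelled out generically (my paraphrase of the above with hypothetical names, not the actual code), that loop amounts to:

        while ( my $item = pop @work ) {
            my ($node, $left) = @$item;

            # Any words stored at this branch are made entirely from letters of the input.
            print "$_\n" for @{ $node->{words} || [] };

            # Try each remaining letter; the letters after it become the new remainder.
            for my $i ( 0 .. length($left) - 1 ) {
                my $c = substr $left, $i, 1;
                push @work, [ $node->{$c}, substr($left, $i + 1) ] if $node->{$c};
            }
        }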
    As you can see, I push items on the stack that can't possibly lead to a valid word (too short). I thought about tracking length and short-circuiting, like I do on the initial population, but it was just too fast as-is to bother with. There are other opportunities for short-circuiting due to duplicates in the input string, but they too didn't seem to be worth the effort.
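    For what it's worth, that duplicate short-circuit would just be a matter of skipping a letter identical to the one tried immediately before it at the same level (the letters are already sorted); reusing the hypothetical names from the sketch above:

        my $prev = '';
        for my $i ( 0 .. length($left) - 1 ) {
            my $c = substr $left, $i, 1;
            next if $c eq $prev;    # a repeated letter can only re-find words reachable via the first copy
            $prev = $c;
            push @work, [ $node->{$c}, substr($left, $i + 1) ] if $node->{$c};
        }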

    I hate what you did with it!

    I was overly verbose in order to make sure that, if I didn't understand something, I could break it up into small pieces and comment each one. I doubt mine would have been quite as terse as yours had I written it naturally, but I don't like my version much either.

    It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

    Sure, but your comment (Not sure how my crude implementation stack up against yours, but it should compare favourably) from a year ago made me look to see if there was anything obvious. I have a knack for wasting time learning things the hard way. In this case, I was just thinking out loud.

    Cheers - L~R