in reply to Re^3: Turning A Problem Upside Down
in thread Turning A Problem Upside Down

BrowserUk,
You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes.

Actually, using an array takes up more space in memory (roughly the same on disk). This is because at a branch where the only valid node is 'z', a hash will insert one item where the array will insert 26 (25 of which are undef). Also, the array code would have to be modified to:

my $code = join('', 'push @{$data', (map { $_-=97; "[$_]"} sort{$a<=>$ +b} unpack 'C*', $_), "[26]}, '$_';");
This is because in an array the slots are positional and represent characters so a special slot (much like the key 'words' in the hash) needs to be created to store the words.

And the resultant file is just 8MB rather than 80MB. I started to modify the rest of teh code to use it, but then got lost and gave up,

As I indicated via /msg, on disk they are about the same but the hash is about 77-78MB in memory. I am not sure if it was my code that got you lost or the translation from hash to array. In the event that it was my code, let me briefly explain. First, all the words in the dictionary are converted:

# split and sort art => push @{$data{a}{r}{t}{words}}, 'art'; rat => push @{$data{a}{r}{t}{words}}, 'rat'; tar => push @{$data{a}{r}{t}{words}}, 'tar';
This allows us to walk the tree and know from where we are, what letters can possibly form a new word. Next, the initial work stack is populated from the input string (also sorted):
aeinrst => ( $data->{a}, einrst $data->{e}, inrst $data->{i}, nrst $data->{n}, rst $data->{r}, st )
We can stop there because words must be a minimum of 3 chars. Now each time we take an item off the stack, we check to see if the branch we are on has any words and print them out. We then iterate over the currently unused letters and place any potential branches from our current position:
my $item = pop @work; my ($tree, $str) = @$item; # $data->{a}, einrst push @work, [$data->{a}{e}, 'inrst'] if $data->{a}{e}; push @work, [$data->{a}{i}, 'nrst'] if $data->{a}{i}; push @work, [$data->{a}{n}, 'rst'] if $data->{a}{n}; push @work, [$data->{a}{r}, 'st'] if $data->{a}{r}; push @work, [$data->{a}{s}, 't'] if $data->{a}{s}; push @work, [$data->{a}{t}, ''] if $data->{a}{t};
As you can see, I push items on the stack that can't possibly lead to a valid word (too short). I thought about tracking length and short circuiting like I do on initial population but it was just too fast as is to bother with. There are other opportunities for short circuiting due to duplicates in the input string but they too didn't seem to be worth the effort.

I hate what you did with it!

I was overly verbose in order to make sure if I didn't understand something, I could break it up into small pieces and comment each one. I doubt I would have been quite as terse as yours had I written it naturally, but I don't like my version much either.

It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

Sure, but your comment (Not sure how my crude implementation stack up against yours, but it should compare favourably) from a year ago made me look to see if there was anything obvious. I have a knack for wasting time learning things the hard way. In this case, I was just thinking out loud.

Cheers - L~R