in reply to Comparing Lines within a Word List

So you have a looming deadline and no idea how to solve your problem (every CS problem can be solved "using regular expressions in a Perl script"), so you come here to ask us to do your homework for you... well today is your lucky day :-P

My first thought was "spell checker", so I investigated how they work. Apparently one possible implementation is the use of a Trie. I've never worked with a trie before, so I decided to use this chance to try it out. Since it's the first time I'm working with tries, the code might have some bugs. I compared it to a plain, brute-force linear search regex match. In my first tests, the trie data structure takes roughly 5x the memory (as reported by Devel::Size), but on large word lists I saw a 100x to 700x speedup. (Note: My implementation of AnomalousMonk's xor+tr search seems to be roughly twice as fast as the regex implementation.)

If you want to understand what's going on in the code, the book Learning Perl is a good start, as is perldsc. Also, ask here.

(Protip: Don't get thrown out of your class or school for plagiarism and cite your sources.)

Replies are listed 'Best First'.
Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 27, 2016 at 16:19 UTC

    Just to clarify, I'm not a computer science student and my deadline is not a class deadline. I'm an artist with a background in mathematics, and I also occasionally publish crosswords and other word puzzles. My sense was that Perl and regular expressions could be enormously helpful in the latter, so I've been teaching myself with the O'Reilly book, but haven't learned enough yet to solve on my own a particular matching question that I could really use an answer to in the next few days, as it's to do with a printer deadline for a puzzle that I'm writing for my best friend's wedding.

    Thanks for these thoughts on the spell-checking approach. The code looks enormously interesting but obviously a lot to work through, and maybe more than I need at the moment? The solutions above appear to be able to tackle my specific problem, but perhaps there is something I'm not seeing.

      Try this with your dictionary file. If performance is a problem break into separate files according to word length.

      #!perl use strict; open IN,'dict.txt' or die "$!"; my %dict=(); for (<IN>){ chomp; $dict{uc $_}=1; }; close IN; for my $word (sort keys %dict){ next unless $word =~ /R/; my @f = split //,$word; # loop over each letter # changing R to S # to create new word in $w for my $i (0..$#f){ if ($f[$i] eq 'R'){ my $w = $word; substr($w,$i,1) = "S"; # check if generated word exists in dict if (exists $dict{$w}){ print "$word $w\n"; } } } }
      poj

      That's fine of course, we just get a lot of "do my homework for me" type questions here. The code works (at least on my machine) and as far as I can tell from your descriptions in this thread it generates the output you're looking for, so feel free to just use it. The learning curve of this particular code may be a little steep, and it contains a lot of benchmarking/statistics code which of course you don't need, but at least the theory behind a Trie is still something you can look into if you like. Doing a linear search on a long word list is inefficient when run often (which doesn't seem to be the case in your situation) so the size/speed tradeoff of a trie can be very useful.