
Good day! I'm trying to write a script that takes a list of words and checks to see if each word exists within another word in the list. The specifics are as follows:

1) The word list contains 640,000 entries

2) A word cannot match itself (e.g. "a" cannot match "a")

3) A word cannot match its own plural, even if the plural is a different word (e.g. "a" cannot match "as")

4) A word cannot match itself followed by an apostrophe-s "'s" (e.g. "a" cannot match "a's", but "a" can match "aa's")
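
To make rules 2 through 4 concrete, here is a minimal sketch of the test I have in mind for a single pair of words. The name is_valid_match is made up for illustration and is not part of my script. It also assumes that "plural" only means appending a plain "s", and that a word may appear anywhere inside the other word.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): returns 1 when $small counts as
# a match inside $big under rules 2-4, and 0 otherwise.
sub is_valid_match {
    my ($small, $big) = @_;
    return 0 if $big eq $small;           # rule 2: a word cannot match itself
    return 0 if $big eq $small . "s";     # rule 3: nor its plain "s" plural (assumption)
    return 0 if $big eq $small . "'s";    # rule 4: nor itself plus apostrophe-s
    # Otherwise any occurrence of $small inside $big counts.
    return index($big, $small) >= 0 ? 1 : 0;
}

print is_valid_match("a", "as")   ? "match" : "no match", "\n";   # prints "no match"
print is_valid_match("a", "aa's") ? "match" : "no match", "\n";   # prints "match"
```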

Coming from a mainframe background, I am trying to use loops, but the performance is horrible. From what I've read, hashing seems to be the way to go, but I think I am still implementing this as a loop and getting very poor performance (100 records in 20 seconds).
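
As far as I understand the hashing idea, the point would be to stop scanning the whole list for every word, and instead cut each word into its substrings and look each one up in the hash. The following is only a rough sketch under that assumption: words_in_words is a hypothetical name, the "s" and "'s" checks are my reading of rules 3 and 4, and I have not timed it against the full 640,000-entry list.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): takes a list of words and returns
# a hash ref whose keys are the words found inside at least one other word.
sub words_in_words {
    my @words   = @_;
    my %is_word = map { $_ => 1 } @words;
    my %found;
    for my $big (@words) {
        my $len = length $big;
        # Enumerate every substring of $big by start position and size.
        for my $start (0 .. $len - 1) {
            for my $size (1 .. $len - $start) {
                my $small = substr($big, $start, $size);
                next unless $is_word{$small};     # one hash lookup, no list scan
                next if $small eq $big;           # rule 2: not itself
                next if $big eq $small . "s";     # rule 3: not its plain plural
                next if $big eq $small . "'s";    # rule 4: not its possessive
                $found{$small} = 1;
            }
        }
    }
    return \%found;
}

my $found = words_in_words(qw(a as a's aa's cat cats scatter));
print join(", ", sort keys %$found), "\n";   # prints "a, a's, cat"
```

The work per word then depends on the length of the word rather than on the size of the list, which is why I think this is what people mean when they suggest a hash.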

Here is what I've tried so far:

use warnings;
use strict;

#define constants
my $datapath="F:\\wordsinwords\\";
my $wordfile= $datapath."wordlist.txt";

#define variables
my $outrecs=0;
my $word;

open LOG, ">".$datapath."wordsinwords_LOG.txt" or die $!;
select LOG;
$|=1;

#read the wordlist file into an array
open (WORDFILE, "<", $wordfile) or die "Cannot open $wordfile: $!";
chomp (my @words = <WORDFILE>);
close (WORDFILE);

#main
#coerce the array into a hash
my %hash = map { $_ => 1 } @words;

#search for matches 
#I have no idea how to put a single
#regex together that could meet all of the criteria so
#I was going to run this multiple times to target specific
#criteria until I found all permutations.
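foreach $word (keys %hash) {
    # count $word if any key contains it after at least one other character
    # (\Q..\E quotes the apostrophes etc.; this is still a full scan per word)
    $outrecs++ if grep { /.+\Q$word\E/ } keys %hash;
}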

#===============================================
# This was an attempt at using an array
# but it was also very slow
#===============================================
#foreach $word (@words) {
#    $outrecs++ if ($found) = grep (/.+?$word/, @words);
#}

#close files & write out completion log
print LOG "Created output file with: ".$outrecs." records.\n";
close LOG;

Any suggestions would be greatly appreciated. Thank you!

In reply to Words in Words by sarchasm

	PerlMonks