Re: Finding dictionary words in a string.
by Caron (Friar) on Mar 13, 2004 at 06:58 UTC
contains the words "apple", "cherry" and "blossom".
Actually, there is much more ...
#!/usr/bin/perl -w
use strict;
my $search_word = shift
or die "search word required\n";
my $count = 0;
open WORDS, "/usr/share/dict/words"
or die "can't open words file\n";
while (<WORDS>) {
chomp;
printf "%3d %s\n", ++$count, $_ if $search_word =~ /\Q$_\E/i;
}
close WORDS;
And here is the result:
$ perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
1 Al
2 apple
3 as
4 ask
5 blossom
6 cherry
7 err
8 he
9 her
10 Herr
11 Io
12 Los
13 loss
14 mow
15 of
16 so
17 we
I know it's brute force, but it takes less than half a second to give this result.
Consider that Perl's regex engine uses a Boyer-Moore-style search for constant substrings, which is probably the fastest approach available for plain text matching.
This is a great little program! I couldn't resist speeding it up a bit. A commonly used trick in search engines is to shift everything to upper case. Avoiding the /i flag in the regular expression roughly doubles the speed, on my machine at least. I also used Storable to create a slightly faster-loading dictionary.
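The uppercase trick is easy to measure with the core Benchmark module. A minimal sketch (the word list, string, and variable names here are stand-ins, not from the thread):

```perl
#!/usr/bin/perl
# Compare a case-insensitive match against matching pre-uppercased data.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @dict     = qw(apple cherry blossom ask loss);   # stand-in dictionary
my @ucdict   = map { uc } @dict;                    # pre-uppercased copy
my $target   = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
my $uctarget = uc $target;

cmpthese(-1, {
    case_insensitive => sub {
        my $n = grep { $target =~ /\Q$_\E/i } @dict;
    },
    pre_uppercased => sub {
        my $n = grep { $uctarget =~ /\Q$_\E/ } @ucdict;
    },
});
```

Both variants find the same matches; only the speed differs, and by how much depends on the perl build, as discussed below.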
#!/usr/bin/perl -w
use strict;
use Storable;
my $search_word = uc shift
or die "search word required\n";
my @dict;
my $rdict;
my $count = 0;
if (not -e 'words.sto') {
open WORDS, '/usr/share/dict/words'
or die "can't open words file\n";
while (<WORDS>) {
chomp;
push @dict, uc $_;
}
close WORDS;
$rdict= \@dict;
store $rdict, 'words.sto';
}
else {
$rdict= retrieve('words.sto');
}
for (@$rdict) {
printf "%3d %s\n", ++$count, $_ if $search_word =~ /\Q$_\E/;
}
It should work perfectly the first time! - toma
I can't figure out how you got that speed improvement. The time you save by removing the /i flag is lost when you build a large array in memory (45,000 words on my machine). Besides, do you really think an all-uppercase output is better?
$ time perl wsearch.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
1 Al
2 apple
3 as
4 ask
5 blossom
6 cherry
7 err
8 he
9 her
10 Herr
11 Io
12 Los
13 loss
14 mow
15 of
16 so
17 we
0.30user 0.00system 0:00.30elapsed 99%CPU
$ time perl wsearch2.pl owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf
1 AL
2 APPLE
3 AS
4 ASK
5 BLOSSOM
6 CHERRY
7 ERR
8 HE
9 HER
10 HERR
11 IO
12 LOS
13 LOSS
14 MOW
15 OF
16 SO
17 WE
0.48user 0.00system 0:00.48elapsed 99%CPU
Of course, I ran both scripts at least 4 times, to make sure that
the storable file was created and that both scripts were reading their
input from the disk cache.
Here are the timings for my machine. With this type
of program, the results will depend on the relative
performance of the hard disk, the CPU, the memory,
and the caches.
Even on a single machine, the results depend on the operating system kernel and, most importantly, on the build of perl. My vendor perl is an unusual build, and my own build is tuned more closely to my machine.
The kernel I have been running, 2.4.22-6.ll.rh80,
is an experimental low-latency kernel used for
real-time applications. The file system is ext3.
Timing, seconds                 Kernel 2.4.22-6.ll.rh80   Kernel 2.4.18-14
Vendor perl, loop                        2.938                  2.569
My perl, loop                            1.957                  1.677
Vendor perl, Storable, no /i             1.267                  0.966
My perl, Storable, no /i                 0.687                  0.567
If you prefer lower case to make the output look
better, just use the lc function instead
of the uc function.
I think the lesson is that performance improvements
are often difficult to generalize, so you have to
try it on your own system and see what works.
UPDATE:
The Code Smarter suggestion is very good. For the Storable code on 2.4.22-6.ll.rh80, the vendor perl timings are down to about 0.21 seconds, and with my perl they are at about 0.13 seconds - too fast for my primitive benchmark.
It should work perfectly the first time! - toma
Re: Finding dictionary words in a string.
by Corion (Patriarch) on Mar 13, 2004 at 09:06 UTC
To expand on the idea of the Boyer-Moore search, I think the fastest way to find whether your words are contained in a string is a trade-off of startup time versus runtime, since the startup cost can be cached.
The idea is to create a (huge) state machine from your dictionary that outputs a word after it's been found, while inching through your string. I'm not sure if the idea is feasible, as the amount of memory for storing the machine isn't negligible for large dictionaries. I currently think of using hashes for the word-tree, but possibly arrays are faster. The fastest would be to build a huge string out of the arrays and then use a tight C loop to run over it - demerphq did something similar (longest prefix matching, I think). An example tree could look like this:
# a negative number indicates how many chars you just matched
a -> p -> p -> l -> e -> -5
b -> a -> r -> -3
  -> l -> o -> s -> s -> o -> m -> -7
            -> u -> s -> e -> -6
u -> s -> e -> -3
The idea now is to inch through the string, keeping track of the "running" words, advancing them all down through the tree, and adding a new "running word" for the current character, starting at the top. If you don't find the current char in the tree at the current position, there was no word. If you end up at a negative number, your word started that many characters away.
Building that tree might be time consuming, but you can store it after creation. Walking that tree should be fairly fast - you might consider doing that in C if speed becomes the central issue.
Update: After thinking a short bit, and writing code a short bit, here is my try at it. I didn't optimize the code, so the matching loop might still have potential for optimization, for example in the cleaning out of non-matching words and the calculation of the matched string.
Update 2: After thinking about it a bit more: this only finds the longest word starting at each character, and could miss bee if there are bee and beetle in the dictionary and only beetl in the string. I've updated the code to catch such instances as well - now it outputs all matches, even submatches, such as bee in beetle.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $dict = {};
while (<DATA>) {
chomp;
insert($_);
};
sub insert {
my @chars = split '', shift;
my $len = - scalar @chars;
my $curr = $dict;
while (@chars) {
my $char = shift @chars;
if (! exists $curr->{$char}) { $curr->{$char} = {} };
$curr = $curr->{$char};
};
$curr->{''} = $len;
};
# print Dumper $dict;
my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejfbeetl';
my $pos = 0;
my @running;
while ($pos < length $str) {
my $char = substr( $str, $pos, 1 );
push @running, $dict;
for my $word (@running) {
if (exists $word->{''}) {
print "Found ", substr( $str, $pos + $word->{''}, - $word->{''} ), "\n";
};
if (exists $word->{$char}) {
$word = $word->{$char};
} else {
undef $word;
};
};
@running = grep { defined } @running;
$pos++;
};
__DATA__
apple
blossom
cherry
blouse
bar
bee
beetle
Re: Finding dictionary words in a string.
by kvale (Monsignor) on Mar 13, 2004 at 04:35 UTC
Instead of testing the string against every dictionary word, it would be better to search along the string for words that fit the current pattern. Suppose that you have some minimum word length n among your dictionary words. Then at each position in the string, take the substring of length n and binary-search your sorted word list for that pattern; for a dictionary of N words this is an O(log N) operation. Then check each of the possible words against the rest of the string. For a reasonable n, this will cut down search time drastically.
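The binary-search step can be sketched like this (the function names and the tiny dictionary are mine, purely for illustration): find the first sorted entry not less than the length-n stub, then walk forward while entries still share that prefix, verifying each candidate against the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the index of the first dictionary entry ge $prefix (binary search).
sub lower_bound {
    my ($dict, $prefix) = @_;
    my ($lo, $hi) = (0, scalar @$dict);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($dict->[$mid] lt $prefix) { $lo = $mid + 1 }
        else                          { $hi = $mid }
    }
    return $lo;
}

# All dictionary words that occur in $string starting at $pos,
# using the length-$n stub to narrow the candidate range.
sub words_at {
    my ($dict, $string, $pos, $n) = @_;
    my $stub = substr $string, $pos, $n;
    my @found;
    for (my $i = lower_bound($dict, $stub); $i < @$dict; $i++) {
        my $w = $dict->[$i];
        last if substr($w, 0, $n) gt $stub;    # past the matching block
        push @found, $w if substr($string, $pos, length $w) eq $w;
    }
    return @found;
}

my @dict = sort qw(app apple blossom cherry chat);
print "$_\n" for words_at(\@dict, 'xapplez', 1, 3);
```

Repeating `words_at` for each position of the input string gives all embedded words while only ever comparing against a narrow slice of the dictionary.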
This basic idea carried to its logical conclusion forms the basis for using a trie data structure to find words fast. There is a module Tree::Trie that implements this:
use Tree::Trie;
use strict;
my($trie) = new Tree::Trie;
$trie->add(qw[aeode calliope clio erato euterpe melete melpomene mneme polymnia terpsichore thalia urania]);
my(@all) = $trie->lookup("");
my(@ms) = $trie->lookup("m");
$" = "--";
print "All muses: @all\nMuses beginning with 'm': @ms\n";
my(@deleted) = $trie->remove(qw[calliope thalia doc]);
print "Deleted muses: @deleted\n";
A trie consisting of 250,000 words will take up a good deal of space, but even a truncated trie will speed things up.
Re: Finding dictionary words in a string.
by tachyon (Chancellor) on Mar 13, 2004 at 14:05 UTC
Text::ExtractWords is implemented. I presume you made this statement because you don't understand why the .pm is virtually empty and only contains stuff about DynaLoader and bootstrap. The code itself is in C, in ExtractWords.xs. You will need a C compiler to install it.
I made the statement because, according to http://search.cpan.org, the .pm file is the only file in the distro and the whole package is at version 0.08.
That is simply incorrect. You really need to learn how to navigate CPAN better before making such erroneous statements.
http://search.cpan.org/~hdias/Text-ExtractWords-0.08/ - have a close look.
The version is 0.08 with an update time of 13 Oct 2003. Look at the list of Special files and, as usual, you will see Changes, MANIFEST, Makefile.PL, README. You then see a list of .pm files, in this case 1. The .xs and C files are never listed on this page, AFAIK.
If you click the Browse link near top right, next to the date, you will see all the files in the distro, including the /t dir and C code files like .h, .c, .xs, and anything else included in the distro. Browse simply takes you to a dir where the distro you get from download was unpacked for your viewing pleasure.
If you want to see the module's history, read the Changes file. Version numbers per se depend entirely on the author's whim. I start at 0.01 and just keep moving up. I don't think anything I have written has gotten to 1.x yet. Other authors start at 1.0 for the initial release, so there is little value in the numbering. In the case of this module, it was released in May last year and has had a series of updates.
I think this is the third time in the last couple of weeks I've seen a negative reference to a module with a version < 1.
It is very common for modules to start at version 0.01. If you are looking for an indicator of stability or maturity,
the version number is not the place to look. Read the pod, the changes, the code; ignore the version.
Caveat: a beta version is indicated by an underscore in the version string, e.g. Foo-Bar-0.01_01.
Re: Finding dictionary words in a string.
by TilRMan (Friar) on Mar 13, 2004 at 05:58 UTC
#!/usr/bin/perl -w
use strict;
open WORDS, "/usr/share/dict/words" or die;
my @words = <WORDS>;
chomp @words;
close WORDS;
printf "Loaded %d words.\n", scalar @words;
# Favor long words
@words = sort { length $b <=> length $a } @words;
my $re = join '|', map quotemeta, @words;
while (<>)
{
study; # ??
1 while s/$re/[*]/io;
print;
}
Okay, then, how fast is a not-so-naive solution for you? It's still not a real algorithm for finding likely English words, though. (Seems like there should be one out there, but I've never seen it.) So I improvise. This one generates a list of letter triples from the word list and finds those first. That speeds things up considerably.
One thing that is bothering me is that you said you need to score each string by how well the words fit together. That sounds like a hairy problem if there are lots of ways to divide a string into words, each with different leftover garbage letters, different word lengths, etc. Whatever method you use to find the words will probably need to know something about how you score the strings so that it can look for the best fit.
my @words = #...;
my %triples;
foreach (@words) {
for (my $i = 0; $i < length($_) - 2; $i++) {
push @{$triples{substr $_, $i, 3} ||= []}, $_;
}
}
# Keep most popular triples
my @triples = grep { @{$triples{$_}} > 10 } keys %triples;
my %lensum;
foreach my $t (@triples) {
foreach my $w (@{$triples{$t}}) {
$lensum{$t} += length $w;
}
}
@triples = sort { $lensum{$b} <=> $lensum{$a} } @triples;
my $re = join '|', map quotemeta, @triples;
while (<>) {
print;
my @hopes;
while (/$re/gio) {
push @hopes, @{$triples{$&}};
}
next unless @hopes;    # an empty list would build an empty pattern
my $word = join '|', map quotemeta,
sort { length($b) <=> length($a) } @hopes;
1 while s/$word/[*]/i;
print;
}
Re: Finding dictionary words in a string.
by Vautrin (Hermit) on Mar 13, 2004 at 05:57 UTC
If you want speed, perhaps you should search for syllables or runs of several letters. For instance, the string "tccvssqqqlllxxccvv" is unlikely to contain any English words because it's all consonants. You could have a list of exceptions to search for (for instance, "cvs"), which would be much smaller (and thus much quicker) than using the entire dictionary. You could also extend the concept to vowels (i.e. "aeuaeey" can contain no multi-letter words because it is all vowels).
It's definitely possible to preprocess your dictionary by running it through a program that chops it up into syllables. (And by syllables, I mean combinations of n letters. The more letters you use, the more accurate, and the slower, the algorithm will be.) Add to that a few common-sense rules, like the one about a long run of consonants probably not being part of a word, and it should work.
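A cheap version of the consonant-run rule can act as a prefilter before any dictionary work. A rough sketch (the function name and the run-length threshold are invented; 'y' is treated as a consonant here for simplicity):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the string is worth a full dictionary search:
# reject anything containing a run of $max_run or more consonants.
sub looks_wordy {
    my ($s, $max_run) = @_;
    $max_run ||= 5;    # few English words have 5+ consecutive consonants
    return lc($s) !~ /[^aeiou]{$max_run,}/;
}

print looks_wordy('tccvssqqqlllxxccvv') ? "search\n" : "skip\n";
print looks_wordy('owijfwapplelaskf')   ? "search\n" : "skip\n";
```

Strings rejected here never touch the dictionary at all, which is where the speedup comes from; the threshold would need tuning against real data, since words like "strengths" have long consonant runs.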
Re: Finding dictionary words in a string.
by Anonymous Monk on Mar 13, 2004 at 04:32 UTC
#!c:/perl/bin/perl -w
use strict;
my %index;
open my $fh, '<', 'dict.dat' or die "open failed: $!";
while (<$fh>) {
chomp;
next if length($_) < 3;
push @{$index{ lc(substr($_, 0, 3)) }}, $_;
}
close $fh;
#print join(', ', map { @$_ } $index{'rad'}), $/;
my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
for my $i (0 .. length($str)-3) {
for my $j (3..length($str)-$i) {
my $substr = substr($str, $i, $j);
my $index = substr($substr, 0, 3);
next unless exists($index{$index});
for (map { @$_ } $index{$index}) {
print $_, $/ if $_ eq $substr;
}
}
}
Re: Finding dictionary words in a string.
by graff (Chancellor) on Mar 13, 2004 at 18:40 UTC
The scoring you describe seems like the more interesting part, and could guide the choice of a search method, but there's not enough detail in the OP to figure it out.
In particular, there needs to be some sort of trade-off between the two scoring criteria: "strings that are composed entirely of words get a better score" is somewhat at odds with "strings with the fewest number of words will get a higher score". Are you talking about solutions that involve just non-overlapping words? If so, then you would need to compare various possible parses of the input string. How would you score a string like "labsolve"? Does "lab, solve" score better or worse than "absolve, one letter left unused"?
In either case, it might be helpful to have your dictionary stored as an array of hashes; the index into the array is the length of words in that hash. (The hashes themselves could be sub-indexed by first letter, perhaps, or first two or three letters, as suggested by the AM in the first reply.) This way, you know before starting on the input string what the longest substring is that you need to match against, and for each substring that you test, you only need to compare it to a limited number of dictionary entries.
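The array-of-hashes layout might look like this (a minimal sketch; the five-word dictionary and the variable names are mine): `$by_len[$n]` holds every n-letter word, so testing a substring of length n is a single hash lookup, and `$#by_len` bounds the longest substring worth extracting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @dict = qw(lab solve absolve a on);

# Index the dictionary by word length: $by_len[3]{lab} etc.
my @by_len;
$by_len[length $_]{$_} = 1 for @dict;

my $str = 'labsolve';
my $max = $#by_len;    # longest word length: no point testing longer substrings
for my $pos (0 .. length($str) - 1) {
    for my $len (1 .. $max) {
        last if $pos + $len > length $str;
        my $sub = substr $str, $pos, $len;
        print "$sub\n" if $by_len[$len] && $by_len[$len]{$sub};
    }
}
```

For "labsolve" this reports the overlapping parses lab, a, absolve, and solve, which is exactly the ambiguity the scoring question above has to resolve.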
What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those get a better score than "oijwfHellowoifef", for example.
There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example, "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have nearly equivalent scores, because the first one doesn't have any junk characters but has a high word count, while the second one has only a few junk characters but a low word count.
# get the domain part dropping www. and passing back the
# domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au )
my ($domain, $tld) = get_domain( $url );
# chop domain into all possible substrings say 3-16 chars long, return ary ref
# there are very few valid well known words > 16 chars, virtually none > 24 chars
my $tokens = tokenize( $domain );
# get the possible words ordered by length(word) DESC ie longest first
# use a hash lookup or a RDBMS with a dynamically generated SQL IN clause
my $words = get_real_words_from_dict( $tokens );
# substitute out the words, as we remove longest first
# we avoid finding substrings like 'be' in 'beetle'
my $residual = $domain;
my @words = ();
for my $word( @$words ) {
# we may have duplicates of same word
push @words, $word while $residual =~ s/\Q$word\E//;
}
# remove '-' from residual so 'come-here' will be two words, no residual
$residual =~ s/-//g;
# work out % residual by length
$residual = 100 * length($residual) / length($domain);
# So now we have our data
# @words 0..N is number of words found
# $residual is the %residual on that domain name
# $tld is the top-level domain
# say we inserted into a Db table like:
CREATE TABLE urls (
url CHAR(75),
words INT,
residual FLOAT,
tld CHAR(10)
)
"INSERT INTO urls (url,words, residual, tld) VALUES(?,?,?,?)",
$url, scalar @words, $residual, $tld
You can now generate reports. Essentially you want something like:
SELECT url FROM urls
WHERE words >= 1
GROUP BY tld
ORDER BY words ASC, residual ASC
This does not apply limits or add bias for, say, a preference for .com domain names. It will output urls with a single word, lowest residual first, then two words, etc. Given what you want, if the residual is > 10-20% you can probably just ignore those URLs and not insert them.
Re: Finding dictionary words in a string.
by bart (Canon) on Mar 16, 2004 at 20:37 UTC
I have some serious doubts on the scale of the operation, but perhaps you could try applying Regex::PreSuf on the dictionary words, and match the string using this constructed regex. It looks fine in theory, except probably for memory constraints, and perhaps more stringent constraints on the size of the compiled regex.
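A sketch of what that would look like (the word list and string are stand-ins; Regex::PreSuf is a real CPAN module whose `presuf` function merges a word list into one prefix/suffix-compressed pattern, and this falls back to a plain alternation if it isn't installed, which matches the same strings but compiles to a larger regex):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(apple cherry blossom);

# Build one pattern matching any of the words.
my $pattern;
if (eval { require Regex::PreSuf; 1 }) {
    $pattern = Regex::PreSuf::presuf(@words);       # compressed pattern
} else {
    $pattern = join '|', map quotemeta, @words;     # plain alternation
}
my $re = qr/$pattern/;

my $str = 'owijfwapplelaskfiwejfcherryalkfwiofwfblossomowiejf';
while ($str =~ /($re)/g) {
    print "$1\n";
}
```

With a full /usr/share/dict/words list, the size of the compiled regex is where the memory concern mentioned above would show up.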