Maybe try frequency counting? Re: Matching non-words

Very interesting problem! ++Matts!

How about frequency counting?

#!/usr/bin/perl -w
use strict;

open(DICT,"/usr/dict/words") or die $!;
my %freq;

#find the frequency count
while (<DICT>) {
        next if /2/;  #Modula-2
        next if /3/;  #Modula-3 !
        chomp;
        $freq{$_}++ for split //,lc;
}
my $sum; $sum += $freq{$_} for keys %freq ;
$freq{$_} = ($freq{$_} / $sum)*100 for keys %freq;

#test some words and non-words
my @test = qw(make money Fast sausage xstebxz ysvbWeioc);
for (@test) {
        my $n;
        $n += $freq{$_} for split //,lc ;
        $n = $n / length;
        print "$_ \t $n \n";
}

Gives this output:

make     5.82020185353244
money    5.88790803839067
Fast     6.05828727002723
sausage          7.14737507906388
xstebxz          4.24403141340688
ysvbWeioc        4.91268597200451

So if it returns > 5ish then it's a word, and if < 5ish, then it isn't! ;)
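
The cutoff can be wrapped in a small helper. This is only a rough sketch, not part of the script above: the sub names (`letter_freq`, `mean_letter_score`, `looks_like_word`) are made up for illustration, the four-word list stands in for /usr/dict/words, and the cutoff is passed in because the "5ish" value only makes sense for percentages built from the full dictionary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a letter => percentage table from a list of words
# (same arithmetic as the script above).
sub letter_freq {
    my @words = @_;
    my (%freq, $sum);
    for my $w (@words) {
        for my $ch (split //, lc $w) {
            $freq{$ch}++;
            $sum++;
        }
    }
    $freq{$_} = $freq{$_} / $sum * 100 for keys %freq;
    return \%freq;
}

# Mean percentage of the letters in a word; unseen letters count as 0.
sub mean_letter_score {
    my ($freq, $word) = @_;
    my @chars = split //, lc $word;
    return 0 unless @chars;
    my $n = 0;
    $n += ($freq->{$_} // 0) for @chars;
    return $n / @chars;
}

# Hypothetical helper: 1 if the score clears the cutoff, else 0.
sub looks_like_word {
    my ($freq, $word, $cutoff) = @_;
    return mean_letter_score($freq, $word) > $cutoff ? 1 : 0;
}

# Toy word list standing in for the real dictionary (assumption for the demo).
my $freq = letter_freq(qw(make money fast sausage));
printf "%-10s %.2f\n", $_, mean_letter_score($freq, $_) for qw(make xstebxz);
```

With a real dictionary you would build the table once and pick the cutoff by looking at the scores of a few known words and non-words, as above.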

I'm unsure how reliable this might be (i.e. it could be completely bogus, and it clearly depends on the length of the non-words) - and this isn't my field at all, so I might be doing it completely wrong! - I'm very open to corrections from other monks who might be more knowledgeable in this area.

An alternative approach might be to look for the frequencies of letter pairs, i.e. certain consonants are unlikely to be next to each other in English words (e.g. vx), others are very likely (e.g. th).

I believe there's been some research on this, by linguists looking at sci-fi books, trying to ascertain what makes a made-up, sci-fi word a 'possible' word for English. Wouldn't know where to start looking for it though.

andy.

PS This gives false positives for Quark, Zoo and basically anything likely to get you a high score in Scrabble. It might still be useful for detecting some misspelled words though, i.e. a word is ok if it's in the dictionary *or* it passes this test.
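
The "in the dictionary *or* passes the test" idea might look something like this. It is a sketch under assumptions: `word_ok` is a made-up name, the dictionary is a two-word toy hash, and the scorer is a stand-in that just looks up two scores rounded from the output above instead of recomputing them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: accept a word if it is in the dictionary,
# or if the caller-supplied scoring function clears the cutoff.
sub word_ok {
    my ($dict, $score_fn, $cutoff, $word) = @_;
    return 1 if $dict->{ lc $word };
    return $score_fn->($word) > $cutoff ? 1 : 0;
}

# Toy dictionary (assumption for the demo).
my %dict = map { lc($_) => 1 } qw(Quark Zoo);

# Stand-in scorer: scores rounded from the output above, 0 for anything else.
my %demo_score = (make => 5.82, xstebxz => 4.24);
my $score_fn   = sub { $demo_score{ lc $_[0] } // 0 };

for my $w (qw(Quark make xstebxz)) {
    print "$w: ", (word_ok(\%dict, $score_fn, 5, $w) ? "ok" : "suspect"), "\n";
}
```

Passing the scorer in as a code ref keeps the check independent of whichever scoring scheme you end up using.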

Update: New version. It uses letter pairs, and it penalises quite heavily for letter pairs that don't exist in the dictionary. It seems to be ok with some misspellings (moke miney fost fzst), and it distinguishes quite well between nonexistent words (xstebxz) and unusual words (quiz).

#!/usr/bin/perl -w
use strict;

open(DICT,"/usr/dict/words") or die $!;
my %freq;

#find the frequency count
while (<DICT>) {
        next if /2/;  #Modula-2
        next if /3/;  #Modula-3 !
        chomp;
        my $lc = lc;
        $freq{$_}++ for map {substr($lc,$_,2)} (0..(length($lc)-1));
}
my $sum; $sum += $freq{$_} for keys %freq ;
$freq{$_} = ($freq{$_} / $sum)*100 for keys %freq;

#test some words and non-words
my @test = qw(make money Fast sausage Quark zoo between quiz xstebxz
              ysvbWeioc hsyebem ytwekwefp vjksfy shjkui moke miney fost fzst);

for (@test) {
        my $n;
        my $lc = lc;
        $n += ($freq{$_}||-1) for map {substr($lc,$_,2)} (0..(length($lc)-1));
        $n = $n / (length($_)-1);
        print "$_ \t $n \n";
}

which gives output:

make     0.780188653301433
money    0.933641338723428
Fast     0.772580186819936
sausage          0.524755016546123
Quark    0.358400022000385
zoo      0.176140582460193
between          0.540751129811438
quiz     0.213770407648801
xstebxz          -0.0845593964561046
ysvbWeioc        0.0509343288507549
hsyebem          0.164407043789933
ytwekwefp        -0.0516768418447323
vjksfy   0.0587795286417512
shjkui   -0.0922531144295025
moke     0.703003969236128
miney    1.19489591067844
fost     0.72133762340841
fzst     0.243168422114054

(where everything with a score above (roughly) 0.2 is a word)
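
One detail worth knowing: `substr($lc,$_,2)` at the last index returns a single character, so the loop above also counts each word's final letter on its own. Here is a hedged variant that scores only true two-letter pairs. The sub names are made up, the word list is a toy stand-in for the dictionary, and the scores it produces will not match the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All adjacent two-letter substrings of a word (none for 0- or 1-letter words).
sub letter_pairs {
    my $lc = lc $_[0];
    return map { substr($lc, $_, 2) } 0 .. length($lc) - 2;
}

# Build a pair => percentage table from a list of words.
sub pair_freq {
    my @words = @_;
    my (%freq, $sum);
    for my $w (@words) {
        for my $p (letter_pairs($w)) {
            $freq{$p}++;
            $sum++;
        }
    }
    $freq{$_} = $freq{$_} / $sum * 100 for keys %freq;
    return \%freq;
}

# Mean pair percentage; a pair never seen in the word list costs -1,
# the same penalty as in the script above.
sub pair_score {
    my ($freq, $word) = @_;
    my @pairs = letter_pairs($word);
    return 0 unless @pairs;
    my $n = 0;
    $n += ($freq->{$_} || -1) for @pairs;
    return $n / @pairs;
}

# Toy word list standing in for the real dictionary (assumption for the demo).
my $freq = pair_freq(qw(the then that this));
printf "%-6s %.3f\n", $_, pair_score($freq, $_) for qw(the vx);
```

Since the sum and the divisor now both count pairs only, the 0.2 cutoff would need re-tuning against the real dictionary.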

andy.
