in reply to Matching non-words
How about frequency counting?
    #!/usr/bin/perl -w
    use strict;

    open(DICT,"/usr/dict/words") or die $!;
    my %freq;

    #find the frequency count
    while (<DICT>) {
        next if /2/; #Modula-2
        next if /3/; #Modula-3 !
        chomp;
        $freq{$_}++ for split //,lc;
    }
    my $sum;
    $sum += $freq{$_} for keys %freq ;
    $freq{$_} = ($freq{$_} / $sum)*100 for keys %freq;

    #test some words and non-words
    my @test = qw(make money Fast sausage xstebxz ysvbWeioc);
    for (@test) {
        my $n;
        $n += $freq{$_} for split //,lc ;
        $n = $n / length;
        print "$_ \t $n \n";
    }

Gives this output:

    make        5.82020185353244
    money       5.88790803839067
    Fast        6.05828727002723
    sausage     7.14737507906388
    xstebxz     4.24403141340688
    ysvbWeioc   4.91268597200451

So if it returns > 5ish then it's a word, and if < 5ish, then it isn't! ;)
I'm unsure how reliable this might be (i.e. it could be completely bogus, and it clearly depends on the length of the non-words), and this isn't my field at all, so I might be doing it completely wrong! I'm very open to corrections from other monks who might be more knowledgeable in this area.
An alternative approach might be to look for the frequencies of letter pairs, i.e. certain consonants are unlikely to be next to each other in English words (e.g. vx), others are very likely (e.g. th).
I believe there's been some research on this, by linguists looking at sci-fi books, trying to ascertain what makes a made-up, sci-fi word a 'possible' word for English. Wouldn't know where to start looking for it though.
andy.
PS This gives false positives for Quark, Zoo and basically anything likely to get you a high score in Scrabble. It might still be useful for detecting some misspelled words though, i.e. a word is ok if it's in the dictionary *or* it passes this test.
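The "in the dictionary *or* passes the test" idea could be sketched roughly like this. It's only a minimal sketch: is_probable_word, the tiny %dict and the hard-coded toy scores are made-up stand-ins so it runs on its own, and the threshold of 5 is just the "5ish" from above.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: a word is accepted if it is in the dictionary
# OR its frequency score reaches the threshold.
sub is_probable_word {
    my ($word, $dict, $score, $threshold) = @_;
    return 1 if exists $dict->{lc $word};          # known word: accept outright
    return $score->($word) >= $threshold ? 1 : 0;  # otherwise fall back to the score
}

# Toy stand-ins so the sketch runs on its own (assumptions, not real data).
my %dict      = map { $_ => 1 } qw(quark zoo make money);
my %toy_score = (quark => 3.9, mney => 5.4, xstebxz => 4.2);
my $score     = sub { $toy_score{lc $_[0]} || 0 };

for my $w (qw(Quark mney xstebxz)) {
    printf "%-8s %s\n", $w,
        is_probable_word($w, \%dict, $score, 5) ? "ok" : "suspect";
}
```

With these stand-ins, Quark is accepted because it's in the dictionary despite its low score, mney is accepted on its score alone, and xstebxz is flagged. In real use the scorer would be the letter-frequency calculation from the script above.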
Update: New version. It uses letter pairs, and it penalises quite heavily for letter pairs that don't exist in the dictionary. It seems to be ok with some misspellings (moke miney fost fzst), and it distinguishes quite well between nonexistent words (xstebxz) and unusual words (quiz).
    #!/usr/bin/perl -w
    use strict;

    open(DICT,"/usr/dict/words") or die $!;
    my %freq;

    #find the frequency count
    while (<DICT>) {
        next if /2/; #Modula-2
        next if /3/; #Modula-3 !
        chomp;
        my $lc = lc;
        #substr at the last index returns one letter, so that is counted too
        $freq{$_}++ for map {substr($lc,$_,2)} (0..(length($lc)-1));
    }
    my $sum;
    $sum += $freq{$_} for keys %freq ;
    $freq{$_} = ($freq{$_} / $sum)*100 for keys %freq;

    #test some words and non-words
    my @test = qw(make money Fast sausage Quark zoo between quiz xstebxz
                  ysvbWeioc hsyebem ytwekwefp vjksfy shjkui moke miney fost fzst);
    for (@test) {
        my $n;
        my $lc = lc;
        #pairs that never occur in the dictionary score -1
        $n += ($freq{$_}||-1) for map {substr($lc,$_,2)} (0..(length($lc)-1));
        $n = $n / (length($_)-1);
        print "$_ \t $n \n";
    }

which gives output:

    make        0.780188653301433
    money       0.933641338723428
    Fast        0.772580186819936
    sausage     0.524755016546123
    Quark       0.358400022000385
    zoo         0.176140582460193
    between     0.540751129811438
    quiz        0.213770407648801
    xstebxz     -0.0845593964561046
    ysvbWeioc   0.0509343288507549
    hsyebem     0.164407043789933
    ytwekwefp   -0.0516768418447323
    vjksfy      0.0587795286417512
    shjkui      -0.0922531144295025
    moke        0.703003969236128
    miney       1.19489591067844
    fost        0.72133762340841
    fzst        0.243168422114054
(where everything with a score above (roughly) 0.2 is a word)
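One quirk of the script as posted: substr at the last index of a word returns a single letter, so each word's final letter gets counted as if it were a pair. A variant that only counts genuine two-letter pairs might look something like the sketch below. The sub names (pairs, pair_freq, pair_score) and the tiny inline word list are my own assumptions so that it runs without /usr/dict/words, and the scores it produces won't match the ones above, so the 0.2 cutoff would need re-tuning.

```perl
#!/usr/bin/perl -w
use strict;

# Return only the genuine two-letter pairs of a word ("the" -> "th", "he").
sub pairs {
    my $lc = lc shift;
    return map { substr($lc, $_, 2) } 0 .. length($lc) - 2;
}

# Build percentage frequencies of pairs from a list of words.
sub pair_freq {
    my (%freq, $sum);
    for my $word (@_) {
        for my $p (pairs($word)) { $freq{$p}++; $sum++ }
    }
    $freq{$_} = $freq{$_} / $sum * 100 for keys %freq;
    return \%freq;
}

# Mean pair frequency; unseen pairs are penalised with -1 as in the script above.
sub pair_score {
    my ($word, $freq) = @_;
    my @p = pairs($word) or return 0;   # one-letter words have no pairs
    my $n = 0;
    $n += ($freq->{$_} || -1) for @p;
    return $n / @p;
}

# Tiny stand-in word list (an assumption; the real script reads /usr/dict/words).
my $freq = pair_freq(qw(the then that make money fast this other));
printf "%s\t%.3f\n", $_, pair_score($_, $freq) for qw(than xvq);
```

With this toy list "than" comes out positive (th and ha are known pairs, an is not) while "xvq" scores exactly -1 because neither of its pairs was ever seen.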
andy.