Matching non-words

Matts has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Maybe try frequency counting? Re: Matching non-words by andye (Curate) on Dec 06, 2001 at 17:32 UTC
Very interesting problem! ++Matts! How about frequency counting?

```perl
#!/usr/bin/perl -w
use strict;

open(DICT,"/usr/dict/words") or die $!;
my %freq;

#find the frequency count
while (<DICT>) {
    next if /2/; #Modula-2
    next if /3/; #Modula-3 !
    chomp;
    $freq{$_}++ for split //,lc;
}
my $sum;
$sum += $freq{$_} for keys %freq;
# turn the raw counts into percentages
$freq{$_} = ($freq{$_} / $sum) * 100 for keys %freq;

#test some words and non-words
my @test = qw(make money Fast sausage xstebxz ysvbWeioc);
for (@test) {
    my $n;
    $n += $freq{$_} for split //,lc;
    $n = $n / length;   # average letter frequency for this word
    print "$_ \t $n \n";
}
```

Gives this output:

```
make       5.82020185353244
money      5.88790803839067
Fast       6.05828727002723
sausage    7.14737507906388
xstebxz    4.24403141340688
ysvbWeioc  4.91268597200451
```

So if it returns > 5ish then it's a word, and if < 5ish, then it isn't! ;)

I'm unsure how reliable this might be (i.e. it could be completely bogus, and it clearly depends on the length of the non-words) - and this isn't my field at all, so I might be doing it completely wrong! I'm very open to corrections from other monks who might be more knowledgeable in this area.

An alternative approach might be to look at the frequencies of letter pairs, i.e. certain consonants are unlikely to be next to each other in English words (e.g. vx), others are very likely (e.g. th). I believe there's been some research on this, by linguists looking at sci-fi books, trying to ascertain what makes a made-up, sci-fi word a 'possible' word for English. Wouldn't know where to start looking for it though.

andy.

PS This gives false positives for Quark, Zoo and basically anything likely to get you a high score in Scrabble. It might still be useful for detecting some misspelled words though, i.e. a word is ok if it's in the dictionary or it passes this test.

Update: New version. It uses letter pairs, and it penalises quite heavily for letter pairs that don't exist in the dictionary.
It seems to be ok with some misspellings (moke miney fost fzst), and it distinguishes quite well between nonexistent words (xstebxz) and unusual words (quiz).

```perl
#!/usr/bin/perl -w
use strict;

open(DICT,"/usr/dict/words") or die $!;
my %freq;

#find the frequency count
while (<DICT>) {
    next if /2/; #Modula-2
    next if /3/; #Modula-3 !
    chomp;
    my $lc = lc;
    # note: the last index yields a one-character substring (the final
    # letter on its own), which is counted alongside the real pairs
    $freq{$_}++ for map {substr($lc,$_,2)} (0..(length($lc)-1));
}
my $sum;
$sum += $freq{$_} for keys %freq;
$freq{$_} = ($freq{$_} / $sum)*100 for keys %freq;

#test some words and non-words
my @test = qw(make money Fast sausage Quark zoo between quiz xstebxz
              ysvbWeioc hsyebem ytwekwefp vjksfy shjkui moke miney fost fzst);
for (@test) {
    my $n;
    my $lc = lc;
    # pairs never seen in the dictionary are penalised with -1
    $n += ($freq{$_}||-1) for map {substr($lc,$_,2)} (0..(length($lc)-1));
    $n = $n / (length($_)-1);
    print "$_ \t $n \n";
}
```

which gives output:

```
make       0.780188653301433
money      0.933641338723428
Fast       0.772580186819936
sausage    0.524755016546123
Quark      0.358400022000385
zoo        0.176140582460193
between    0.540751129811438
quiz       0.213770407648801
xstebxz    -0.0845593964561046
ysvbWeioc  0.0509343288507549
hsyebem    0.164407043789933
ytwekwefp  -0.0516768418447323
vjksfy     0.0587795286417512
shjkui     -0.0922531144295025
moke       0.703003969236128
miney      1.19489591067844
fost       0.72133762340841
fzst       0.243168422114054
```

(where everything with a score above (roughly) 0.2 is a word)

andy.
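The PS above suggests treating a token as ok if it's in the dictionary or it passes the score test. Here is a minimal sketch of that combined check, not andye's code: the tiny inline word list is a hypothetical stand-in for /usr/dict/words, the names `pair_score` and `looks_like_word` are made up for illustration, and unlike the scripts above it counts only true two-letter pairs. The 0.2 cut-off is borrowed from the post and would need re-tuning against a real dictionary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tiny word list standing in for /usr/dict/words.
my @dict = qw(make money fast sausage quark zoo between quiz the then
              there that this with from have most cost mine);

my %known = map { lc($_) => 1 } @dict;

# Count adjacent letter pairs across the word list.
my (%pair, $total);
for my $word (keys %known) {
    for my $i (0 .. length($word) - 2) {
        $pair{ substr($word, $i, 2) }++;
        $total++;
    }
}

# Average pair frequency (as a percentage); unseen pairs cost -1,
# mirroring the penalty used in the post above.
sub pair_score {
    my $word = lc shift;
    return 0 if length($word) < 2;
    my $n = 0;
    for my $i (0 .. length($word) - 2) {
        my $p = substr($word, $i, 2);
        $n += exists $pair{$p} ? 100 * $pair{$p} / $total : -1;
    }
    return $n / (length($word) - 1);
}

# Accept a token if it is in the word list OR scores above the cut-off.
sub looks_like_word {
    my ($word, $threshold) = @_;
    return 1 if $known{ lc $word };
    return pair_score($word) > $threshold ? 1 : 0;
}

for my $w (qw(Quark moke xstebxz)) {
    printf "%-8s %s\n", $w, looks_like_word($w, 0.2) ? 'word' : 'non-word';
}
```

The dictionary lookup rescues the Scrabble-style words that score badly on pairs, while the pair score lets plausible misspellings through.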
Re: Matching non-words by mce (Curate) on Dec 06, 2001 at 15:14 UTC
Hi,

As I understand your question, the thing between the parentheses can be an email address, and in that case you want to trigger something. If I am completely wrong, well, I'll need a coffee.

You can use Email::Valid to check the contents between the parentheses and trigger if it is valid, e.g.

```perl
use strict;
use Email::Valid;

my $subject = "Subject: You've Won a Million Dollars!!!(xstebxz)";

# capture the text between the parentheses at the end of the subject
if ( $subject =~ /\((.*)\)$/ ) {
    if ( Email::Valid->address( -address => $1,
                                -mxcheck => 0,
                              )) {
        print "Subject contains email address\n";
    }
}
```

I hope this helps,

---------------------------
Dr. Mark Ceulemans
Senior Consultant
IT Masters, Belgium
Re: Matching non-words by chromatic (Archbishop) on Dec 06, 2001 at 23:51 UTC
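Frequency analysis seems like the best possibility. Supposing you can normalize subjects, how about using regexes to build heuristics?

```perl
# $potential_match (another subject to compare against) and register()
# are assumed to be defined elsewhere.

# replace every run of non-word characters with a single space
$subject =~ tr/a-zA-Z0-9_/ /sc;
my @words = split(' ', $subject);

# one optional capture group per word, separated by spaces
my $regex = join(' ', map { "($_)?" } @words);

# in list context the match returns every group, with undef for groups
# that did not take part, so count only the defined ones
my $num_matches = grep { defined } ($potential_match =~ $regex);

if ($num_matches == (@words - 1)) {
    register($words[-1]);
}
```

If the first `scalar @words - 1` tokens match, there's a good possibility the last piece is unique.

Food for thought.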
Re: Matching non-words by jlongino (Parson) on Dec 06, 2001 at 20:33 UTC
Maybe I don't fully understand the problem, but why can't you just match everything except what is inside the parens? You can verify that whatever is within the parens is relatively insignificant by comparing the subject length before and after extracting the "random" string.

If the "random" string falls at the end of the message but isn't within parens, find out how much of the subject matches from the beginning. Again, if the "random" string is much smaller than the matched portion, you can be relatively certain of success.

Ignore this if I'm totally off base, I haven't had my coffee yet.

--Jim
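Both checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sub names (`strip_trailing_parens`, `common_prefix_length`, `probably_same_subject`) and the 0.5 tail-to-prefix ratio are invented here and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drop a trailing "(...)" group and report what fraction of the
# original subject it accounted for.
sub strip_trailing_parens {
    my ($subject) = @_;
    (my $core = $subject) =~ s/\s*\([^()]*\)\s*$//;
    my $removed = length($subject) - length($core);
    return ($core, $removed / (length($subject) || 1));
}

# Number of leading characters two subjects have in common.
sub common_prefix_length {
    my ($x, $y) = @_;
    my $max = length($x) < length($y) ? length($x) : length($y);
    my $i = 0;
    $i++ while $i < $max && substr($x, $i, 1) eq substr($y, $i, 1);
    return $i;
}

# True when the unmatched tail is small relative to the shared prefix.
# $max_tail_ratio is a made-up tuning knob.
sub probably_same_subject {
    my ($x, $y, $max_tail_ratio) = @_;
    my $prefix = common_prefix_length($x, $y);
    return 0 unless $prefix;
    my $longest = length($x) > length($y) ? length($x) : length($y);
    return ($longest - $prefix) / $prefix <= $max_tail_ratio ? 1 : 0;
}

my ($core, $fraction) =
    strip_trailing_parens("You've Won a Million Dollars!!! (xstebxz)");
printf "core: %s (random part was %.0f%% of the subject)\n",
    $core, 100 * $fraction;
```

The prefix comparison needs at least two subjects to compare, so it fits a filter that remembers subjects it has already seen.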
Re: Matching non-words by Elgon (Curate) on Dec 06, 2001 at 21:09 UTC
Matts,

A thought just occurred to me: Whenever I get spam it always seems to have a huge number of exclamation marks in it: FREE PORN!!!! See Britney Naked!!! etc...

With this in mind, how about calculating the exclamation-mark-to-sentence ratio, rejecting mails with, say, more than one exclamation mark per sentence?

Tongue slightly in cheek,

Elgon

"A nerd is someone who knows the difference between a compiled and an interpreted language, whereas a geek is a person who can explain it cogently over a couple of beers" - Elgon
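In the same tongue-in-cheek spirit, the ratio might be computed something like this. The sub name `exclamation_ratio` is made up, and treating every run of `.`, `!` or `?` as the end of one sentence is a crude assumption, not a real sentence splitter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Exclamation marks per sentence for a piece of text.
sub exclamation_ratio {
    my ($text) = @_;

    # tr/// in scalar context returns how many characters it matched
    my $bangs = ($text =~ tr/!//);

    # count runs of sentence-ending punctuation; assume at least one sentence
    my $sentences = () = $text =~ /[.!?]+/g;
    $sentences ||= 1;

    return $bangs / $sentences;
}

for my $subject ("You've Won a Million Dollars!!!",
                 "See you at noon. Bring the report!") {
    printf "%-35s %.2f\n", $subject, exclamation_ratio($subject);
}
```

A mail would then be rejected when the ratio exceeds 1, per the suggestion above.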
Re: Matching non-words by theorbtwo (Prior) on Dec 07, 2001 at 01:54 UTC
It seems to me that it'd be a lot nicer/easier to throw out the last word of the subject line if it is either in parens or has more than three spaces before it. (I don't know how generally true it is, but I often see the unique part with a lot of spaces before it, hoping it won't show up in a limited-width field.)

Thanks,
James Mastros,
Just Another Perl Scribe
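A rough sketch of that rule, with a hypothetical sub name `strip_unique_suffix`. It assumes the unique part is a single whitespace-free token at the very end of the subject.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove the last word of a subject if it is wrapped in parentheses
# or preceded by more than three spaces; otherwise return it unchanged.
sub strip_unique_suffix {
    my ($subject) = @_;

    # case 1: last word in parens, e.g. "... Dollars!!! (xstebxz)"
    return $subject if $subject =~ s/\s*\(\S+\)\s*$//;

    # case 2: last word pushed right by four or more spaces
    $subject =~ s/ {4,}\S+\s*$//;
    return $subject;
}

for my $s ("Won a Million Dollars!!! (xstebxz)",
           "Won a Million Dollars!!!          xstebxz",
           "Lunch on Friday") {
    print "[", strip_unique_suffix($s), "]\n";
}
```

Note that a legitimate subject ending in a single parenthesised word would lose that word too, so the stripped form is better used for comparing subjects than for display.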