As part of a mad scheme to evaluate certain aspects of URLs, I have been trying to categorize the "words" used in a URL as, for instance, random characters, nonsense words, well-known "words" (e.g. "redir"), near-misses and misspellings of natural-language words ("google"), 1337 5p33ch, run-together words ("perlmonks"), and so on.
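For concreteness, here is the kind of pre-processing I have in mind before any classification can happen: split the URL into word-like tokens, then undo common leet substitutions. This is only a minimal sketch; the substitution table and the tokenizer rules are just guesses on my part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guessed table of common leet substitutions.
my %leet = (
    '0' => 'o', '1' => 'l', '3' => 'e', '4' => 'a',
    '5' => 's', '7' => 't', '@' => 'a',
);

# Split a URL into word-like tokens: drop the scheme, then break on
# anything that is not a letter or digit, keeping tokens of 3+ chars.
sub tokens {
    my ($url) = @_;
    $url =~ s{^\w+://}{};
    return grep { length > 2 } split /[^A-Za-z0-9]+/, $url;
}

# Undo leet substitutions character by character, then lowercase.
sub deleet {
    my ($word) = @_;
    $word =~ s/(.)/exists $leet{$1} ? $leet{$1} : $1/ge;
    return lc $word;
}

for my $t (tokens('http://www.g00gle-redir.example/p3rlm0nks')) {
    printf "%-12s -> %s\n", $t, deleet($t);
}
```

That turns "g00gle" into "google" and "p3rlm0nks" into "perlmonks", which still leaves the real problem of deciding what each normalized token actually is.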
I have had little success with such means as vowel placement, character frequency, and the like, nor have I been able to find anything that seems useful on the 'net. I think it would require something like fuzzy matching against a dictionary, some modification of a spelling checker, or some more sophisticated idea or combination of methods I haven't thought of. There must be some routines, or maybe a Perl module, or at least a few masters' theses from MIT on the subject.
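The fuzzy-dictionary idea might look something like the sketch below. I am assuming a word list at /usr/share/dict/words (the location varies by system) and the CPAN module Text::Levenshtein; the edit-distance threshold of 2 is an arbitrary guess, and the brute-force scan over the whole dictionary is far too slow for bulk use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Levenshtein qw(distance);    # CPAN

# Load a word list into a hash for exact lookups.
open my $fh, '<', '/usr/share/dict/words' or die "word list: $!";
chomp( my @words = <$fh> );
close $fh;
my %dict = map { lc($_) => 1 } @words;

# Classify a token: exact dictionary hit, near-miss (edit distance
# 1 or 2), or unknown (random characters, run-together words, ...).
sub classify {
    my ($word) = @_;
    $word = lc $word;
    return 'word' if $dict{$word};
    for my $cand (keys %dict) {
        # Edit distance is at least the difference in lengths,
        # so skip candidates that cannot possibly be within 2.
        next if abs(length($cand) - length($word)) > 2;
        return 'near-word' if distance($word, $cand) <= 2;
    }
    return 'unknown';
}

print "$_: ", classify($_), "\n" for qw(google gogle xkqzv redir);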
Any ideas greatly appreciated.
Added: Many thanks for the ideas (and future ones, if any). The experimentation is going to be interesting, at least.