vacant has asked for the wisdom of the Perl Monks concerning the following question:
As part of a mad scheme to evaluate certain aspects of URLs, I have been trying to categorize "words" used in the URL as, for instance, random characters, nonsense words, well-known "words" (e.g. "redir"), near- and misspelled natural language words ("google"), 7337 5p33c4, run-together words ("perlmonks"), and so on.
I have had little success with heuristics such as vowel placement, character frequency, and the like, and I have not found anything that seems useful on the 'net. I think it would require something like fuzzy matching against a dictionary, a modified spelling checker, or some more sophisticated idea or combination of methods I haven't thought of. There must be some routines, or maybe a Perl module, or at least a few master's theses from MIT on the subject.
Any ideas greatly appreciated.
Added: Many thanks for the ideas (and future ones, if any). The experimentation is going to be interesting, at least.
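For anyone wanting to experiment with the dictionary-based ideas above, here is a minimal sketch in Python (used only for illustration; the same steps should port to Perl). It tries, in order, a list of well-known tokens, an exact dictionary hit, leet substitution, splitting into run-together words, and fuzzy matching with the standard `difflib` module. The word lists, the leet table, the 0.8 similarity cutoff, and the names `classify` and `segment` are all assumptions made up for this example, not a tested method.

```python
import difflib

# Hypothetical toy word lists; a real experiment would load a full dictionary.
WORDS = {"perl", "monks", "googol", "search", "speak", "leet", "mail", "news"}
KNOWN_TOKENS = {"redir", "cgi", "www", "img", "href"}

# Assumed leet substitutions: 0->o, 1->l, 3->e, 4->a, 5->s, 7->t.
LEET = str.maketrans("013457", "oleast")


def segment(token, words=WORDS):
    """Return a list of dictionary words that concatenate to token, or None."""
    # best[i] is a list of words covering token[:i], or None if no split exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in words:
                best[end] = best[start] + [piece]
                break
    return best[-1]


def classify(token, words=WORDS, known=KNOWN_TOKENS):
    """Assign a rough category to one URL token."""
    token = token.lower()
    if token in known:
        return "well-known"
    if token in words:
        return "word"
    # Undo leet substitutions and retry the dictionary.
    deleet = token.translate(LEET)
    if deleet != token and deleet in words:
        return "leet"
    # Two or more dictionary words glued together.
    parts = segment(token, words)
    if parts and len(parts) > 1:
        return "run-together"
    # Fuzzy match: any dictionary word with similarity ratio >= 0.8.
    if difflib.get_close_matches(token, sorted(words), n=1, cutoff=0.8):
        return "near-word"
    return "unknown"


if __name__ == "__main__":
    for t in ["redir", "google", "perlmonks", "1337", "xkqzvw"]:
        print(t, "->", classify(t))
```

The order of the checks matters: cheap exact lookups come first and the fuzzy match comes last, since with a large dictionary it is both the slowest step and the one most likely to produce false positives on random strings.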
Re: recognizing URL text
by Zaxo (Archbishop) on Sep 30, 2005 at 02:48 UTC

Re: recognizing URL text
by MidLifeXis (Monsignor) on Sep 30, 2005 at 17:25 UTC

Re: recognizing URL text
by StoneTable (Beadle) on Oct 01, 2005 at 00:51 UTC

Re: recognizing URL text
by toma (Vicar) on Oct 03, 2005 at 01:59 UTC