vacant has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks:

As part of a mad scheme to evaluate certain aspects of URLs, I have been trying to categorize "words" used in the URL as, for instance, random characters, nonsense words, well-known "words" (e.g. "redir"), near- and misspelled natural language words ("google"), 7337 5p33c4, run-together words ("perlmonks"), and so on.

I have had little success with such means as vowel placement, character frequency, and the like. Neither have I been able to find anything that seems useful on the 'net. I think it would require something like fuzzy matching to a dictionary, or some modification of a spelling checker, or some more sophisticated idea or combination of methods I haven't thought of. There must be some routines, or maybe a Perl module, or at least a few masters' theses from MIT on the subject.

Any ideas greatly appreciated.

Added: Many thanks for the ideas (and future ones, if any). The experimentation is going to be interesting, at least.

Replies are listed 'Best First'.
Re: recognizing URL text
by Zaxo (Archbishop) on Sep 30, 2005 at 02:48 UTC

    A single word makes too small a sample, but for larger runs of text the frequency of coincident characters (the index of coincidence) is a guide. I think Kahn's popular book on crypto describes the algorithm and its uses. It can detect key length for polyalphabetic cyphers, and can often identify the language of the plaintext.
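    The coincidence-counting idea above can be sketched as follows. This is a minimal illustration of the standard index-of-coincidence formula, not code from the thread; English prose typically scores near 0.066, while uniformly random letters score about 0.038.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index of coincidence: the probability that two characters drawn
# at random from the text are the same letter.
sub index_of_coincidence {
    my ($text) = @_;
    my @chars = grep { /[a-z]/ } split //, lc $text;
    my $n = @chars;
    return 0 if $n < 2;
    my %count;
    $count{$_}++ for @chars;
    my $sum = 0;
    $sum += $_ * ($_ - 1) for values %count;
    return $sum / ($n * ($n - 1));
}

printf "%.4f\n",
    index_of_coincidence("the quick brown fox jumps over the lazy dog");
```

    As Zaxo notes, single words are far too small a sample for this statistic to mean much; it only becomes informative over longer runs of text.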

    Your character frequency and vowel placement ideas seem good to me. They may also be suffering from small sample size.

    As it is, you might try tr// of 1337 digits for words matching /\d/, minimum Text::Levenshtein distance from words in a wordlist, and inclusion (through index or m//i) of words in the wordlist. Those will be time-consuming, but I don't see a way around that.
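    The tr// and Levenshtein suggestions above might look like this in practice. The leet mapping and wordlist here are illustrative stand-ins, and the edit distance is inlined in pure Perl so the sketch is self-contained; Text::Levenshtein on CPAN provides an equivalent distance() if you would rather not roll your own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# A toy wordlist; a real run would use a full dictionary.
my @wordlist = qw(monk perl google speech redirect);

# Map common 1337 digits back to letters with tr///.
sub deleet {
    my ($w) = @_;
    (my $t = lc $w) =~ tr/013457/oleast/;   # 0=>o 1=>l 3=>e 4=>a 5=>s 7=>t
    return $t;
}

# Classic dynamic-programming edit distance.
sub levenshtein {
    my ($s, $t) = @_;
    my @d = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @e = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push @e, min($d[$j] + 1, $e[$j-1] + 1, $d[$j-1] + $cost);
        }
        @d = @e;
    }
    return $d[-1];
}

# Minimum distance from a token to any word in the list.
sub min_distance {
    my ($word) = @_;
    return min(map { levenshtein($word, $_) } @wordlist);
}

my $token = deleet("5p33c4");
printf "%s: distance %d\n", $token, min_distance($token);
```

    As the reply warns, scanning every token against a whole wordlist this way is time-consuming; a cheap index or m//i inclusion test first can cut down how often the distance computation runs.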

    Added: To recognise language-like nonwords, you could check that all adjacent pairs of characters are high-probability ones in the language.
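    The adjacent-pair check could be sketched like this. The training list is a toy stand-in; a real classifier would train bigram probabilities on a full dictionary and score tokens by likelihood rather than by simple presence, since with a tiny sample even real words will fail the test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record every adjacent character pair seen in a training sample.
my @training = qw(the quick brown fox jumps over lazy dog monks wisdom);

my %seen;
for my $w (@training) {
    $seen{ substr($w, $_, 2) }++ for 0 .. length($w) - 2;
}

# A token is "language-like" if all of its adjacent pairs
# occurred somewhere in the training words.
sub looks_language_like {
    my ($word) = @_;
    for my $i (0 .. length($word) - 2) {
        return 0 unless $seen{ substr($word, $i, 2) };
    }
    return 1;
}
```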

    After Compline,
    Zaxo

Re: recognizing URL text
by MidLifeXis (Monsignor) on Sep 30, 2005 at 17:25 UTC

    The crack dictionary generator engine might be able to help you here. Instead of having the password generator use crypt, md5, or sha1, code up an "encryptor" that is an identity function: f(x) == x.

    The last time I used it (I was a security admin, it was my job :), it had a pretty decent l33t sp34k module, as well as concatenation, misspellings, and other useful rules. Additionally, if you want to generate your own rules, the ruleset language is pretty easy to learn and use.

    --MidLifeXis

Re: recognizing URL text
by StoneTable (Beadle) on Oct 01, 2005 at 00:51 UTC
    You can use String::Approx to do some fuzzy matching. You'll still need a dictionary of good words, but it does seem to work fairly well.
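    A minimal use of String::Approx's amatch might look like this. The wordlist is illustrative, and the ["1"] modifier (allow at most one edit) is one of the module's documented approximateness settings; note that amatch does agrep-style matching, so the pattern may also match approximately inside a longer input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use String::Approx qw(amatch);

# Fuzzy-match a token against a small wordlist, allowing one edit.
my @words = qw(google redirect monks perl);
my @hits  = amatch("gogle", ["1"], @words);
print "@hits\n";
```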
Re: recognizing URL text
by toma (Vicar) on Oct 03, 2005 at 01:59 UTC
    The tuple approach discussed in Some kind of fuzzy logic may be useful: certain tuples probably occur only in certain types of URLs.
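    Extracting character tuples from a token is straightforward; a minimal sketch of the idea (not code from the linked node) is below. Comparing tuple counts gathered from known-good and known-bad URL samples would then show which tuples are peculiar to one category.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count overlapping character n-tuples in a string.
sub tuples {
    my ($text, $n) = @_;
    my %count;
    $count{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
    return %count;
}

my %t = tuples("perlmonks", 3);
print join(" ", map { "$_=$t{$_}" } sort keys %t), "\n";
```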

    It should work perfectly the first time! - toma