Re: Re: Finding dictionary words in a string.

What I meant was that the entire string "Hello" gets a better score than "HelloHowAreYou" because there are fewer words. But either of those gets a better score than "oijwfHellowoifef", for example.

There needs to be some tradeoff, which I'm not quite sure how to judge at the moment. For example, "ThisIsAStringThatGoesOnAndOnAndOnForever" and "perlxmonks" would probably have nearly equivalent scores: the first has no junk characters but a high word count, while the second has only a few junk characters but a low word count.



Re: Re: Re: Finding dictionary words in a string.
by tachyon (Chancellor) on Mar 13, 2004 at 19:57 UTC

No tradeoff is really required. Given the task, which is to look for 'good' URLs that match one or more words as closely as possible, you want to do something like this (pseudocode):

# get the domain part dropping www. and passing back the
# domain and tld (ie .com .net) or sld.tld (ie co.uk, com.au )
my ($domain, $tld) = get_domain( $url );

# chop domain into all possible substrings, say 3-16 chars long, and
# return an array ref; there are very few valid well-known words > 16
# chars, and virtually none > 24 chars
my $tokens = tokenize( $domain );

# get the possible words ordered by length(word) DESC, ie longest first
# use a hash lookup or an RDBMS with a dynamically generated SQL IN clause
my $words = get_real_words_from_dict( $tokens );

# substitute out the words; as we remove longest first
# we avoid finding substrings like 'be' in 'beetle'
my $residual = $domain;
my @words = ();
for my $word ( @$words ) {
    # we may have duplicates of same word
    push @words, $word while $residual =~ s/\Q$word\E//;
}

# remove '-' from residual so 'come-here' will be two words, no residual
$residual =~ s/-//g;

# work out % residual (leftover characters as a share of the domain length)
$residual = 100 * length($residual) / length($domain);

# So now we have our data
# scalar @words (0..N) is the number of words found
# $residual is the % residual on that domain name
# $tld is the top-level domain (or sld.tld)

# say we inserted into a DB table like:
CREATE TABLE urls (
    url CHAR(75),
    words INT,
    residual FLOAT,
    tld CHAR(10)
)

"INSERT INTO urls (url, words, residual, tld) VALUES (?,?,?,?)",
    $url, scalar @words, $residual, $tld
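
The word-finding steps above can be tried out with a minimal runnable sketch. It is only an illustration under stated assumptions: %dict is a tiny hypothetical stand-in for a real dictionary, score_domain() is a made-up wrapper name, and tokenize() takes explicit min/max lengths here, which the pseudocode does not specify.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real word list (assumption, not from the post).
my %dict = map { $_ => 1 } qw(hello how are you perl monks come here);

# Return an array ref of every distinct substring $min..$max chars long.
sub tokenize {
    my ( $domain, $min, $max ) = @_;
    my %seen;
    for my $len ( $min .. $max ) {
        last if $len > length($domain);
        for my $pos ( 0 .. length($domain) - $len ) {
            $seen{ substr( $domain, $pos, $len ) } = 1;
        }
    }
    return [ keys %seen ];
}

# Keep only the tokens found in the dictionary, longest first.
sub get_real_words_from_dict {
    my ($tokens) = @_;
    return [
        sort { length($b) <=> length($a) or $a cmp $b }
        grep { $dict{$_} } @$tokens
    ];
}

# Hypothetical wrapper: returns (number of words found, % residual).
sub score_domain {
    my ($domain) = @_;
    $domain = lc $domain;
    return ( 0, 100 ) unless length($domain);

    my $words    = get_real_words_from_dict( tokenize( $domain, 3, 16 ) );
    my $residual = $domain;
    my @found;
    for my $word (@$words) {
        # the same word may occur more than once
        push @found, $word while $residual =~ s/\Q$word\E//;
    }
    $residual =~ s/-//g;    # hyphens do not count as junk
    return ( scalar @found, 100 * length($residual) / length($domain) );
}

printf "%-20s words=%d residual=%.1f%%\n", $_, score_domain($_)
    for qw(hello hellohowareyou perlxmonks oijwfhellowoifef come-here);
```

With this toy dictionary, 'perlxmonks' yields two words and a 10% residual (the stray 'x'), while 'come-here' yields two words and no residual. One caveat of the substitute-and-delete approach is that removing a word joins the characters on either side of it, so a later, shorter word can occasionally match across the gap.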

You can now generate reports. Essentially you want something like:

SELECT url FROM urls
WHERE words >= 1
ORDER BY tld, words ASC, residual ASC

This does not apply limits or add a bias for, say, a preference for .com domain names. It outputs single-word URLs first, ordered from lowest to highest residual, then two-word URLs, and so on. Given what you want, if the residual is more than 10-20% you can probably just ignore those URLs and not insert them.
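
If you would rather prototype the ranking without a database, the same ordering and residual cutoff can be sketched in plain Perl. The rows, the rank_urls() name and the 20% cutoff below are all assumptions chosen for illustration; the cutoff is simply the upper end of the 10-20% range mentioned above.

```perl
use strict;
use warnings;

# Hypothetical rows shaped like the urls table: url, words, residual (%), tld.
my @rows = (
    { url => 'hellohowareyou.com',   words => 4, residual => 0,     tld => 'com' },
    { url => 'oijwfhellowoifef.net', words => 1, residual => 68.75, tld => 'net' },
    { url => 'perlxmonks.org',       words => 2, residual => 10,    tld => 'org' },
    { url => 'hello.com',            words => 1, residual => 0,     tld => 'com' },
    { url => 'xqzv.com',             words => 0, residual => 100,   tld => 'com' },
);

# Assumed cutoff: drop anything with more than this % of junk characters.
my $MAX_RESIDUAL = 20;

# Keep rows with at least one word and an acceptable residual, then order
# by fewest words first and lowest residual within the same word count.
sub rank_urls {
    my ( $max_residual, @candidates ) = @_;
    return map { $_->{url} }
        sort { $a->{words} <=> $b->{words} or $a->{residual} <=> $b->{residual} }
        grep { $_->{words} >= 1 and $_->{residual} <= $max_residual } @candidates;
}

print "$_\n" for rank_urls( $MAX_RESIDUAL, @rows );
```

This prints hello.com, then perlxmonks.org, then hellohowareyou.com; the high-residual and zero-word entries are filtered out before sorting.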

cheers

tachyon


