What's a proper word? What's a word? Working with natural language is always a lot of fun.
Below is a very naive approach to the problem. It uses Grady Ward's Moby Word list. It does not provide alternate parsing of a string. It also just silently skips over a character or string of characters that is not a 'word'.
In the code...
- The word list file is read in the while loop, and any word that does not contain a character improper to a URL (here, anything other than an alphanumeric or a hyphen) is added to the @words array. (I don't know what's proper and improper in a URL, so this is just an example.)
- The statement @words = reverse sort @words; sorts the words in descending order, so a longer word comes before any shorter word that is a prefix of it. Regex alternation tries its alternatives left to right and takes the first one that matches, so this ordering makes the subsequent alternation match the longer word first: 'thisismydomain' parses as 'this' 'is' 'my' 'domain' and not 'this' 'is' 'my' 'do' 'main'.
- Moby Words includes all single letters and a bunch of letter pairs (e.g., chemical element symbols) as words, so the @words array is further massaged to exclude unwanted stuff: only 'proper' one- and two-letter words are kept, and specific words are ignored ('ism' and 'ismy' appear in the word list file and interfere with parsing out 'is' 'my').
- The massaged @words array is compiled into a huge regex alternation and the final regex is used to parse out 'proper' words.
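To see why the sort order matters without needing the Moby file, here is a minimal sketch. It uses a tiny made-up word list, and the helper name parse_with is hypothetical (it is not part of the code in this post). The same string is parsed twice, once with the alternation built in ascending order and once in reverse-sorted order.

```perl
use strict;
use warnings;

# Hypothetical miniature word list standing in for the Moby file.
my @words = qw(do domain is main my this);

# Build an alternation from the given words, in the given order,
# and return every match found in $string.
sub parse_with {
    my ($string, @list) = @_;
    my $alt = join '|', map { quotemeta } @list;
    my $re  = qr{ (?: $alt ) }xms;
    return $string =~ m{ ($re) }xmsg;
}

my @ascending  = parse_with('thisismydomain', sort @words);
my @descending = parse_with('thisismydomain', reverse sort @words);

print "ascending:  @ascending\n";   # prints: ascending:  this is my do main
print "descending: @descending\n";  # prints: descending: this is my domain
```

In ascending order 'do' is tried before 'domain' and wins, because alternation stops at the first alternative that matches. The reverse sort puts 'domain' ahead of 'do', which is all the 'longest first' behaviour amounts to here.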
Enjoy.
>perl -wMstrict -le
"my @words;
;;
my $fname = '../../../../moby/mwords/354984si.ngl';
open my $fh, '<', $fname or die qq{opening '$fname': $!};
while (<$fh>) {
chomp;
next if m{ [^[:alnum:]-] }xms;
push @words, $_;
}
close $fh;
;;
@words = reverse sort @words;
my $ok_one_letter = qr{\A [ai] \z}xmsi;
my $ok_two_letter = qr{\A (?: be | my | is | at | do) \z}xmsi;
my $ignore = qr{\A (?: ism | ismy) \z}xmsi;
my $ok_other = qr{\A (?! $ignore) .{3,} \z}xmsi;
@words =
grep { $_ =~ $ok_one_letter ||
$_ =~ $ok_two_letter ||
$_ =~ $ok_other
}
@words
;
my $words = join '|', @words;
$words = qr{ $words }xms;
;;
print '---------------';
for my $string ('www.thisismydomain.com', @ARGV) {
my @chunks = $string =~ m{ $words }xmsog;
printf qq{'$string' ->};
printf qq{ '$_'} for @chunks;
printf qq{\n};
}
" www.knowthyself.net kXnowthyXself.net
---------------
'www.thisismydomain.com' -> 'this' 'is' 'my' 'domain' 'com'
'www.knowthyself.net' -> 'know' 'thyself' 'net'
'kXnowthyXself.net' -> 'nowt' 'self' 'net'
Note: 'nowt' is short for 'nothing' or maybe 'naught'. Update: Actually, dict.org says 'nowt' means "Neat cattle", whatever that is (or those are).
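The last line of output shows the silent skipping mentioned at the top: 'kX', 'hyX' and '.' simply vanish. If you would rather see what was dropped, one possible variation (a sketch only, with a made-up word list and a hypothetical tokenize helper) is to walk the string with \G and collect the characters that no word matched.

```perl
use strict;
use warnings;

# Hypothetical miniature word list, reverse sorted as in the post.
my @words   = reverse sort qw(know net nowt self thyself);
my $alt     = join '|', map { quotemeta } @words;
my $word_re = qr{ (?: $alt ) }xms;

# Return the words found in $string, with each run of skipped
# characters reported in square brackets.
sub tokenize {
    my ($string) = @_;
    my @tokens;
    my $skipped = '';
    # At each position take a word if one matches there,
    # otherwise consume a single character as skipped.
    while ($string =~ m{ \G (?: ($word_re) | (.) ) }xmsg) {
        if (defined $1) {
            push @tokens, "[$skipped]" if length $skipped;
            $skipped = '';
            push @tokens, $1;
        }
        else {
            $skipped .= $2;
        }
    }
    push @tokens, "[$skipped]" if length $skipped;
    return @tokens;
}

print join(' ', tokenize('kXnowthyXself.net')), "\n";
# prints: [kX] nowt [hyX] self [.] net
```

The words found are the same as before; the bracketed pieces just make it obvious that 'nowt' was matched out of the middle of 'kXnowthy'.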