in reply to Is there a module to break a string into proper words?

Just take a dictionary for your target language and break the "main" part of the domain into words.

Keep in mind that a domain might break into words in more than one way:

penisland.com
findtherapist.com

Re^2: Is there a module to break a string into proper words?
by Anonymous Monk on Dec 29, 2010 at 08:24 UTC
    Could you share some code or an algorithm? I have no idea how to do that. And you're right, there could always be more than one way to break up the domain name, so the function could return an array instead of a scalar.
      1. Take all your words, sort them.
      2. Look at the words. If a word matches the left side of the domain, output the word and remove that part from the left side of the domain.
      3. Repeat
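
A minimal sketch of the three steps above. The tiny hard-coded word list stands in for a real dictionary, the name split_greedy is made up for illustration, and sorting longest-first is only one way to read step 1.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real dictionary, sorted longest word first.
my @words = sort { length $b <=> length $a or $a cmp $b }
            qw(this is my domain do main know thyself);

# Strip one dictionary word at a time from the left side of $rest.
# Returns the words found, or an empty list if it gets stuck.
sub split_greedy {
    my ($rest) = @_;
    my @found;
    WORD: while (length $rest) {
        for my $word (@words) {
            if (index($rest, $word) == 0) {          # word matches the left side
                push @found, $word;
                substr($rest, 0, length $word, '');  # remove it from the left
                next WORD;
            }
        }
        return;    # no word fits here, and this version never goes back
    }
    return @found;
}

print join(' ', split_greedy('thisismydomain')), "\n";
```

Because it always commits to the first word that fits, this version gives up as soon as one choice leads nowhere.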

      If you want to extend that approach to find more than one way of splitting the domain, you will need to remember each point where you decided on a word and go back there to try another word. Recursion is a good tool for that.
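
One way the recursive idea might look, as a sketch only. The %is_word hash is a hypothetical stand-in for a real dictionary, and all_splits is an illustrative name. It returns every way to cover the whole string with dictionary words, each as an array reference.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real dictionary, as a hash for quick lookup.
my %is_word = map { $_ => 1 } qw(this is my domain do main);

# Return every way to cover $rest completely with dictionary words.
# Each result is an array ref of words.
sub all_splits {
    my ($rest) = @_;
    return ([]) if $rest eq '';          # nothing left: one (empty) solution
    my @solutions;
    for my $len (1 .. length $rest) {
        my $head = substr $rest, 0, $len;
        next unless $is_word{$head};     # not a word: try a longer prefix
        # Decide on $head and recurse on the remainder. When the call
        # returns, the loop moves on to the next prefix: the "going back".
        push @solutions, [ $head, @$_ ] for all_splits(substr $rest, $len);
    }
    return @solutions;
}

print join(' ', @$_), "\n" for all_splits('thisismydomain');
```

With this word list it prints 'this is my do main' and 'this is my domain'. The plain recursion re-examines the same tails repeatedly, so for long strings and a large dictionary you would probably want to cache the results per remainder.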

        I have no experience in this area, but this sounds like a good plan. If you can get a dictionary with word frequency indications, you could get Perl to find all the word combinations that can cover a given URL, and then choose the most likely one based on word frequencies.
        Word frequencies could also be established by comparing dictionaries. Say, find a small dictionary with 1,000 words, a medium-sized dictionary (10,000) and a big one (100,000). Every word that's in all 3 gets 3 points, every word that's in 2 of the 3 gets 2 points, and every word that's in only 1 gets 1 point. Then pick the solution with the highest points-per-word average.
        For English, you can find premade and much more granular word frequency lists as well. Here's one: http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2005/10/1-10000
        And here are a couple more: http://www.wordfrequency.info/
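
A rough sketch of the three-dictionary scoring idea, for illustration only. The three short lists are hypothetical stand-ins for the small, medium and big dictionaries, the candidate splits are typed in by hand, and score is a made-up helper name.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical stand-ins for the small, medium and big dictionaries.
my @dictionaries = (
    [qw(is my this)],
    [qw(is my this do domain)],
    [qw(is my this do domain main)],
);

# A word earns one point for every dictionary that lists it.
my %points;
for my $dict (@dictionaries) {
    $points{$_}++ for @$dict;
}

# Average points per word for one candidate split (a non-empty array ref).
sub score {
    my ($split) = @_;
    return sum(map { $points{$_} // 0 } @$split) / @$split;
}

my @candidates = ( [qw(this is my domain)], [qw(this is my do main)] );
my ($best) = sort { score($b) <=> score($a) } @candidates;
print "@$best\n";
```

Here 'this is my domain' averages 2.75 points per word against 2.4 for 'this is my do main', so the first split wins. A real frequency list would slot in the same way, with the frequency rank or count replacing the 1 to 3 points.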

        Other rules for dealing with ambiguities could be established based on the actual data: log all the ambiguous URLs in a separate file and have a look at them, then devise rules as needed.

      What's a proper word? What's a word? Working with natural language is always a lot of fun.

      Below is a very naive approach to the problem. It uses Grady Ward's Moby Word list. It does not provide alternate parsings of a string. It also just silently skips over any character or string of characters that is not a 'word'.

      In the code...

      • The word list file is read in the while loop, and any word that contains no character improper to a URL is added to the @words array. (I don't know what's proper and improper in a URL, so this is just an example.)
      • The words are sorted by the
            @words = reverse sort @words;
        statement so that, of two words sharing a prefix, the longer one comes first in the array. This causes the subsequent regex alternation to try longer words first, so 'thisismydomain' parses as 'this' 'is' 'my' 'domain' and not 'this' 'is' 'my' 'do' 'main'.
      • Moby Words includes all single letters and a bunch of letter pairs (e.g., chemical element symbols) as words, so 'proper' groups of one- and two-letter words are defined, as well as specific words to ignore ('ism' and 'ismy' appear in the word list file and interfere with parsing out 'is' 'my'). The @words array is then further massaged to exclude unwanted stuff.
      • The massaged @words array is compiled into a huge regex alternation, and the final regex is used to parse out 'proper' words.
      Enjoy.

      >perl -wMstrict -le
      "my @words;
       ;;
       my $fname = '../../../../moby/mwords/354984si.ngl';
       open my $fh, '<', $fname or die qq{opening '$fname': $!};
       while (<$fh>) {
         chomp;
         next if m{ [^[:alnum:]-] }xms;
         push @words, $_;
         }
       close $fh;
       ;;
       @words = reverse sort @words;
       my $ok_one_letter = qr{\A [ai] \z}xmsi;
       my $ok_two_letter = qr{\A (?: be | my | is | at | do) \z}xmsi;
       my $ignore        = qr{\A (?: ism | ismy) \z}xmsi;
       my $ok_other      = qr{\A (?! $ignore) .{3,} \z}xmsi;
       @words = grep {
           $_ =~ $ok_one_letter || $_ =~ $ok_two_letter || $_ =~ $ok_other
           } @words;
       my $words = join '|', @words;
       $words = qr{ $words }xms;
       ;;
       print '---------------';
       for my $string ('www.thisismydomain.com', @ARGV) {
         my @chunks = $string =~ m{ $words }xmsog;
         printf qq{'$string' ->};
         printf qq{ '$_'} for @chunks;
         printf qq{\n};
         }
       "
       www.knowthyself.net  kXnowthyXself.net
      ---------------
      'www.thisismydomain.com' -> 'this' 'is' 'my' 'domain' 'com'
      'www.knowthyself.net' -> 'know' 'thyself' 'net'
      'kXnowthyXself.net' -> 'nowt' 'self' 'net'

      Note: 'nowt' is short for 'nothing' or maybe 'naught'. Update: Actually, dict.org says 'nowt' means "Neat cattle", whatever that is (or those are).
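
To see why the reverse sort matters, here is a small demonstration with a hypothetical six-word list instead of Moby Words. A Perl regex alternation takes the first alternative that matches at a given position, not the longest one, so the order of the words decides whether 'do' or 'domain' wins.

```perl
use strict;
use warnings;

my @list = qw(do domain is main my this);   # hypothetical tiny word list

# Build the alternation once in plain sorted order, once reverse sorted.
for my $order ([ sort @list ], [ reverse sort @list ]) {
    my $alternation = join '|', @$order;
    my @chunks = 'thisismydomain' =~ m{ $alternation }xmsg;
    print "@chunks\n";
}
```

The first line printed is 'this is my do main' (plain sort puts 'do' ahead of 'domain'), the second is 'this is my domain'.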

        This looks very promising. Is it possible to expand it so that it returns an array instead? E.g.:

          thisismydomain ->
            this is my domain
            this is my do main
            his is my domain
            his is my do main