Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm working on a domain-name-related project and am looking for a way to break, for example, www.thisismydomain.com into 'this is my domain'. Any thoughts?

Replies are listed 'Best First'.
Re: Is there a module to break a string into proper words?
by Corion (Patriarch) on Dec 29, 2010 at 08:18 UTC

    Just take a dictionary for your target language and break the "main" part of the domain into words.

    Consider that a domain might resolve into multiple words:

    penisland.com
    findtherapist.com
      Mind sharing some code or an algorithm? I have no idea how to do that. And you're right, there could always be more than one way to break the domain name, so the function can return an array instead of a scalar.
        1. Take all your words, sort them.
        2. Look at the words. If a word matches the left end of the domain, output the word and remove that part from the left end of the domain.
        3. Repeat
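        The three steps above could be sketched roughly as follows. This is only an illustration, not necessarily what was intended: the tiny @dictionary and the name greedy_split are made up for the example, and sorting longest first is one assumed way to decide which word wins when several match.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real dictionary file.
my @dictionary = qw(this is my domain do main know thyself);

# Step 1: sort, longest first, so 'domain' is tried before 'do'.
my @sorted = sort { length($b) <=> length($a) || $a cmp $b } @dictionary;

# Steps 2 and 3: repeatedly strip the first word that matches
# the left end of what remains.
sub greedy_split {
    my ($rest) = @_;
    my @out;
    WORD: while (length $rest) {
        for my $word (@sorted) {
            if (index($rest, $word) == 0) {
                push @out, $word;
                substr($rest, 0, length($word), '');
                next WORD;
            }
        }
        return;    # nothing fits: this simple version gives up
    }
    return @out;
}

print join(' ', greedy_split('thisismydomain')), "\n";
# prints: this is my domain
```

        Because it never reconsiders a choice, this version returns at most one parsing and fails outright if an early choice leaves an unparseable remainder.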

        If you want to extend that approach to allow more than one parsing, you will need to remember where you decided on one word and go back there to decide on another word. Recursion is a good tool for that.
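        A minimal recursive sketch of that "go back and decide on another word" idea might look like this. The %is_word hash and the name all_splits are assumptions for the example; a real word list would replace the handful of words shown.

```perl
use strict;
use warnings;

# Hypothetical mini dictionary; load a real word list here instead.
my %is_word = map { $_ => 1 } qw(this is my domain do main);

# Return every way to split $str into dictionary words,
# each parsing as one space-joined string.
sub all_splits {
    my ($str) = @_;
    return ('') if $str eq '';    # exactly one way to split nothing
    my @results;
    for my $len (1 .. length($str)) {
        my $head = substr($str, 0, $len);
        next unless $is_word{$head};
        # Recurse on the remainder; each returned tail completes $head.
        for my $tail (all_splits(substr($str, $len))) {
            push @results, $tail eq '' ? $head : "$head $tail";
        }
    }
    return @results;
}

print "$_\n" for all_splits('thisismydomain');
# prints:
# this is my do main
# this is my domain
```

        The running time can grow quickly for long strings with many short words, so caching the result for each remainder would be a sensible next step.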

        What's a proper word? What's a word? Working with natural language is always a lot of fun.

        Below is a very naive approach to the problem. It uses Grady Ward's Moby Word list. It does not provide alternate parsing of a string. It also just silently skips over a character or string of characters that is not a 'word'.

        In the code...

        • The word list file is read in the while loop, and any word that does not contain a character improper to a URL is added to the @words array. (I don't know what's proper and improper in a URL, so this is just an example.)
        • The words are sorted by the
              @words = reverse sort @words;
          statement so the longer of similar words is first in the array. This causes the subsequent regex alternation to match longest words first, so 'thisismydomain' parses as 'this' 'is' 'my' 'domain' and not 'this' 'is' 'my' 'do' 'main'.
        • Moby Words includes all single letters and a bunch of letter pairs (e.g., chemical element symbols) as words, so 'proper' groups of one- and two-letter words are defined, as well as specific words to ignore ('ism' and 'ismy' appear in the word list file and interfere with parsing out 'is' 'my'), and the @words array is further massaged to exclude unwanted stuff.
        • The massaged @words array is compiled into a huge regex alternation, and the final regex is used to parse out 'proper' words.
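        To see why the order of the alternation matters, here is a small self-contained sketch. The six-word list and the helper name chunks are made up for the illustration; the point is only that Perl tries alternatives left to right, so whichever matching word comes first in the list wins.

```perl
use strict;
use warnings;

my @words = qw(do main domain is my this);    # hypothetical mini word list

# Build an alternation from the list in the given order and
# return every match found in $string.
sub chunks {
    my ($string, @list) = @_;
    my $alt = join '|', map { quotemeta($_) } @list;
    return $string =~ m{ ($alt) }xmsg;
}

print join(' ', chunks('thisismydomain', sort @words)), "\n";
# prints: this is my do main
print join(' ', chunks('thisismydomain', reverse sort @words)), "\n";
# prints: this is my domain
```

        With a plain sort, 'do' precedes 'domain' and matches first; the reverse sort puts the longer word ahead of its own prefix.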
        Enjoy.

        >perl -wMstrict -le
        "my @words;
         ;;
         my $fname = '../../../../moby/mwords/354984si.ngl';
         open my $fh, '<', $fname or die qq{opening '$fname': $!};
         while (<$fh>) {
           chomp;
           next if m{ [^[:alnum:]-] }xms;
           push @words, $_;
           }
         close $fh;
         ;;
         @words = reverse sort @words;
         my $ok_one_letter = qr{\A [ai] \z}xmsi;
         my $ok_two_letter = qr{\A (?: be | my | is | at | do) \z}xmsi;
         my $ignore        = qr{\A (?: ism | ismy) \z}xmsi;
         my $ok_other      = qr{\A (?! $ignore) .{3,} \z}xmsi;
         @words = grep {
             $_ =~ $ok_one_letter || $_ =~ $ok_two_letter || $_ =~ $ok_other
             } @words;
         my $words = join '|', @words;
         $words = qr{ $words }xms;
         ;;
         print '---------------';
         for my $string ('www.thisismydomain.com', @ARGV) {
           my @chunks = $string =~ m{ $words }xmsog;
           printf qq{'$string' ->};
           printf qq{ '$_'} for @chunks;
           printf qq{\n};
           }
        " www.knowthyself.net kXnowthyXself.net
        ---------------
        'www.thisismydomain.com' -> 'this' 'is' 'my' 'domain' 'com'
        'www.knowthyself.net' -> 'know' 'thyself' 'net'
        'kXnowthyXself.net' -> 'nowt' 'self' 'net'

        Note: 'nowt' is short for 'nothing' or maybe 'naught'. Update: Actually, dict.org says 'nowt' means "Neat cattle", whatever that is (or those are).

Re: Is there a module to break a string into proper words?
by JavaFan (Canon) on Dec 29, 2010 at 14:43 UTC
    Why shouldn't thisismydomain break down to this is my do main?