in thread Is there a module to break a string into proper words?

  1. Take all your words and sort them.
  2. Go through the words. If a word matches the left end of the domain, output the word and remove that part from the left end of the domain.
  3. Repeat until nothing is left of the domain.

If you want to extend that approach to allow for multiple possible words, you will need to remember where you decided on one word, so that you can go back there and decide on another word. Recursion is a good tool for that.
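
A minimal sketch of that recursion, not a finished solution. The word list in %dict and the sub name segment are made up for illustration; a real list would be loaded from a dictionary file. Each recursive call handles the rest of the string, and returning from it is the "go back and decide on another word" step.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my %dict = map { $_ => 1 } qw(no now here where nowhere);

# Return every way to split $str into dictionary words.
# Each result is an array reference holding the words in order.
sub segment {
    my ($str) = @_;
    return ([]) if $str eq '';              # nothing left: one valid, empty split
    my @results;
    for my $len (1 .. length($str)) {
        my $head = substr($str, 0, $len);
        next unless $dict{$head};           # not a word: try a longer prefix
        # Split the remainder, then come back here to try the next prefix.
        for my $tail (segment(substr($str, $len))) {
            push @results, [ $head, @$tail ];
        }
    }
    return @results;
}

print join(' ', @$_), "\n" for segment('nowhere');
```

With this word list, 'nowhere' yields three splits: "no where", "now here" and "nowhere". A string that cannot be covered completely yields an empty list.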

Re^4: Is there a module to break a string into proper words?
by elef (Friar) on Dec 29, 2010 at 11:53 UTC
    I have no experience in this area, but this sounds like a good plan. If you can get a dictionary with word frequency information, you could get Perl to find all the word combinations that can cover a given URL, and then choose the most likely one based on word frequencies.
    Word frequencies could also be established by comparing dictionaries. Say, find a small dictionary with 1,000 words, a medium-sized dictionary (10,000) and a big one (100,000). Every word that's in all 3 gets 3 points, all words that are in 2 of the 3 get 2 points, and all words that are in only 1 get 1 point. Then pick the solution with the highest points-per-word average.
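
    A rough sketch of that point scheme, untested against real dictionaries. The three tiny word lists and the candidate splits are made-up stand-ins, and the sub name score is just for illustration. It assumes each dictionary lists a word at most once.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the small, medium and large dictionaries.
my @dictionaries = (
    [qw(no now here)],                  # "small"
    [qw(no now here where)],            # "medium"
    [qw(no now here where nowhere)],    # "large"
);

# One point for every dictionary that lists the word.
my %points;
for my $dict (@dictionaries) {
    $points{$_}++ for @$dict;
}

# Average points per word; words found in no dictionary count as zero.
sub score {
    my @words = @_;
    return 0 unless @words;
    my $total = 0;
    $total += $points{$_} // 0 for @words;
    return $total / @words;
}

# Candidate splits of "nowhere", assumed to come from an earlier step.
my @candidates = ([qw(no where)], [qw(now here)], [qw(nowhere)]);
my ($best) = sort { score(@$b) <=> score(@$a) } @candidates;
print "@$best\n";
```

    With these lists, "now here" averages 3 points, "no where" 2.5 and "nowhere" 1, so "now here" is picked. Note that an average like this favours splits made of common short words, which may or may not be what you want.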
    For English, you can find premade and much more granular word frequency lists as well. Here's one: http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2005/10/1-10000
    And here are a couple more: http://www.wordfrequency.info/

    Other rules for dealing with ambiguities could be established based on the actual data: log all the ambiguous URLs in a separate file and have a look at them, then devise rules as needed.