in reply to Re^3: Is there a module to break a string into proper words?
in thread Is there a module to break a string into proper words?

I have no experience in this area, but this sounds like a good plan. If you can get a dictionary with word frequency indications, you could have Perl find every sequence of dictionary words that exactly covers a given URL, and then choose the most likely one based on word frequencies.
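To make that concrete, here is a rough sketch of the idea (written in Python for brevity; a Perl version would be structured the same way). The frequency table is made up for illustration, and scoring a split by the sum of log relative frequencies is just one plausible choice, not the only one:

```python
from functools import lru_cache
from math import log

# Hypothetical toy frequency table (made-up counts); real numbers
# would come from a proper word frequency list.
FREQ = {
    "expert": 40, "experts": 30, "exchange": 60,
    "sex": 50, "change": 200, "ex": 5,
}
TOTAL = sum(FREQ.values())

def segmentations(text):
    """Return every way to split text into words found in FREQ."""
    @lru_cache(maxsize=None)
    def split_from(start):
        # One way to split the empty remainder: no words at all.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in FREQ:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results
    return split_from(0)

def score(words):
    """Sum of log relative frequencies (assumed scoring rule)."""
    return sum(log(FREQ[w] / TOTAL) for w in words)

def best_segmentation(text):
    """Highest-scoring split, or None if the text cannot be covered."""
    candidates = segmentations(text)
    return max(candidates, key=score) if candidates else None

print(segmentations("expertsexchange"))
print(best_segmentation("expertsexchange"))
```

With this toy table, "expertsexchange" has three possible splits, and the log-frequency score prefers "experts exchange". Because every extra word adds another negative term, this scoring also tends to favour splits with fewer words.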
Word frequencies could also be established by comparing dictionaries. Say, find a small dictionary with 1,000 words, a medium-sized one (10,000) and a big one (100,000). Every word that's in all 3 gets 3 points, every word that's in 2 of the 3 gets 2 points, and every word that's in only 1 gets 1 point. Then pick the solution with the highest points-per-word average.
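A minimal sketch of that point scheme, again in Python. The three sets below are tiny hypothetical stand-ins for the real dictionaries, and the candidate splits are assumed to have been generated already:

```python
# Hypothetical stand-ins for the small, medium and big dictionaries.
SMALL = {"be", "best", "art", "start"}
MEDIUM = SMALL | {"tart"}
LARGE = MEDIUM | {"bes"}

def word_points(word):
    """One point for each dictionary that contains the word (0 to 3)."""
    return sum(word in d for d in (SMALL, MEDIUM, LARGE))

def average_points(words):
    """Points-per-word average for one candidate split."""
    return sum(word_points(w) for w in words) / len(words)

def pick(candidates):
    """Candidate with the highest average; ties go to the first one listed."""
    return max(candidates, key=average_points)

candidates = [["best", "art"], ["be", "start"], ["bes", "tart"]]
for words in candidates:
    print(words, average_points(words))
```

Here "bes tart" averages 1.5 points while the other two splits both average 3.0, so the scheme rules out the split built from rare words but cannot separate the remaining two on its own.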
For English, you can find premade and much more granular word frequency lists as well. Here's one: http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2005/10/1-10000
And here are a couple more: http://www.wordfrequency.info/

Other rules for dealing with ambiguities could be established based on the actual data: log all the ambiguous URLs in a separate file and have a look at them, then devise rules as needed.
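Logging the ambiguous cases could look roughly like this (Python again). The function name, the tab-separated log format and the `margin` threshold are all assumptions made for the sake of the example:

```python
def choose_or_log(url, scored, log_path, margin=0.1):
    """Return the best-scoring split; log the URL if others are close.

    scored is a non-empty list of (score, words) pairs, higher is better.
    margin (hypothetical threshold) decides what counts as "too close".
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_score, best_words = ranked[0]
    # Every candidate whose score is within margin of the best one.
    close = [words for s, words in ranked if best_score - s <= margin]
    if len(close) > 1:
        with open(log_path, "a", encoding="utf-8") as log:
            options = " | ".join(" ".join(words) for words in close)
            log.write(url + "\t" + options + "\n")
    return best_words
```

Each logged line holds the URL and its competing splits, so the file can be reviewed by hand later to see which kinds of ambiguity actually occur.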