Andre_br has asked for the wisdom of the Perl Monks concerning the following question:
I need a more reliable way to guess the language of strings as short as book titles. I've just tried Text::Language::Guess, but its results are quite unreliable in tests with short strings.
I've noticed this module relies mostly on articles to make its guess. What I need is full dictionary recognition in Perl for the six languages involved (fr, it, es, en, de, pt).
So, I have two major issues to overcome:
1) Where to find, or how to build, reliable and clean word lists for each of the languages. I thought about web crawling using Google's language restrictions, but the results contain a lot of company names, product names and people's names; in short, lots of garbage mixed in.
2) Once I have these clean word lists stored in separate .txt files, how should I do the matching?
a) Load each list into its own array and grep each of the title's words against them?
b) Load each list into a delimited string, e.g. "nous, lui, elle, parler, ", and match each of the title's words against it with =~ m//?
Text::Language::Guess's article-based guessing is not enough, because you can have titles like 'Cutting Edges' that don't happen to contain any articles or pronouns. You just have to know that 'cutting' and 'edge(s)' are English, and that's all.
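To make the matching question in 2) concrete, here is a minimal sketch of dictionary-based scoring. Note that it uses neither option (a) nor option (b): it swaps in one hash per language, so each title word costs one lookup per language instead of a scan over a whole list. The word lists, the %dict structure and the guess_language name are all made up for illustration; real lists would be read from the per-language .txt files.

```perl
use strict;
use warnings;

# Hypothetical, tiny word lists for illustration only. In practice each
# list would be loaded from its own .txt file, one word per line.
my %words_for = (
    en => [qw(cutting edge edges the of)],
    fr => [qw(nous lui elle parler le)],
    it => [qw(noi lui lei parlare il)],
    es => [qw(nosotros ella hablar el los)],
    pt => [qw(nos ele ela falar os)],
    de => [qw(wir sie sprechen der und)],
);

# Build one lookup hash per language: $dict{fr}{nous} = 1, and so on.
my %dict;
for my $lang (keys %words_for) {
    $dict{$lang}{ lc $_ } = 1 for @{ $words_for{$lang} };
}

# Count, for each language, how many words of the title appear in its
# dictionary, and return the language code with the highest count.
# Returns undef when no word of the title is found in any dictionary.
sub guess_language {
    my ($title) = @_;
    my @words = map { lc $_ } ( $title =~ /(\w+)/g );

    my %score;
    for my $word (@words) {
        for my $lang (keys %dict) {
            $score{$lang}++ if $dict{$lang}{$word};
        }
    }

    # Ties are broken alphabetically by language code, only so that the
    # result is deterministic; a real script might report the tie instead.
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    return $best;
}

print guess_language('Cutting Edges'), "\n";    # prints "en"
```

A word such as 'lui' appears in both the French and the Italian list here, which is the kind of overlap that makes very short titles ambiguous; counting hits per language at least lets the other words in the title settle the matter.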
I wait for thy help then!
Thanks
André
Re: Idiom guessing script
by cog (Parson) on Nov 21, 2005 at 08:42 UTC
Re: Idiom guessing script
by Albannach (Monsignor) on Nov 21, 2005 at 04:07 UTC
Re: Idiom guessing script
by pileofrogs (Priest) on Nov 21, 2005 at 06:42 UTC
Re: Idiom guessing script
by swampyankee (Parson) on Nov 21, 2005 at 15:37 UTC
by Your Mother (Archbishop) on Nov 21, 2005 at 19:28 UTC
by Andre_br (Pilgrim) on Nov 29, 2005 at 16:53 UTC