Idiom guessing script

Andre_br has asked for the wisdom of the Perl Monks concerning the following question:

Hello folks

I need to develop a more reliable way to guess the language of strings as short as book titles. I've just tried Text::Language::Guess, but its results are quite unreliable in tests with short strings.

I've noticed that this module relies mostly on articles to make its guess. What I need instead is to give Perl full dictionary recognition for the six languages involved (fr, it, es, en, de, pt).

So, I have two major issues to overcome:
1) Where to find, or how to build, reliable and clean word lists for each of the languages. I thought about web crawling using Google's language restrictions, but the problem is that the results contain a lot of company names, product names, and people's names; in short, lots of garbage in between.

2) Once I have these clean word lists stored in separate .txt files, how should I do the matching?
a) Load each list into its own array and grep each of the title's words against them?
b) Load each list into a delimited string, e.g. "nous, lui, elle, parler, ", and match each of the title's words against it with =~ m//?

Text::Language::Guess's article-based guessing is not enough, because you can have titles like 'Cutting Edges' that don't happen to contain any articles or pronouns. You just have to know that 'cutting' and 'edge(s)' are English, and that's all.

I wait for thy help then!

Thanks

André

Replies are listed 'Best First'.
Re: Idiom guessing script by cog (Parson) on Nov 21, 2005 at 08:42 UTC
`use Lingua::Identify qw/:all/; set_active_languages( qw/fr it es en de pt/ ); langof( { 'method' => [ 'ngrams2', 'ngrams3' ] }, $string );` See also: Lingua::Identify
Re: Idiom guessing script by Albannach (Monsignor) on Nov 21, 2005 at 04:07 UTC
It does not sound like a simple problem, because you do not have much data on which to base your decision. It strikes me that it may be possible to choose just a few hundred words from each potential language, words that are both commonly used and largely specific to that tongue. However, even this may not work for something like book titles, which do not necessarily follow common usage (in English at least). If you could get large word lists for the different languages (perhaps by sampling some major newspapers?) you could build your own list of such 'indicator words'. I would not keep the languages separate, but have each word in the list tagged with the language(s) it suggests; then you could take a poll of your title's words to get a guess at the language used.

On the chance that you are actually talking about book titles, perhaps it would help you to know that the ISBN issued for every book published starts with a code called the Group Identifier. While this is not necessarily a reliable indicator of the language, it may be of some use, perhaps to verify a language-based determination, or to help you select which language(s) to test against.

-- I'd like to be able to assign to an luser
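The polling idea in this reply might be sketched roughly as follows. This is only an illustration under stated assumptions: the `%indicator` table is a tiny hand-picked sample rather than a vetted word list, and the name `guess_language` is made up here.

```perl
use strict;
use warnings;

# Hypothetical, hand-picked sample: each indicator word is tagged with the
# language(s) it suggests. A real table would be derived from large corpora.
my %indicator = (
    the  => ['en'],
    and  => ['en'],
    of   => ['en'],
    der  => ['de'],
    und  => ['de'],
    les  => ['fr'],
    deux => ['fr'],
    due  => ['it'],
    il   => [ 'it', 'fr' ],
    el   => ['es'],
    dois => ['pt'],
    dos  => [ 'es', 'pt' ],
    de   => [ 'fr', 'es', 'pt' ],
    la   => [ 'fr', 'it', 'es' ],
);

# Return a list of [language, votes] pairs, best first. An empty list means
# that no word of the title was found in the indicator table.
sub guess_language {
    my ($title) = @_;
    my %votes;
    for my $word ( split /[^[:alpha:]]+/, lc $title ) {
        next unless exists $indicator{$word};
        my @langs = @{ $indicator{$word} };
        # A word shared by several languages splits its single vote among them.
        $votes{$_} += 1 / @langs for @langs;
    }
    return map  { [ $_, $votes{$_} ] }
           sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
}

my @ranked = guess_language('La guerre des deux roses');
printf "%s: %.2f\n", @$_ for @ranked;
```

With this sample table, a title such as 'Cutting Edges' gets no votes at all, which is the weakness of function-word lists that the original post describes; a larger table of content words would be needed to cover such titles.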
Re: Idiom guessing script by pileofrogs (Priest) on Nov 21, 2005 at 06:42 UTC
I agree with Albannach. This sounds like a very complex problem. Are you categorizing a known set of books, or does your script need to be able to tell the language of any possible book title? If it's a known set of books, then that could make this job easier. What's the actual problem you're trying to solve by figuring out the language of a book title? Maybe there's a better solution? There are bound to be books written in one language with titles in another, and there will be books with words in the title from two or more languages. Not to mention words that are part of multiple languages... Oh, yeah, and books with titles that are part of no language... If I had to do this, I'd think about writing a script to query Amazon or some other online list. Take advantage of a huge database someone else has set up.
Re: Idiom guessing script by swampyankee (Parson) on Nov 21, 2005 at 15:37 UTC
You may be able to get word lists from Open Office or ispell; I won't vouch for the completeness or accuracy of either. As noted by Albannach and pileofrogs, this is highly non-trivial. I've also been told -- by native speakers of Brazilian Portuguese -- that they could "get along" in Spanish, and by native speakers of (iirc, Puerto Rican) Spanish, that they could "be understood" by native Italian speakers (a confusing concept; Italy's regional dialects are alive, well, and not necessarily mutually comprehensible). All of this would seem to make unambiguous identification of a title as Italian, Spanish, or Portuguese impossible: the languages may well be too similar. Of course, identifying the language, by title, of books like Cervantes's Don Quixote, Orwell's 1984 or Burgess's M/F is impossible. And is that copy of Sagan's Bonjour Tristesse in French, or has the translator kept the title in French? What would you consider adequate reliability? emc
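On the matching question in the original post: if word lists such as the ispell ones were exported to plain text with one word per line, a hash per language would probably serve better than grepping an array or matching against one long string, since each lookup is then a single hash probe. A minimal sketch, assuming that file layout; the helper names `load_words` and `count_hits` and the small in-memory sample lists are made up for illustration.

```perl
use strict;
use warnings;

# Read one word per line from a filehandle into a hash used as a set.
sub load_words {
    my ($fh) = @_;
    my %set;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim stray whitespace
        next if $line eq '';
        $set{ lc $line } = 1;
    }
    return \%set;
}

# Count how many words of the title appear in each language's set.
sub count_hits {
    my ( $title, $lexicon ) = @_;
    my @words = grep { length } split /[^[:alpha:]]+/, lc $title;
    my %hits;
    for my $lang ( keys %$lexicon ) {
        # grep in scalar context yields the number of matching words
        $hits{$lang} = grep { exists $lexicon->{$lang}{$_} } @words;
    }
    return \%hits;
}

# Stand-ins for real word-list files; in-memory filehandles keep the sketch
# runnable without files on disk. Real lists with accented letters would need
# to be opened with an encoding layer such as '<:encoding(UTF-8)'.
my %sample = (
    en => "cutting\nedge\nedges\nopen\nbridge\nbig\ncat\n",
    fr => "deux\nchansons\nnous\nlui\nelle\nparler\n",
);

my %lexicon;
for my $lang ( keys %sample ) {
    open my $fh, '<', \$sample{$lang}
        or die "cannot open list for $lang: $!";
    $lexicon{$lang} = load_words($fh);
    close $fh;
}

my $hits = count_hits( 'Cutting Edges', \%lexicon );
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

Words that occur in several languages will score in each of them, so the counts are only evidence to weigh, not a verdict on their own.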
Re^2: Idiom guessing script by Your Mother (Archbishop) on Nov 21, 2005 at 19:28 UTC
Not to trivialize it, because it is (difficult|impossible)--I like cog's answer and I'm looking forward to having a reason to try that module--but written language is dramatically more predictable than spoken and there are many frequent and unique points in those languages. Consider: `due dois dos deux`
Re^3: Idiom guessing script by Andre_br (Pilgrim) on Nov 29, 2005 at 16:53 UTC
Hello folks, Thanks a lot for all the input. I've just tried Lingua::Identify, but it suffers from the same problem as the other module: it is simply not reliable for short strings. For example, it says "Big Cat" is Italian, "Deux chansons" is Italian, and "Open bridge" is German. So you can see how problematic it would be to use it. I've checked ispell, and it seems there are some word lists there that I may be able to use. I'll have to look closer, but at first glance they do not look as extensive as necessary. As for the ISBN idea, the problem is that I don't have the ISBNs for these books. And regarding querying other online databases, I don't think they can be trusted to have books in all languages; Amazon, at least, failed this test a few moments ago. I think I'll have to free the beast to crawl out into the world. (Wow, chill out, I'm not the messenger of the apocalypse! Just a metaphor! hahahah) If you guys think of something, please let me know. Take care, fellow monks André