Hello folks

I need a more reliable way to guess the language of strings as short as book titles. I've just tried Text::Language::Guess, but its results are quite unreliable in tests with short strings.

I've noticed this module seems to rely mostly on articles to make its guess. What I need is full dictionary recognition in Perl for the six languages involved (fr, it, es, en, de, pt).

So, I have two major issues to overcome:
1) Where to find, or how to build, reliable and clean word lists for each of the languages. I thought about web crawling using Google's language restrictions, but the results contain a lot of company names, product names, and people's names; in short, lots of garbage mixed in.

2) Once I have these clean word lists stored in separate .txt files, how should I do the matching?
a) Load each list into its own array and grep each of the title's words against them?
b) Load each list into one delimited string, e.g. "nous, lui, elle, parler, ", and match each of the title's words against it with =~ m//?
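
To make the question concrete, here is a minimal sketch of the kind of matching I have in mind. It is a variation on (a) that uses one hash per language instead of grepping arrays, so each word lookup is a single exists test. The tiny inline word lists, the %wordlist and %dict names, and the score_title and guess_language subs are all made up for illustration; the real lists would be read from the .txt files. It also assumes the titles are already decoded, so that \w matches accented letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, tiny word lists standing in for the real per-language .txt files.
my %wordlist = (
    en => [qw(cutting edge edges the of and book)],
    fr => [qw(nous lui elle parler le la livre)],
    es => [qw(el la los libro hablar y de)],
);

# Build one lookup hash per language: $dict{$lang}{$word} = 1.
my %dict;
for my $lang (keys %wordlist) {
    $dict{$lang}{ lc $_ } = 1 for @{ $wordlist{$lang} };
}

# Count, for each language, how many words of the title are in its dictionary.
sub score_title {
    my ($title) = @_;
    my %score = map { $_ => 0 } keys %dict;
    my @words = lc($title) =~ /\w+/g;    # assumes already-decoded text
    for my $word (@words) {
        for my $lang (keys %dict) {
            $score{$lang}++ if exists $dict{$lang}{$word};
        }
    }
    return \%score;
}

# Return the best-scoring language code, or undef when nothing matched
# or when the top two languages are tied.
sub guess_language {
    my ($title) = @_;
    my $score  = score_title($title);
    my @ranked = sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score;
    return undef if $score->{ $ranked[0] } == 0;
    return undef if @ranked > 1 && $score->{ $ranked[0] } == $score->{ $ranked[1] };
    return $ranked[0];
}

print guess_language('Cutting Edges') // 'unknown', "\n";    # prints "en"
```

With these toy lists, a word like 'la' that exists in both the fr and es lists produces a tie and comes back as undef, which is exactly the ambiguity I would expect on very short titles.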

Text::Language::Guess's article-based guessing is not enough, because you can have titles like 'Cutting Edges' that don't happen to contain any articles or pronouns. You just have to know that 'cutting' and 'edge(s)' are English, and that's all.

I look forward to your help!

Thanks

André

In reply to Idiom guessing script by Andre_br
