comment on

This is a far-from-ideal solution, but its quick (though dirty) and will generally work right. It doesn't work at all for characters that are outside the English character set, but will work for characters that are basically accented english characters. That being said, here it is:

In short, you strip all accents from user entered search terms and database entries (at least the database entries that are used for searching, leaving the "display" database entries alone). i.e. map Ä to A, å to a, etc....

So if someone tries to look up "Äbc" in your file, it would be converted to "Abc" before it tries to match on the file. Professionally I work with a lot of bands and find that that conversion is very handy. For example, most people will spell Moxy Früvous as "Moxy Fruvous" (without the umlaut over the u) when performing a search, so unless I did the conversion to both store and clean-down search terms by removing accents I would never find it.

Now this is my very nassty looking translation statment that will replace accented characters (iso-8859-1) with their non-accented ascii equivalents.

  $s=~tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\
+xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xDF\xE0\xE1\x
+E2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF1\xF2\xF3\xF
+4\xF5\xF6\xF8\xF9\xFA\xFB\xFC\xFD\xFF/\x41\x41\x41\x41\x41\x41\x41\x4
+3\x45\x45\x45\x45\x49\x49\x49\x49\x44\x4E\x4F\x4F\x4F\x4F\x4F\x4F\x55
+\x55\x55\x55\x59\x73\x61\x61\x61\x61\x61\x61\x61\x63\x65\x65\x65\x65\
+x69\x69\x69\x69\x6E\x6F\x6F\x6F\x6F\x6F\x6F\x75\x75\x75\x75\x79\x79/;
[download]

As I said.. dirty and imperfect, but quick and generally works right....

Les Howard
www.lesandchris.com
Author of Net::Syslog and Number::Spell

In reply to Re: Problem matching non-english chars by lhoward
in thread Problem matching non-english chars by Guano

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.