mwhiting has asked for the wisdom of the Perl Monks concerning the following question:
I stole some code from a previous post (http://www.perlmonks.org/?node_id=609166) about how to unaccent characters in a string. I don't want to use the Text::Unaccent module; I just want to drop in the simplified code suggested by salva in that thread. Here's the code I'm using, modified from his:
    my %table = (
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'Ç' => 'C',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'Ñ' => 'N',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ß' => 'ss',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ý' => 'y'
    );

    $str = "Les Misérables";
    $str =~ s/([^\x00-\x7F])/$table{'$1'} || '?'/ge;
    print "str:$str<br>";

Output is:

    str:Les Mis?rables

I eliminated the subroutine and the 'shift' command that he had in his code. The code seems to notice that the character is in the right hex range, but it doesn't find the character to replace with.
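One plausible cause of the failed lookup is the single-quoted hash key: in Perl, `'$1'` is the literal two-character string `$1`, not the captured text, so `$table{'$1'}` never matches any entry. The sketch below isolates that behavior. The one-entry `%table` and the ASCII test string are stand-ins chosen so the example does not depend on the file's encoding; they are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical one-entry table with an ASCII key, so the result does not
# depend on how this source file is encoded.
my %table = ( x => 'X' );

my $input = "axb";

# Single-quoted key: looks up the literal string '$1', which is not in the table.
(my $wrong = $input) =~ s/(x)/$table{'$1'} || '?'/ge;

# Unquoted key: looks up the text captured by the parentheses.
(my $right = $input) =~ s/(x)/$table{$1} || '?'/ge;

print "wrong: $wrong\n";   # wrong: a?b
print "right: $right\n";   # right: aXb
```

If this is what is happening, writing `$table{$1}` in the substitution should let the lookup see the matched character, provided the table keys and the string use the same encoding.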
A second problem: when I run this code on a string from the data file I will actually be using it on, the accented character is replaced with two question marks, as in: Les Mis??rables. I have also seen the accented e turn into two characters with other methods I have tried. Is this something to do with Unicode, where more than one byte is used to represent a single character?
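Two question marks are consistent with the data being UTF-8 that has not been decoded: "é" is stored as the two bytes `0xC3 0xA9`, and each byte falls outside `\x00-\x7F`, so the pattern matches twice. The following is a minimal sketch under that assumption (the file's real encoding is not stated above). The `unaccent` sub and the two-entry table are illustrative only, with keys written as escapes to keep the example independent of the source file's encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the data file is UTF-8. "\xC3\xA9" is the two-byte UTF-8
# encoding of U+00E9 (e with acute accent).
my $bytes = "Les Mis\xC3\xA9rables";

# Hypothetical short table keyed by characters (code points), not bytes.
my %table = ( "\x{E9}" => 'e', "\x{C9}" => 'E' );

# Illustrative helper: replace each non-ASCII character via the table,
# or with '?' when there is no entry.
sub unaccent {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7F])/$table{$1} || '?'/ge;
    return $str;
}

# Undecoded: the regex sees two separate non-ASCII bytes, neither in the table.
my $on_bytes = unaccent($bytes);

# Decoded: the regex sees one character, which the table contains.
my $on_chars = unaccent( decode('UTF-8', $bytes) );

print "bytes: $on_bytes\n";   # bytes: Les Mis??rables
print "chars: $on_chars\n";   # chars: Les Miserables
```

When reading from a file, opening the handle with an encoding layer such as `open(my $fh, '<:encoding(UTF-8)', $path)` performs the same decoding on input. For the table literals to be treated as characters rather than bytes, a UTF-8 source file would also need `use utf8;`.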
Thanks! Michael
Replies are listed 'Best First'.

Re: Unaccenting characters
by moritz (Cardinal) on Aug 28, 2013 at 19:06 UTC
by mwhiting (Beadle) on Aug 29, 2013 at 16:37 UTC
by moritz (Cardinal) on Aug 29, 2013 at 17:36 UTC

Re: Unaccenting characters
by choroba (Cardinal) on Aug 28, 2013 at 16:59 UTC
by mwhiting (Beadle) on Aug 29, 2013 at 16:41 UTC

Re: Unaccenting characters
by Corion (Patriarch) on Aug 28, 2013 at 20:53 UTC

Re: Unaccenting characters
by Laurent_R (Canon) on Aug 28, 2013 at 22:17 UTC
by choroba (Cardinal) on Aug 28, 2013 at 22:33 UTC
by Laurent_R (Canon) on Aug 29, 2013 at 00:23 UTC
by mwhiting (Beadle) on Aug 29, 2013 at 16:35 UTC