utf-8 problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to read a file of special (accented) characters and then replace each of those characters in my string with its HTML equivalent. Here is my program:


use Encode;
use utf8;
#use open IO => ':locale';

#my $s = "El supersĂłnico de los Indi ";
my $s1 = "El supero de los Indi ";

#$s1 = decode_utf8( $s);

print "\n\nStart string: $s1\n\n";

my $s2 = &fix_special_characters($s1);

print"\nEnd string: $s2\n\n";


sub fix_special_characters
{
my($string) = @_;

open(C,"<:utf8","chars.txt");
my @c = <C>;

for(my $i=0; $i < @c; $i++)
{
        my ($special,$htmlchar) = split(/\t/,$c[$i]);
        print "$special : $htmlchar";
        $string =~ s/$special/$htmlchar/ig;     ## this is generating the error message
}

return $string;
}

This is the output when the substitution line (line 30) is commented out:

Start string: El supero de los Indi

Á    : &Aacute;
á    : &aacute;
É    : &Eacute;
é   : &eacute;
Í    : &Iacute;
í    : &iacute;
Ñ    : &Ntilde;
ñ    : &ntilde;
Ó   : &Oacute;
ó   : &oacute;
Ú    : &Uacute;
ú    : &uacute;
Ü    : &Uuml;
ü    : &uuml;
¿   : &iquest;
¡   : &iexcl;
End string: El supero de los Indi

However, when I uncomment that substitution line I get the following error message for every line in the char file:

Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in regexp compilation at sp.pl line 30, <C> line 16.
Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in regexp compilation at sp.pl line 30, <C> line 16.

I have spent hours trying different methods to make this work, with no luck. Can any monks out there help with this? Thank you.

Re: utf-8 problem by kennethk (Abbot) on Jan 29, 2009 at 22:07 UTC
Given that an essential part of your program is your chars.txt file, it would be helpful to post that file, particularly as when I execute your code as posted I do not get that error. I also note from your output formatting that, at the least, you have some newlines and spaces in your file that are not accounted for in your code. What happens when you substitute the following into your code?

    sub fix_special_characters {
        my($string) = @_;

        open(C,"<:utf8","chars.txt");
        my @c = <C>;
        chomp @c;

        for my $i (0 .. $#c) {
            my ($special, $htmlchar) = split /\s+/, $c[$i];
            print "$special : $htmlchar\n";
            $string =~ s/$special/$htmlchar/ig;     ## this is generating the error message
        }

        return $string;
    }

Note I also swapped away from your C-style for loop: why this is a good idea is discussed in node 723825. As well, it's a good idea to `use strict; use warnings;` to avoid needless headaches.
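The effect of the unaccounted-for whitespace can be seen without the file at all. This is a minimal sketch, assuming a hypothetical mapping line with padding spaces after the character (as the output in the question suggests); the sample line and variable names are illustrative, not taken from the poster's chars.txt.

```perl
use strict;
use warnings;

# Hypothetical line from the mapping file: the character (U+00E9),
# three padding spaces, a tab, the entity, and a trailing newline.
my $line = "\x{E9}   \t&eacute;\n";

# Splitting on a single tab keeps the padding and the newline.
my ($raw_char, $raw_entity) = split /\t/, $line;

# chomp plus a split on runs of whitespace yields clean fields.
chomp(my $clean = $line);
my ($char, $entity) = split /\s+/, $clean;

printf "split on tab:  char length %d, entity ends with newline: %s\n",
    length $raw_char, ($raw_entity =~ /\n\z/ ? 'yes' : 'no');
printf "chomp + split: char length %d, entity: %s\n",
    length $char, $entity;
```

With the padded field, the pattern in `s/$special/$htmlchar/` would only match the character when it is followed by the same run of spaces, which is why the cleanup matters even once the encoding problem is solved.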
Re: utf-8 problem by eff_i_g (Curate) on Jan 29, 2009 at 22:09 UTC
You may find HTML::Entities useful.
Re^2: utf-8 problem by Anonymous Monk on Jan 29, 2009 at 22:16 UTC
The file that the program reads is a tab delimited file with the first item being the special character and the second item being the html equivalent. Here is the file: `Ă Á ĂĄ á Ă É ĂŠ Í; Ă í Ă Ñ Ăą ñ Ă Ó Ăł ó Ă Ú Ăş ú Ă Ü Ăź ü Âż ¿ ÂĄ ¡`
Re^3: utf-8 problem by eff_i_g (Curate) on Jan 29, 2009 at 22:21 UTC
How about this?

    use HTML::Entities;
    use Encode;

    my $s1 = 'El supersĂłnico de los Indi';
    print "Start string: $s1\n";

    my $s2 = encode_entities(decode('utf8', $s1));
    print "End string: $s2\n";
Re^3: utf-8 problem by kennethk (Abbot) on Jan 29, 2009 at 22:29 UTC
eff_i_g has pointed out a nice wheel for you. I strongly suspect the file on your hard drive is not formatted like you think it is.
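One way to test that suspicion: the error message names start byte 0xc1 followed by 0x20. Byte 0xC1 never occurs in well-formed UTF-8, but in ISO-8859-1 it is Á, so the file may have been saved in a single-byte encoding. The sketch below is only an illustration under that assumption; `decode_guess` is a hypothetical helper, and the ISO-8859-1 fallback is a guess that the thread does not confirm. It tries a strict UTF-8 decode and falls back when that fails.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Decode raw bytes as strict UTF-8, falling back to ISO-8859-1.
# The fallback encoding is an assumption; the file's real encoding
# is not known from the thread.
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text
        ? ($text, 'UTF-8')
        : (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

# 0xC1 followed by a space is the byte pair named in the error message.
my ($latin1_text, $latin1_enc) = decode_guess("\xC1 \t&Aacute;");

# 0xC3 0x81 is the UTF-8 encoding of the same character (U+00C1).
my ($utf8_text, $utf8_enc) = decode_guess("\xC3\x81\t&Aacute;");

print "first sample decoded as $latin1_enc\n";
print "second sample decoded as $utf8_enc\n";
```

If the file does turn out to be ISO-8859-1, the corresponding change in the original program would be to open it with `"<:encoding(ISO-8859-1)"` instead of `"<:utf8"`.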
Re^3: utf-8 problem by almut (Canon) on Jan 29, 2009 at 22:36 UTC
That doesn't look like proper UTF-8 to me, though that might merely be a result of the various transformation steps involved in getting the file posted here. It would be better if you could show a hex dump of the file. For example, if you're on Linux, there's typically a tool called `hexdump` available, which you could use...
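Where `hexdump` is not available, Perl itself can show the bytes. A minimal sketch follows; `hex_bytes` is a hypothetical helper, and the two sample strings are illustrative rather than taken from the poster's file.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated two-digit hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# With a real file, read it in raw mode so no decoding layer is applied:
#   open my $fh, '<:raw', 'chars.txt' or die "chars.txt: $!";
#   print hex_bytes($_), "\n" while <$fh>;

# Illustrative samples: the same mapping line in two encodings.
print hex_bytes("\xC1\t&Aacute;"),     "\n";   # ISO-8859-1 bytes
print hex_bytes("\xC3\x81\t&Aacute;"), "\n";   # UTF-8 bytes
```

A line that starts with `c1` alone would point to a single-byte encoding, while `c3 81` would indicate that the character really is stored as UTF-8.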
Re: utf-8 problem by Marshall (Canon) on Jan 30, 2009 at 15:04 UTC
I offer some straightforward code. Nothing really fancy and no Perl modules needed. This is a translation problem which screams hash table and look-up. Update: hash table gen is easier than first posted.
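Marshall's code is not reproduced above, so the following is not his implementation, only a minimal sketch of the hash-table-and-look-up idea. The table is an illustrative subset, the keys are written as code point escapes so the example does not depend on the source file's encoding, and `%entity_for` is a hypothetical name.

```perl
use strict;
use warnings;

# Illustrative subset of the mapping from character to HTML entity.
my %entity_for = (
    "\x{C1}" => '&Aacute;',
    "\x{E1}" => '&aacute;',
    "\x{F3}" => '&oacute;',
    "\x{F1}" => '&ntilde;',
    "\x{BF}" => '&iquest;',
);

# One alternation built from the keys; quotemeta guards any key that
# would otherwise be special inside a regex.
my $pattern = join '|', map { quotemeta($_) } keys %entity_for;

sub fix_special_characters {
    my ($string) = @_;
    # Each match is replaced by looking it up in the table.
    $string =~ s/($pattern)/$entity_for{$1}/g;
    return $string;
}

print fix_special_characters("El supers\x{F3}nico"), "\n";
```

The substitution runs once over the string instead of once per table row, and it drops the `/i` flag on purpose, since upper- and lower-case letters map to different entities.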
Re^2: utf-8 problem by ikegami (Patriarch) on Jan 30, 2009 at 15:26 UTC
`map { my ($char, $hmtl) = split} split /\n/, <<'END';` is the same as `split ' ', <<'END';`
Re^3: utf-8 problem by Marshall (Canon) on Jan 30, 2009 at 17:49 UTC
Quite correct! Even easier is just `%table = qw( a value_a b value_b );`. I amended my post. I'm not an HTML guy and I suspect that some other escape chars are needed around the translation. But hopefully the combined hints here will help this guy get going!
Re: utf-8 problem by bichonfrise74 (Vicar) on Jan 31, 2009 at 17:39 UTC
Try adding: `use bytes;`