Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to read a file of special (accented) characters and then replace each of those characters in my string with its HTML equivalent. Here is my program:

    use Encode;
    use utf8;
    #use open IO => ':locale';

    #my $s = "El supersónico de los Indi ";
    my $s1 = "El supero de los Indi ";
    #$s1 = decode_utf8( $s);

    print "\n\nStart string: $s1\n\n";
    my $s2 = &fix_special_characters($s1);
    print"\nEnd string: $s2\n\n";

    sub fix_special_characters {
        my($string) = @_;
        open(C,"<:utf8","chars.txt");
        my @c = <C>;
        for(my $i=0; $i < @c; $i++) {
            my ($special,$htmlchar) = split(/\t/,$c[$i]);
            print "$special : $htmlchar";
            $string =~ s/$special/$htmlchar/ig;   ## this is generating the error message
        }
        return $string;
    }

This is the output when the substitution line (line 30) is commented:

    Start string: El supero de los Indi

    Á : &Aacute;
    á : &aacute;
    É : &Eacute;
    é : &eacute;
    Í : &Iacute;
    í : &iacute;
    Ñ : &Ntilde;
    ñ : &ntilde;
    Ó : &Oacute;
    ó : &oacute;
    Ú : &Uacute;
    ú : &uacute;
    Ü : &Uuml;
    ü : &uuml;
    ¿ : &iquest;
    ¡ : &iexcl;
    End string: El supero de los Indi

However, when I uncomment that substitution line I get the following error messages for every line in the char file:

    Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in regexp compilation at sp.pl line 30, <C> line 16.
    Malformed UTF-8 character (unexpected non-continuation byte 0x20, immediately after start byte 0xc1) in regexp compilation at sp.pl line 30, <C> line 16.

I have spent hours trying different methods to make this work with no luck. Any monks out there that can help with this? Thank you

Replies are listed 'Best First'.
Re: utf-8 problem
by kennethk (Abbot) on Jan 29, 2009 at 22:07 UTC

    Since your chars.txt file is an essential part of your program, it would help to post it, particularly because I ran your code as posted and did not get that error.

    I also note from your output formatting that at the least you have some newlines and spaces in your file that are not accounted for in your code. What happens when you substitute the following into your code?

        sub fix_special_characters {
            my($string) = @_;
            open(C,"<:utf8","chars.txt");
            my @c = <C>;
            chomp @c;
            for my $i (0 .. $#c) {
                my ($special, $htmlchar) = split /\s+/, $c[$i];
                print "$special : $htmlchar\n";
                $string =~ s/$special/$htmlchar/ig;   ## this is generating the error message
            }
            return $string;
        }

    Note that I also swapped out your C-style for loop; why this is a good idea is discussed in 723825. It is also a good idea to use strict; use warnings; to avoid needless headaches.
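
The suggestions in this reply (strict, warnings, a foreach-style loop) can be combined into one routine. The following is only a sketch, assuming chars.txt really is UTF-8 and whitespace-delimited. The helper name load_entity_map, the checked three-argument open with a lexical filehandle, the \Q...\E quoting, and the removal of the /i flag are additions made here, not part of the code posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read "char<whitespace>entity" pairs from a UTF-8 file.
sub load_entity_map {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @pairs;
    while (my $line = <$fh>) {
        chomp $line;
        my ($special, $htmlchar) = split ' ', $line;
        next unless defined $htmlchar;    # skip blank or malformed lines
        push @pairs, [$special, $htmlchar];
    }
    close $fh;
    return @pairs;
}

sub fix_special_characters {
    my ($string, @pairs) = @_;
    for my $pair (@pairs) {
        my ($special, $htmlchar) = @$pair;
        # \Q...\E stops characters such as "?" from being read as regex syntax.
        # No /i flag: on decoded text, /i would let "Á" match "á" as well and
        # replace it with the wrong entity.
        $string =~ s/\Q$special\E/$htmlchar/g;
    }
    return $string;
}
```

A call would then look like fix_special_characters($s1, load_entity_map('chars.txt')), where $s1 is a decoded character string.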

Re: utf-8 problem
by eff_i_g (Curate) on Jan 29, 2009 at 22:09 UTC
      The file that the program reads is a tab delimited file with the first item being the special character and the second item being the html equivalent. Here is the file:
      à &Aacute; á &aacute; à &Eacute; é &Iacute;; í &iacute; à &Ntilde; ñ &ntilde; à &Oacute; ó &oacute; à &Uacute; ú &uacute; à &Uuml; ü &uuml; ¿ &iquest; ¡ &iexcl;
        How about this?

            use HTML::Entities;
            use Encode;

            my $s1 = 'El supersónico de los Indi';
            print "Start string: $s1\n";
            my $s2 = encode_entities(decode('utf8', $s1));
            print "End string: $s2\n";

        1. eff_i_g has pointed out a nice wheel for you.
        2. I strongly suspect the file on your hard drive is not formatted like you think it is.

        That doesn't look like proper UTF-8 to me, though that might merely be a result of the various transformation steps involved in getting the file posted here.

        It would be better if you could show a hex dump of the file. For example, if you're on Linux, there's typically a tool called "hexdump" available, which you could use...
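
If hexdump is not available, Perl itself can show the raw bytes. This is a small sketch, assuming the file is small enough to read in one go; hex_bytes is a hypothetical helper name. In a genuine UTF-8 file, Á is stored as the two bytes c3 81. A lone c1 followed by 09 (tab) or 20 (space) would instead suggest a single-byte encoding such as ISO-8859-1, which would be consistent with the "start byte 0xc1" in the error message.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a file as space-separated hex pairs.
sub hex_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                                  # slurp mode: read everything at once
    my $data = <$fh>;
    close $fh;
    $data = '' unless defined $data;           # an empty file yields undef
    return join ' ', map { sprintf '%02x', ord($_) } split //, $data;
}

# Usage: perl hexbytes.pl chars.txt
print hex_bytes($ARGV[0]), "\n" if @ARGV;
```

If the dump shows single bytes for the accented characters, opening the file with '<:encoding(ISO-8859-1)' instead of '<:utf8' is one thing to try.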

Re: utf-8 problem
by Marshall (Canon) on Jan 30, 2009 at 15:04 UTC
    I offer some straightforward code: nothing really fancy, and no Perl modules needed. This is a translation problem that screams hash table and look-up. Update: generating the hash table is easier than in my first post.
      map { my ($char, $hmtl) = split} split /\n/, <<'END';

      is the same as

      split ' ', <<'END';
        Quite correct! Even easier is just %table = qw( a value_a b value_b);. I amended my post. I'm not an HTML guy and I suspect that some other escape chars are needed around the translation. But hopefully the combined hints here will help this guy get going!
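
For readers following along, the hash-table approach discussed in this subthread might look roughly like the following. It is a sketch, not Marshall's original code: the table entries are taken from the entity list in the question, and %entity_for and to_entities are names invented here. It assumes the script itself is saved as UTF-8 and that the input has already been decoded to a character string.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Lookup table built with split ' ' on a here-document.
my %entity_for = split ' ', <<'END';
Á &Aacute;   á &aacute;
É &Eacute;   é &eacute;
Í &Iacute;   í &iacute;
Ñ &Ntilde;   ñ &ntilde;
Ó &Oacute;   ó &oacute;
Ú &Uacute;   ú &uacute;
Ü &Uuml;     ü &uuml;
¿ &iquest;   ¡ &iexcl;
END

# One character class made from all keys; each match is replaced by lookup.
my $class = join '', map { quotemeta($_) } keys %entity_for;

sub to_entities {
    my ($text) = @_;    # must be a decoded character string, not raw bytes
    $text =~ s/([$class])/$entity_for{$1}/g;
    return $text;
}

print to_entities('El supersónico de los Indi'), "\n";
# prints: El supers&oacute;nico de los Indi
```

A single pass over the string with a hash lookup also avoids re-scanning the text once per table entry, which the loop-of-substitutions version does.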
Re: utf-8 problem
by bichonfrise74 (Vicar) on Jan 31, 2009 at 17:39 UTC
    Try adding:

    use bytes;