COMSCBoy has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm new here, so be gentle. How would I convert ASCII code (from HTML) to UTF-8 in Perl (and back again)? I'm writing a Welsh online spell checker, and the program only understands UTF-8, because of ô and the other accented letters. Thanks, Aled.

Replies are listed 'Best First'.
Re: ASCII to UTF-8
by IlyaM (Parson) on Jul 26, 2002 at 11:08 UTC
    You do not have to convert ASCII to UTF-8. ASCII is a subset of UTF-8.
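
    A quick way to check this for yourself. This is only a minimal sketch, assuming Perl 5.8 or later where the Encode module ships with the core; the sample string is just an illustration. Encoding pure-ASCII text as UTF-8 leaves every byte unchanged:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample text containing only ASCII characters.
my $ascii = "Nid oedd gwall yn y frawddeg";

# Encoding ASCII text as UTF-8 yields exactly the same bytes.
my $utf8_bytes = encode('UTF-8', $ascii);

print $utf8_bytes eq $ascii ? "identical\n" : "different\n";
```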

    Converting Latin-1 to UTF-8 is another question. For that, take a look at Unicode::String.

    --
    Ilya Martynov (http://martynov.org/)

Re: ASCII (latin1) to UTF-8 (with sub latin1_to_utf8 & utf8_to_latin1)
by gmpassos (Priest) on Jul 26, 2002 at 23:16 UTC
    Use these two functions:

    sub latin1_to_utf8 {
        return pack( 'U*', unpack( 'C*', $_[0] ) );
    }

    sub utf8_to_latin1 {
        return pack( 'C*', unpack( 'U*', $_[0] ) );
    }

    It would be better to use the utf8 or Encode modules. The advantage of these two functions is that they work on a standard Perl 5.6 install, with no need to add new modules. If you are using Perl 5.8, use utf8; take a look at the pod, at 'perlunicode' and the 'utf8' manpage (I haven't put any links here because this changes a lot between Perl versions; see the docs for your release).
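
    For comparison, here is roughly what the same two conversions might look like with Encode. This is a sketch only, assuming Perl 5.8 or later; the function names are hypothetical, and unlike the pack/unpack versions these take and return plain byte strings:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper names (not from the thread).
# Latin-1 bytes in, UTF-8 encoded bytes out.
sub latin1_bytes_to_utf8_bytes {
    my ($latin1) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $latin1));
}

# UTF-8 encoded bytes in, Latin-1 bytes out.
# Characters that do not exist in Latin-1 are replaced with a
# substitution character when Encode runs in its default mode.
sub utf8_bytes_to_latin1_bytes {
    my ($utf8) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $utf8));
}

my $latin1 = "t\xF4";    # "tô" as Latin-1 bytes
my $utf8   = latin1_bytes_to_utf8_bytes($latin1);

# Show the resulting bytes in hex, separated by dots.
printf "%vX\n", $utf8;
```

    Going through decode and encode keeps the "bytes vs. characters" distinction explicit, which is the part that tends to differ between 5.6 and 5.8.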

    Don't use this:
    tr///UC;
    This doesn't work anymore!

    * I made the changes IlyaM suggested!

    "The creativity is the expression of the liberty".
      Please s/ascii/latin1/. Converting ASCII to UTF-8 is nonsense: it is a no-op.

      --
      Ilya Martynov (http://martynov.org/)

      I needed something cheap that would work on both 5.6.x and 5.8.x. (I'm trying to keep the code the same while moving back and forth between systems.) I tried your utf8_to_latin1() and ran into problems on 5.8.x. The version below seems to work on both. Thanks for the code.

      sub utf8_to_latin1 {
          # return( pack( "U0C*", unpack( "U*", $_[0] ) ) );
          return( pack( "C*", unpack( "U0U*", $_[0] ) ) );
      }


      Updated: (see sig) So I go and feed the original (commented out) line into the big program and it fails. It worked in the test program on both 5.6 and 5.8. Wander all over and find the 5.8 perluniintro where they explicitly say

      $native_string = pack("C*", unpack("U*", $Unicode_string));
      just like the original poster. But that didn't work alike on both Perl versions. Played around some more and hit upon the other variant above. This now works in both test and 'real' programs on both Perl versions. (sigh)

      --
      I'm a pessimist about probabilities; I'm an optimist about possibilities. - - Lewis Mumford

Re: ASCII to UTF-8
by COMSCBoy (Initiate) on Jul 26, 2002 at 12:18 UTC
    What I need to do is take input from an HTML form, make sure it's in UTF-8, and pass it to a method which spell checks it. At the moment it spell checks fine, but letters like â, ê and ô come back as "ê", "a" (weird looking: smaller and raised) and a weird-looking "?", in that order. I think I need to change them to Latin-1 for printout. And how would I convert w^ and y^ (ŵ and ŷ) to UTF-8? Thanks for all the help, greatly appreciated. :-) Aled.
      Why convert from Latin-1 to UTF-8 at all? As I understand it, your spellchecker uses UTF-8. All you have to do is ask the browser to use the same charset. Send the header Content-Type: text/html; charset=UTF-8 and you are done. Form input will arrive in UTF-8 and you can output UTF-8 too.
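
      On the ŵ and ŷ question: those two letters are not in Latin-1 at all, which is one more reason to stay in UTF-8 end to end. A minimal sketch, assuming Perl 5.8 or later with the core Encode module; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Welsh w-circumflex (U+0175) and y-circumflex (U+0177), written as
# code points so the example does not depend on the file's encoding.
my $chars = "\x{175}\x{177}";

# Encode the character string to UTF-8 bytes for output.
my $bytes = encode('UTF-8', $chars);
print length($chars), " characters, ", length($bytes), " bytes\n";

# Decode UTF-8 bytes (for example, raw form input) back to characters.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

      Each of these letters takes two bytes in UTF-8, so anything that counts or splits the text should work on the decoded characters, not on the raw bytes.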

      --
      Ilya Martynov (http://martynov.org/)

Re: ASCII to UTF-8 or Spectrum to Blue.
by frankus (Priest) on Jul 26, 2002 at 14:45 UTC
    This is for the Welsh language; it's a cool language.
    Not sure why the problem is occurring but...
    I'm rewriting the code to be a bit more concise, so that I might be able to:
    1. see what it does
    2. understand your request ;)

    use strict;
    use CGI qw/:standard/;
    use CGI::Carp qw(fatalsToBrowser);
    use HTML::Template;
    use Data::Dumper;
    use Unicode::String;
    use lemma;

    my $q = new CGI;
    my $template = "bbc/lemma/wsillafu.tmpl";
    my $tmpl = new HTML::Template( filename => $template, associate => $q );

    # Consider using an array to make getting and setting easier.
    # You could then use array slices to set them ;)
    my $errorchecker = 0;
    my $memoryoverloadchecker = 0;
    my $memorychecker = 0;
    my $dictionarychecker = 0;
    my $otherchecker = 0;
    my $memoryfinder = 0;

    print $q->header();

    if ($q->param()) {
        my ( $spellingcheck, $error ) = lookup($q->param("brawddeg")); # Is this pronounced brahtheg?
        $tmpl->param( spellingcheck => $spellingcheck );
        ($errorchecker, $memoryfinder) = (1,1) if $error == 1;
        ($memoryoverloadchecker, $memoryfinder) = (1,1) if $error == -1;
        ($memorychecker, $memoryfinder) = (1,1) if $error == -2;
        ($dictionarychecker, $memoryfinder) = (1,1) if $error == -3;
        ($otherchecker, $memoryfinder) = (1,1) if $error == -4;
    }

    print $tmpl->output;

    if( $dictionarychecker ) {
        print "<br><br>Nid yw\'r geiriadur ar-lein ar gael. E-bostiwch meistr y wefan am gymorth \n";
    }
    if( $otherchecker ) {
        print "<br><br>Mae gwall yn y gwirydd. E-bostwich meistr y wefan am gymorth \n";
    }
    if( $memoryoverloadchecker ) {
        print "<br><br>Roedd mwy na 10 gwall yn y testun. Dim ond y 10 gwall cyntaf sydd wedi ei cywiro.<br> Cywirwch rhain cyn cario ymlaen i gywiro.\n";
    }
    if( $memorychecker ) {
        print "<br><br>Mae'r gwirydd wedi rhedeg allan o gof. E-bostiwch meistr y wefan am gymorth.\n";
    }
    if( $errorchecker ) {
        print "<br><br><font color=\"red\"><b>Mae rhai gwallau yng nghorff y testun.</b></font><br> Dewiswch air â gynhigir yn y blwch tynnu i lawr i bob gair sydd wedi ei sillafu'n anghywir.
<br>Yna gwagswch y botwm gwirio isod i gael eich testun wedi ei wirio.<br>\n";
        print "<form name=\"ArgraffyddTestun\" method=\"POST\" Action=\"wsillafudangos.pl\">\n";
        print "<input type=\"Submit\" name=\"Gwirio\" value=\"Gwirio\"></form>\n";
    }
    unless( $memoryfinder ) {
        print "<br><br><b>Nid oedd gwall yn y frawddeg</b>\n";
    }

    sub lookup { # Don't think you want a prototype as you had it ;)
        $_ = shift; # the @_ is redundant, it's implicit.
        my ($errors, $str) = lemma::GetCysillSpellingErrors($_, 10);
        my @fields = split ("(<awgrymiadau>[^<]*</awgrymiadau>)", $str);
        my $error = @fields > 1;
        my @spellingcheck = ();
        return (\@spellingcheck, 'odd amount of values') unless @fields % 2;

        # Remove the last two items
        my $last_field = pop(@fields);
        pop(@fields);

        my ($key, $value);
        while ( ($key, $value, @fields) = @fields ) {
            $value =~ s/<[^>]*>//g;
            my @SuggestionList = split /,/, $value; # Consider Text::CSV here ;)
            # Saves a bit of space. The inner +{ ... } makes it unambiguous
            # that each element becomes an anonymous hash.
            my @suggestions = map { +{ suggestion => $_ } } @SuggestionList;
            # Create an anonymous hash
            push @spellingcheck, { text => $key, thesuggestions => \@suggestions };
        }
        return ([ @spellingcheck, { text => $last_field } ], $errors);
    }
    This should do the same things, with a bit more clarity.

    However, I can't see anything here that refers to Latin-1 or UTF-8... where is that glitch occurring?

    --

    Brother Frankus.

    ¤

Re: ASCII to UTF-8
by COMSCBoy (Initiate) on Jul 26, 2002 at 13:09 UTC
    OK, but I'm not sure where to put that, as my script doesn't work in quite the way you think. Here's the script, if you can help. I know it looks very messy, but I'll clean it up after it works. ;-) Thanks again, Aled.
    ------------------------------------
    use strict ;
    use CGI qw/:standard/;
    use CGI::Carp qw(fatalsToBrowser);
    use HTML::Template;
    use Data::Dumper;
    use Unicode::String;

    my $q = new CGI;
    print $q->header();

    #my $template = "wsillafu.tmpl";
    my $template = "bbc/lemma/wsillafu.tmpl";
    my $tmpl = new HTML::Template( filename => $template, associate => $q );

    my $errorchecker = "0";my $memoryoverloadchecker = "0"; my $memorychecker = "0";
    my $dictionarychecker = "0";my $otherchecker = "0";my $memoryfinder = "0";

    if ($q->param())
    {
        (my $spellingcheck, my $error) = lookup($q->param("brawddeg"));
        #print Dumper( $spellingcheck );
        #print "error = $error<br>";
        $tmpl->param( spellingcheck => $spellingcheck );
        if($error == 1){$errorchecker = "1"; $memoryfinder = "1";}
        if($error == -1){$memoryoverloadchecker = "1"; $memoryfinder = "1";}
        if($error == -2){$memorychecker = "1"; $memoryfinder = "1";}
        if($error == -3){$dictionarychecker = "1"; $memoryfinder = "1";}
        if($error == -4){$otherchecker = "1"; $memoryfinder = "1";}
    }
    print $tmpl->output;

    if( $dictionarychecker eq "1" ) {
        print "<br><br>Nid yw\'r geiriadur ar-lein ar gael. E-bostiwch meistr y wefan am gymorth \n";
    }
    if( $otherchecker eq "1" ) {
        print "<br><br>Mae gwall yn y gwirydd. E-bostwich meistr y wefan am gymorth \n";
    }
    if( $memoryoverloadchecker eq "1" ) {
        print "<br><br>Roedd mwy na 10 gwall yn y testun. Dim ond y 10 gwall cyntaf sydd wedi ei cywiro.<br> Cywirwch rhain cyn cario ymlaen i gywiro.\n";
    }
    if( $memorychecker eq "1" ) {
        print "<br><br>Mae'r gwirydd wedi rhedeg allan o gof. E-bostiwch meistr y wefan am gymorth.\n";
    }
    if( $errorchecker eq "1" ) {
        print "<br><br><font color=\"red\"><b>Mae rhai gwallau yng nghorff y testun.</b></font><br> Dewiswch air â gynhigir yn y blwch tynnu i lawr i bob gair sydd wedi ei sillafu'n anghywir.
<br>Yna gwagswch y botwm gwirio isod i gael eich testun wedi ei wirio.<br>\n";
        print "<form name=\"ArgraffyddTestun\" method=\"POST\" Action=\"wsillafudangos.pl\">\n";
        print "<input type=\"Submit\" name=\"Gwirio\" value=\"Gwirio\"></form>\n";
    }
    if( $memoryfinder eq "0" ) {print "<br><br><b>Nid oedd gwall yn y frawddeg</b>\n";}

    use lemma;

    sub lookup()
    {
        $_ = shift(@_);
        #print "Paramater: $_ \n";
        #my $br = "";
        #$br->String::latin1( $_ );
        #print "Latin1 $br \n";
        (my $errors, my $str) = lemma::GetCysillSpellingErrors($_, 10);
        #print "$str\n";
        my @fields = split ("(<awgrymiadau>[^<]*</awgrymiadau>)", $str);
        #print Dumper(@fields);
        my $error = ( @fields > 1);
        my @spellingcheck = ();
        for(my $i=0; $i<@fields; $i+=2) {
            my %pair;
            if($i+1 == @fields) {
                %pair = (text=> @fields[$i]);
            } else {
                @fields[$i+1] =~ s/<[^>]*>//g;
                my @SuggestionList = split /,/, @fields[$i+1];
                #print "SuggestionList = ",Dumper(@SuggestionList);
                my @suggestions = ();
                my $suggestion;
                foreach $suggestion (@SuggestionList) {
                    my %hash = (suggestion=> $suggestion);
                    push @suggestions, \%hash;
                }
                %pair = (text=> @fields[$i], thesuggestions=>\@suggestions);
            }
            push @spellingcheck, \%pair;
        }
        #print Dumper(@spellingcheck);
        return (\@spellingcheck, $errors) ;
    }
    -----------------------

      If I'm correct in guessing that you're asking where to put the header information to set the character set, change your line:

      print $q->header();

      to the following (untested):

      print $q->header(-type => 'text/html', -charset => 'UTF-8' );

      -rattus

      __________
      He seemed like such a nice guy to his neighbors / Kept to himself and never bothered them with favors
      - Jefferson Airplane, "Assassin"