accents and diacritical marks in a web page

punkish has asked for the wisdom of the Perl Monks concerning the following question:

Solved: (see original post below this solution).

Problem: working on my Mac, I had $var = "dinámico" which would print like dinÃ¡mico

Reason: the non-English text was stored in my Cocoa-based text editor as UTF-8. Printing it directly would print gobbledegook. Even trying to escape it using HTML::Entities would not work, as that would just escape gobbledegook.

Solution: First decode the UTF-8, then encode it using HTML::Entities. So, with `use Encode; use HTML::Entities;` in place, `encode_entities(decode_utf8($var))` does the trick. Now I get `din&aacute;mico`, which prints fine in the browser.
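A minimal sketch of the decode-then-escape step described above, assuming the string reaches Perl as raw UTF-8 bytes. The bytes are written with `\x` escapes so the example does not depend on how the file is saved, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes for "dinámico": the accented letter takes two bytes.
my $bytes = "din\xC3\xA1mico";

# Escaping the undecoded bytes escapes each byte separately (the gobbledegook).
my $wrong = encode_entities($bytes);

# Decoding first turns the two bytes into one character, which then
# maps to a single named entity.
my $right = encode_entities(decode_utf8($bytes));   # din&aacute;mico

print "without decode: $wrong\n";
print "with decode:    $right\n";
```

The escaped result is plain ASCII, so it displays the same whatever charset the page declares.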

The following web page served from my Macbook Pro, plain html, stock untinkered Apache, renders just fine.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>test of accents</title>
</head>
<body>
  <h1>English</h1>
  <p>explores the complex dynamic between people and conservation 
  as part of the mission</p>
  <h1>Spanish</h1>
  <p>explora el complejo dinámico entre la gente y la conservación 
  como parte de la misión</p>
  <h1>Portuguese</h1>
  <p>explora a dinâmica complexa entre pessoas e a conservação 
  como parte da missão</p>
</body>
</html>

The same page served via Perl/HTML::Template/CGI::App mechanism renders like crap, unless I go to my browser and change the text encoding to Mac-Roman (this is not required in the above case that works just fine, as is).

# in my Perl script
sub getInterfaceText {
  my ($lang) = @_;

  my ($msg);

  my %text = (
    "en" => {
      "msg" => qq|
        explores the complex dynamic between people and 
        conservation as part of the mission
      |,
    },
    "es" => {
      "msg" => qq|
        explora el complejo dinámico entre la gente y la 
        conservación como parte de la misión
      |,
    },
    "pt" => {
      "msg" => qq|
        explora a dinâmica complexa entre pessoas e a 
        conservação como parte da missão
      |,
    },
  );

  return $text{$lang}->{msg};
}

my $lang = $cgi->param('lang') || 
  substr(lc $ENV{"HTTP_ACCEPT_LANGUAGE"}, 0, 2) || 
  "en";
my $msg = getInterfaceText($lang);
$tmpl->param(LANG => $lang, MSG => $msg);
#----
# in my html page retrieved as http://path/to/webpage/?lang=es
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>test of accents</title>
</head>
<body>
  <h1><TMPL_VAR LANG></h1>
  <p><TMPL_VAR MSG></p>
</body>
</html>

I know my solution lies somewhere at the intersection of content-negotiation, unicode, and such mysteries that I know little about. I need to host this on a shared web server (aka, plain-vanilla, not-in-my-control webserver). The application also involves a database in which stuff is stored, and that stuff also has accents which similarly get clobbered when they are displayed in a web form and then updated. What can I do so this doesn't happen?

Update: Background explanation in the hope that it might lead to a better solution -- I am making an application that will be served in many different languages as far as the interface text is concerned. I could, of course, make separate html templates for each language, add language suffixes (.en, .es, .pt and so on), make sure the text in each of the templates is html escaped (why some needs to be while other doesn't still escapes me!), and serve based on $ENV{"HTTP_ACCEPT_LANGUAGE"} or explicitly chosen language or whatever. The problem is that I would have to maintain all those different templates. Make a change in one, and I would have to make a change in all. By making one template, and substituting the text strings accordingly... well, you get the idea... it is a lot better... one template to maintain. Right now I have 3 languages. I will probably end up getting 3 or 4 more languages.
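The single-template approach depends on always resolving the request to a language that actually has text. The sketch below is one way that could look, assuming only `en`, `es` and `pt` exist as in the post. `pick_lang` is a hypothetical helper, not part of CGI or HTML::Template, and it deliberately ignores the q-values in the Accept-Language header.

```perl
use strict;
use warnings;

# Languages the application has interface text for (from the post).
my %supported = map { $_ => 1 } qw(en es pt);

# pick_lang (hypothetical): try the explicit ?lang= value first, then each
# entry of the Accept-Language header in order, and fall back to English.
sub pick_lang {
    my ($param, $accept) = @_;
    for my $candidate ($param, split /\s*,\s*/, $accept // '') {
        next unless defined $candidate;
        # Keep only the two-letter primary tag: "pt-BR;q=0.9" becomes "pt".
        my ($code) = lc($candidate) =~ /^\s*([a-z]{2})/ or next;
        return $code if $supported{$code};
    }
    return 'en';
}

print pick_lang(undef, 'fr-FR,es;q=0.5'), "\n";
```

Unlike taking the first two characters of the header, this never hands the template a language code for which no text exists.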

--

when small people start casting long shadows, it is time to go to bed

Re: accents and diacritical marks in a web page by shmem (Chancellor) on Sep 09, 2007 at 23:53 UTC
I guess you can solve your problem with HTML::Entities.

--shmem
Re^2: accents and diacritical marks in a web page by punkish (Priest) on Sep 10, 2007 at 03:19 UTC
Yes indeed, very nice, thank you. Of course, this means I have to run everything through HTML::Entities. As I said elsewhere, besides the various interface text, I also have data in a db (currently SQLite) with text in various languages, and that data already has accents as required. When I bring that data up in forms, it shows correctly, but when the form is sent back updated to the database, the accents get clobbered. What it means is that HTML::Entities has to become the gateway for all text going in or out. What a pain. Yet, when I write up the plain text file without escaping all the non-ASCII text, it shows up just fine in the browser.
Re: accents and diacritical marks in a web page by Joost (Canon) on Sep 10, 2007 at 00:09 UTC
"...renders like crap, unless I go to my browser and change the text encoding to Mac-Roman"

OK,

1. What does "like crap" mean?

2. Your HTML `<meta>` tag claims the HTML printed from your Perl code is UTF-8 encoded. Usually that means you should set `binmode(STDOUT,":utf8")` prior to printing anything (also: you should make sure you read any non-7-bit-ASCII input correctly).

"What should it profit a man, if he should win a flame war, yet lose his cool?"
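A rough sketch of that advice: decode bytes on the way in, and let an encoding layer re-encode on the way out. The form value is a hypothetical stand-in for what a browser would send. The output goes to an in-memory filehandle only so the resulting bytes can be inspected. In a CGI script the same layer would be applied with `binmode(STDOUT, ':encoding(UTF-8)')`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw form value: the UTF-8 bytes for "misión".
my $raw_param = "misi\xC3\xB3n";

# Input side: turn the bytes into Perl characters.
my $text = decode('UTF-8', $raw_param);

# Output side: an encoding layer converts characters back to UTF-8 bytes.
my $out = '';
open(my $fh, '>:encoding(UTF-8)', \$out) or die "cannot open in-memory handle: $!";
print {$fh} $text;
close($fh);

printf "%d characters in, %d bytes out\n", length($text), length($out);
```

With both ends handled this way, the strings inside the program are character strings throughout, and the bytes sent match what the `<meta>` tag promises.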
Re^2: accents and diacritical marks in a web page by punkish (Priest) on Sep 10, 2007 at 00:28 UTC
"1. what does "like crap" mean?"

like crap means

`explora el complejo dinÃ¡mico entre la gente y la conservaciÃ³n como parte de la misiÃ³n`

instead of

`explora el complejo dinámico entre la gente y la conservación como parte de la misión`

"2. your html <meta> tag claims the HTML printed from your perl code is utf-8 encoded..."

that means I don't know UTF from my butt. I was simply trying different meta tags to make my web page claim something that would be understood by my browser -- I tried utf-8 along with x-mac-roman as well as iso-8859-1. I was just shooting in the dark hoping something would stick, until I got the brilliant idea that I should ask those who know more than I do, that is, you monks.

update: and note my OP above... this same utf-8 incantation exists in the plain html file with different languages, and that renders just fine on the same computer, same web server, everything. It's just that when put through my perl application, the languages get mangled.
Re^3: accents and diacritical marks in a web page by Joost (Canon) on Sep 10, 2007 at 11:24 UTC
Seeing this, and reading your update, it's clear that the perl script itself is UTF-8 encoded. In that case you probably should use the utf8 pragma to indicate that that is the case (basically it just means you don't have to decode() all your string literals). AFAIK you can then use HTML::Entities directly, or set STDOUT to :utf8 as I suggested above, or both.
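To illustrate what the pragma changes, here is a small sketch that assumes the script file is saved as UTF-8. The same literal is read once with the pragma switched off in a block and once with it on. The variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # tells Perl the source file itself is UTF-8
use Encode qw(encode);

# With the pragma disabled, the literal is a string of raw bytes.
my $bytes = do { no utf8; "dinámico" };

# With the pragma in effect, the same literal is a string of characters.
my $chars = "dinámico";

printf "no utf8: length %d, use utf8: length %d\n", length($bytes), length($chars);

# A character string still has to be encoded on output, for example via a
# layer on STDOUT, as suggested earlier in the thread.
binmode(STDOUT, ':encoding(UTF-8)');
print "$chars\n";
```

The byte version is what needed `decode_utf8` in the solution at the top of the thread. The character version can be passed to other functions as it is.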
Re^4: accents and diacritical marks in a web page by punkish (Priest) on Sep 10, 2007 at 16:12 UTC
Re^5: accents and diacritical marks in a web page by Joost (Canon) on Sep 10, 2007 at 16:24 UTC