Solved: (see original post below this solution).

Problem: working on my Mac, I had $var = "dinámico", which would print as dinÃ¡mico.

Reason: the non-English text was stored in my Cocoa-based text editor as UTF-8. Printing it directly would print gobbledegook. Even trying to escape it using HTML::Entities would not work, as that would just escape gobbledegook.

Solution: First decode the UTF-8, then escape the decoded string using HTML::Entities. So, with use Encode; and use HTML::Entities; in place, encode_entities(decode_utf8($var)) does the trick. Now I get din&aacute;mico, which displays as dinámico in the browser.
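
For anyone who wants to see the difference, here is a minimal sketch of the decode-then-escape order described above. The bytes are written with \x escapes purely so the example does not depend on how the file is saved (that is my assumption for the demo, not how the original script stored the text), and the variable names $wrong and $right are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Entities qw(encode_entities);

# The raw UTF-8 bytes of "dinámico": the accented letter is two bytes, C3 A1.
my $var = "din\xC3\xA1mico";

# Escaping the undecoded bytes treats each byte as its own character,
# so the single accented letter turns into two unrelated entities.
my $wrong = encode_entities($var);

# Decoding first collapses the two bytes into one character (U+00E1),
# which is then escaped as a single named entity.
my $right = encode_entities(decode_utf8($var));

print "$wrong\n";   # din&Atilde;&iexcl;mico
print "$right\n";   # din&aacute;mico
```

The first line of output is the "escaped gobbledegook" case; the second is what the browser renders correctly.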

The following web page served from my Macbook Pro, plain html, stock untinkered Apache, renders just fine.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>test of accents</title>
</head>
<body>
  <h1>English</h1>
  <p>explores the complex dynamic between people and conservation 
  as part of the mission</p>
  <h1>Spanish</h1>
  <p>explora el complejo dinámico entre la gente y la conservación 
  como parte de la misión</p>
  <h1>Portuguese</h1>
  <p>explora a dinâmica complexa entre pessoas e a conservação 
  como parte da missão</p>
</body>
</html>

The same page served via Perl/HTML::Template/CGI::App mechanism renders like crap, unless I go to my browser and change the text encoding to Mac-Roman (this is not required in the above case that works just fine, as is).

# in my Perl script
sub getInterfaceText {
  my ($lang) = @_;

  my ($msg);

  my %text = (
    "en" => {
      "msg" => qq|
        explores the complex dynamic between people and 
        conservation as part of the mission
      |,
    },
    "es" => {
      "msg" => qq|
        explora el complejo dinámico entre la gente y la 
        conservación como parte de la misión
      |,
    },
    "pt" => {
      "msg" => qq|
        explora a dinâmica complexa entre pessoas e a 
        conservação como parte da missão
      |,
    },
  );

  return $text{$lang}->{msg};
}

my $lang = $cgi->param('lang') || 
  substr(lc $ENV{"HTTP_ACCEPT_LANGUAGE"}, 0, 2) || 
  "en";
my $msg = getInterfaceText($lang);
$tmpl->param(LANG => $lang, MSG => $msg);
#----
# in my html page retrieved as http://path/to/webpage/?lang=es
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>test of accents</title>
</head>
<body>
  <h1><TMPL_VAR LANG></h1>
  <p><TMPL_VAR MSG></p>
</body>
</html>

I know my solution lies somewhere at the intersection of content-negotiation, unicode, and such mysteries that I know little about. I need to host this on a shared web server (aka, plain-vanilla, not-in-my-control webserver). The application also involves a database in which stuff is stored, and that stuff also has accents which similarly get clobbered when they are displayed in a web form and then updated. What can I do so this doesn't happen?

Update: Background explanation in the hope that it might lead to a better solution -- I am making an application that will be served in many different languages as far as the interface text is concerned. I could, of course, make separate html templates for each language, add language suffixes (.en, .es, .pt and so on), make sure the text in each of the templates is html escaped (why some needs to be while other doesn't still escapes me!), and serve based on $ENV{"HTTP_ACCEPT_LANGUAGE"} or explicitly chosen language or whatever. The problem is that I would have to maintain all those different templates. Make a change in one, and I would have to make a change in all. By making one template, and substituting the text strings accordingly... well, you get the idea... it is a lot better... one template to maintain. Right now I have 3 languages. I will probably end up getting 3 or 4 more languages.
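
To make the one-template idea concrete, here is a rough sketch of the language selection step with a fallback, so that a request for a language with no translations still gets English instead of an empty string. The helper name pick_lang and the way the available languages are passed in are hypothetical, not part of my actual script, and it needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the interface language.
# Tries the explicit ?lang= value first, then the first two letters of
# the Accept-Language header, and falls back to "en" otherwise.
sub pick_lang {
    my ($requested, $accept, @available) = @_;
    my %known = map { $_ => 1 } @available;

    for my $candidate ($requested, substr(lc($accept // ''), 0, 2)) {
        return $candidate if defined $candidate && $known{$candidate};
    }
    return 'en';
}

my @available = qw(en es pt);

print pick_lang(undef, 'pt-BR,pt;q=0.9', @available), "\n";  # pt
print pick_lang('fr',  'es-ES',          @available), "\n";  # es
print pick_lang(undef, undef,            @available), "\n";  # en
```

Adding a language then means adding one entry to the text hash and one code to the list, with no new template.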

--

when small people start casting long shadows, it is time to go to bed

In reply to accents and diacritical marks in a web page by punkish
