jmay has asked for the wisdom of the Perl Monks concerning the following question:

(Unicode novice alert)

I have the string "Janáček, Leoš" in a Perl variable. The three accented characters should be a-acute, c-caron, and s-caron (looks like the Perlmonks app is converting my c-caron into an e-grave).

Using Unicode codepoints, this is:

my $str = "Jan\x{00E1}\x{010D}ek, Leo\x{0161}";

I want to process this so that it (a) displays correctly in a sufficiently-recent browser, and (b) is correctly encoded into a URL - specifically to construct a Google link to search for Janacek references.

I've figured out (a), but I'm getting stuck on (b). If you cut and paste the string into Google, it turns it into "Janá&#269;ek, Leo&#353;" (which is fine, 269=0x010D). I'm hunting for modules that will generate a string that can be cleanly transmitted to Google via a URL.

I've tried fooling with Unicode::String, HTML::Entities, and others. No luck so far. Suggestions appreciated.

It looks like a lot of sites have problems confusing c-caron and e-grave (codepoints 010D and 00E8). If you search Google for Janacek-with-e-grave, you'll get more results than for Janacek-with-c-caron. I have no idea why this is.

I'm setting charset=utf-8 everywhere: HTTP header, HTML HEAD meta tag, and FORM accept-charset attribute.

-Jason

(tye)Re: Unicode, HTML, POST
by tye (Sage) on Apr 12, 2002 at 22:10 UTC

    The Monastery is rendered in the Latin-1 character set. To get something outside of that character set, you'll have to use an &-escape. You appear to be using a different character set that puts c-caron in the same spot in the 8-bit range that e-grave occupies in Latin-1.
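As an illustration of the &-escape approach, here is a minimal Perl sketch that turns every non-ASCII character into a numeric character reference, so the result can be dropped into a page served as Latin-1. The helper name to_numeric_entities is made up for this example and does not come from any module; it also does not escape &, < or >, which a real page would need.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each non-ASCII character with a
# decimal numeric character reference (&#NNN;).
sub to_numeric_entities {
    my ($chars) = @_;
    $chars =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $chars;
}

my $str = "Jan\x{00E1}\x{010D}ek, Leo\x{0161}";
print to_numeric_entities($str), "\n";   # Jan&#225;&#269;ek, Leo&#353;
```

Because the output is pure ASCII, it should survive whatever 8-bit character set the page itself is declared in.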

    Now, searching gets complicated. You have a lot of choices when it comes to storing the information and searching through it. The Monastery uses a scheme that makes your kind of search rather difficult. I hope Google uses a better scheme, but I don't know.

    At the Monastery, we store your nodes in HTML encoded in Latin-1. So, for example, say you want to find nodes that contain the "currency" character. Well, here are three currency characters that all look the same but all have to be searched for differently: ¤, ¤, and ¤. View the HTML source for this page to see how they are different.

    For the first one, you'd want to search for ¤, which is the character itself in Latin-1, chr(164). For the second one, you could search for #164; because it is stored as &#164; (6 characters). For the last one, you could search for curren; because it is stored as &curren; (8 characters).
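To make the three storage forms concrete, here is a rough Perl sketch that folds the numeric and named escapes back into the raw character, so that all three compare equal. The decode_escapes helper and its one-entry %named table are assumptions for illustration only; a real application would want a complete entity table (the HTML::Entities module mentioned in the question is one place to look).

```perl
use strict;
use warnings;

# Tiny illustrative table; a real one would cover all HTML named entities.
my %named = (curren => 164);

# Hypothetical helper: fold numeric and named &-escapes into characters.
# Unknown named escapes are left untouched.
sub decode_escapes {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr($1)/ge;
    $html =~ s/&([A-Za-z]+);/exists $named{$1} ? chr($named{$1}) : "&$1;"/ge;
    return $html;
}

# The three ways the currency sign can be stored in a node.
my @stored = ("\x{A4}", '&#164;', '&curren;');
my @folded = map { decode_escapes($_) } @stored;

my $mismatches = grep { $_ ne "\x{A4}" } @folded;
print $mismatches ? "differ\n" : "all match\n";
```

A search that normalised both the stored text and the query this way would only need one lookup instead of three.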

    Now Google might be smart enough to store the pages in Unicode rather than something like Latin-1. And Google might be smart enough to let you search for ¤ and find it whether it was encoded in any of those three ways. And Google might be smart enough to let you submit queries in Unicode so that you can specify characters that aren't in Latin-1. And Google might let you use escapes to encode characters that you want to search for so you could specify ¤ in your search. But I don't know if Google does any of those things or not.

    So, I'm sorry that this probably doesn't answer your main question. But I hope it clears up some points of confusion you appear to have related to that question. It might even give you some ideas on things you can try now.

    Update: Another piece of the puzzle is that web pages can be rendered in different character sets (the headers for the web page, which you usually don't see, tell your browser which character set to expect). Since Google has to deal with all of these different character sets, I'd be a bit surprised if they don't cache them in Unicode.

    But I wouldn't be at all surprised if Google does not go to the extra work of translating &-escapes into native Unicode characters before the page is stored nor of translating &-escapes in search requests. Some quick test searches appear to validate that Google does indeed not do this extra work. So you'll probably have to perform multiple searches if you want to find references that got rendered using &-escapes. (/update)

    Update2: Which brings up a question. I know that I can send a HTTP request in Unicode. But how, in Latin-1 HTML, can I specify a link that would result in a HTTP request that contains Unicode characters (especially within a CGI parameter value) that aren't part of Latin-1? Is it impossible?

            - tye (but my friends call me "Tye")
      Update2: Which brings up a question. I know that I can send a HTTP request in Unicode. But how, in Latin-1 HTML, can I specify a link that would result in a HTTP request that contains Unicode characters (especially within a CGI parameter value) that aren't part of Latin-1? Is it impossible?
      What you can't do...

      While you are correct that HTTP requests can contain Unicode, that's only true in the body of the HTTP request, after the headers have already been sent and both a MIME type and a character set (indicating that the body will be in some particular Unicode character set) have been provided. You cannot use Unicode (or characters from any arbitrary character set) in the URL of the request itself, because HTTP has no mechanism for specifying what the character set might be.

      (See RFC 2396 for the specifics, in particular section 2.1. Yes, the RFC is 4 years old, but I can't find anything that supersedes it.)

      What you can do ...

      RFC 2718 clarifies the use of non-US-ASCII characters in URLs: characters should be converted (from whatever your current character set is) to their UTF-8 representation, and then (if necessary) each byte should be escaped into its %xx hex notation.

      Whether or not people/systems you send your URL to understand how to put those % escaped bytes back together as a UTF-8 character is between you and them. (Hopefully they're smart enough to pay attention to the first byte)
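The convert-to-UTF-8-then-%xx recipe can be sketched in Perl roughly as follows, using only the core Encode module. The helper names utf8_percent_encode and utf8_percent_decode are invented for this example, the set of characters left unescaped is a deliberately conservative choice, and the Google URL shape (including the ie=UTF-8 parameter) is an assumption rather than something confirmed in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: character string -> UTF-8 bytes -> %XX escapes.
sub utf8_percent_encode {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    # Conservative: escape every byte except ASCII letters, digits, - _ . ~
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper for the receiving side: %XX escapes -> bytes -> characters.
sub utf8_percent_decode {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $str   = "Jan\x{00E1}\x{010D}ek, Leo\x{0161}";
my $query = utf8_percent_encode($str);
print "$query\n";                        # Jan%C3%A1%C4%8Dek%2C%20Leo%C5%A1

# Assumed URL shape; the ie=UTF-8 hint is an assumption about Google.
my $url = "http://www.google.com/search?ie=UTF-8&q=$query";
print "$url\n";

# Round trip: the receiver gets the original characters back.
print utf8_percent_decode($query) eq $str ? "round trip ok\n" : "round trip failed\n";
```

Note that c-caron (U+010D) becomes the two bytes C4 8D, so it can no longer be mistaken for the single Latin-1 byte E8 that stands for e-grave.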

Re: Unicode, HTML, POST
by Ovid (Cardinal) on Apr 12, 2002 at 20:40 UTC