(tye)Re: Unicode, HTML, POST

The Monastery is rendered in the Latin-1 character set. To get something outside of that character set, you'll have to use an &-escape. You appear to be using a different character set that puts c-caron at the same spot in the 8-bit range where e-grave sits in Latin-1.
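To make that collision concrete, here is a minimal Python sketch. It assumes the other character set is ISO-8859-2 (Latin-2); the post does not name it, and Windows-1250 happens to put c-caron on the same byte.

```python
# The same single byte, read under two different 8-bit character sets.
# Assumption: the "different character set" is ISO-8859-2 (Latin-2).
raw = bytes([0xE8])

as_latin1 = raw.decode("iso-8859-1")   # e-grave, U+00E8
as_latin2 = raw.decode("iso-8859-2")   # c-caron, U+010D

# A numeric &-escape names the Unicode code point, so it means the same
# thing no matter which 8-bit set the page itself is served in.
c_caron_escape = "&#{};".format(ord(as_latin2))

print(hex(ord(as_latin1)), hex(ord(as_latin2)), c_caron_escape)
```

So a page served as Latin-1 would need the escape form to show c-caron, since the raw byte would be displayed as e-grave.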

Now, searching gets complicated. You have a lot of choices when it comes to storing the information and searching through it. The Monastery uses a scheme that makes your kind of search rather difficult. I hope Google uses a better scheme, but I don't know.

At the Monastery, we store your nodes in HTML encoded in Latin-1. So, for example, say you want to find nodes that contain the "currency" character. Well, here are three currency characters that all look the same but all have to be searched for differently: ¤, ¤, and ¤. View the HTML source for this page to see how they are different.

For the first one, you'd want to search for ¤, which is the character itself in Latin-1, chr(164). For the second one, you could search for #164; because it is stored as &#164; (6 characters). For the last one, you could search for curren; because it is stored as &curren; (8 characters).
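A rough Python sketch of why each form needs its own search. The list of stored strings is made up for illustration and is not how the Monastery actually stores or searches nodes.

```python
# Three ways the same currency sign can sit in stored HTML source.
literal = chr(164)      # the Latin-1 character itself (1 character)
numeric = "&#164;"      # numeric escape (6 characters)
named = "&curren;"      # named escape (8 characters)

stored_nodes = [literal, numeric, named]   # hypothetical stored node text

def matching_forms(query):
    """Plain substring search: only matches the spelling you typed."""
    return [node for node in stored_nodes if query in node]

for query in ("#164;", "curren;"):
    print(query, "->", matching_forms(query))
print([len(node) for node in stored_nodes])
```

Each query finds exactly one of the three spellings, which is why finding every occurrence takes three separate searches.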

Now Google might be smart enough to store the pages in Unicode rather than something like Latin-1. And Google might be smart enough to let you search for ¤ and find it whether it was encoded in any of those three ways. And Google might be smart enough to let you submit queries in Unicode so that you can specify characters that aren't in Latin-1. And Google might let you use escapes to encode characters that you want to search for so you could specify &#164; in your search. But I don't know if Google does any of those things or not.
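For what it's worth, the "smart" scheme is not hard to sketch. The Python below decodes &-escapes in both the stored pages and the query before comparing. This is only an illustration of the idea, with made-up page text; it says nothing about what Google or the Monastery really do.

```python
import html

def normalize(text):
    """Decode &-escapes so every spelling becomes the same Unicode string."""
    return html.unescape(text)

def search(query, pages):
    """Return indexes of pages containing the query, after normalizing both sides."""
    needle = normalize(query)
    return [i for i, page in enumerate(pages) if needle in normalize(page)]

# Hypothetical stored pages: literal character, numeric escape, named escape.
pages = [
    "price: " + chr(164) + "5",
    "price: &#164;5",
    "price: &curren;5",
    "no symbol here",
]
print(search("&#164;", pages))
```

With normalization on both sides, any of the three spellings used as a query matches all three pages that contain the character.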

So, I'm sorry that this probably doesn't answer your main question. But I hope it clears up some points of confusion you appear to have related to that question. It might even give you some ideas on things you can try now.

Update: Another piece of the puzzle is that web pages can be rendered in different character sets (the headers for the web page, which you usually don't see, tell your browser which character set to expect). Since Google has to deal with all of these different character sets, I'd be a bit surprised if they don't cache them in Unicode.
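As a sketch of what "the headers tell your browser which character set to expect" means in practice, the Python below pulls the charset out of a Content-Type header value and decodes the page bytes to Unicode. The function name and the Latin-1 fallback are my assumptions, not anything a particular browser or crawler is known to do.

```python
from email.message import Message

def decode_body(content_type, body, default="iso-8859-1"):
    """Decode raw page bytes to a Unicode str using the charset in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Assumption: fall back to Latin-1 when the header names no charset.
    charset = msg.get_content_charset() or default
    return body.decode(charset)

same_byte = bytes([0xE8])
print(ascii(decode_body("text/html; charset=ISO-8859-1", same_byte)))
print(ascii(decode_body("text/html; charset=ISO-8859-2", same_byte)))
```

The same byte comes out as two different Unicode characters depending on the header, so a cache that stores the decoded Unicode text no longer has to care which character set each page arrived in.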

But I wouldn't be at all surprised if Google does not go to the extra work of translating &-escapes into native Unicode characters before the page is stored nor of translating &-escapes in search requests. Some quick test searches appear to validate that Google does indeed not do this extra work. So you'll probably have to perform multiple searches if you want to find references that got rendered using &-escapes. (/update)

Update2: Which brings up a question. I know that I can send a HTTP request in Unicode. But how, in Latin-1 HTML, can I specify a link that would result in a HTTP request that contains Unicode characters (especially within a CGI parameter value) that aren't part of Latin-1? Is it impossible?

- tye (but my friends call me "Tye")


Replies are listed 'Best First'.

Re: (tye)Re: Unicode, HTML, POST
by hossman (Prior) on Apr 13, 2002 at 01:51 UTC

Update2: Which brings up a question. I know that I can send a HTTP request in Unicode. But how, in Latin-1 HTML, can I specify a link that would result in a HTTP request that contains Unicode characters (especially within a CGI parameter value) that aren't part of Latin-1? Is it impossible?

What you can't do...

While you are correct that HTTP requests can contain Unicode, that's only true in the body of the HTTP request -- after the headers have already been sent, and both a MIME type and a character set (indicating that the body will be in some particular Unicode encoding) have been provided. You cannot use Unicode (or characters from any arbitrary character set) in the URL of the request itself, because HTTP has no mechanism for specifying what the character set might be.
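Here is a small Python sketch of that split. It assembles the raw bytes of a POST by hand so you can see that the request line and headers stay ASCII while only the body carries the encoded text. The host, path, and text/plain MIME type are placeholders I picked for illustration.

```python
def build_post(host, path, text, charset="utf-8"):
    """Assemble the raw bytes of an HTTP/1.1 POST whose body is `text` in `charset`."""
    body = text.encode(charset)
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: text/plain; charset={charset}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    # Request line and headers are plain ASCII; only the body holds encoded text.
    return head.encode("ascii") + body

# chr(0x10D) is c-caron, a character outside Latin-1.
request = build_post("example.invalid", "/search", chr(0x10D))
head, _, body = request.partition(b"\r\n\r\n")
print(head.decode("ascii"))
print(body)
```

Note that Content-Length counts bytes, not characters: the single c-caron takes two bytes in UTF-8.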

(See RFC 2396 for the specifics, in particular section 2.1. Yes, the RFC is 4 years old, but I can't find anything that supersedes it.)

What you can do...

RFC 2718 clarifies the use of non-US-ASCII characters in URLs: characters should be converted (from whatever your current character set is) to their UTF-8 representation, and then (if necessary) each byte should be escaped into its %xx hex notation.
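In Python that recipe looks roughly like the sketch below. The host name and the "q" parameter are hypothetical. The resulting link is pure ASCII, so it can sit in a Latin-1 HTML page without trouble.

```python
from urllib.parse import quote, unquote

c_caron = chr(0x10D)                    # a character outside Latin-1
utf8_bytes = c_caron.encode("utf-8")    # step 1: convert to UTF-8 (two bytes)
escaped = quote(c_caron, safe="", encoding="utf-8")   # step 2: %xx each byte

# Hypothetical link; host and parameter name are made up.
link = '<a href="http://example.invalid/search?q=' + escaped + '">search</a>'
print(escaped)
print(link)

# The receiver only recovers the character if it also assumes UTF-8.
restored = unquote(escaped, encoding="utf-8")
misread = unquote(escaped, encoding="iso-8859-1")   # two wrong characters
print(len(restored), len(misread))
```

The last two lines show the catch: the same escaped bytes decode to one character under UTF-8 and to two unrelated characters if the receiver assumes Latin-1.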

Whether or not the people/systems you send your URL to understand how to put those %-escaped bytes back together as UTF-8 characters is between you and them. (Hopefully they're smart enough to pay attention to the first byte.)
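"Paying attention to the first byte" works because the lead byte of a UTF-8 sequence tells you how many bytes belong to the character. A rough Python sketch follows. The function names are mine, and the byte ranges follow the current four-byte definition of UTF-8, which is an assumption on my part about what a receiver would implement.

```python
from urllib.parse import unquote_to_bytes

def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if first_byte < 0x80:
        return 1            # 0xxxxxxx: plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2            # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3            # 1110xxxx
    if 0xF0 <= first_byte <= 0xF4:
        return 4            # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

def split_characters(data):
    """Group UTF-8 bytes per character, looking only at lead bytes."""
    groups, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n
    return groups

raw = unquote_to_bytes("a%C4%8Db")   # undo the %xx escapes first
print(split_characters(raw))
```

This sketch does not validate the continuation bytes; a real decoder would also check that each one has the form 10xxxxxx.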
