in reply to Unicode, HTML, POST
The Monastery is rendered in the Latin-1 character set. To get a character outside that set, you'll have to use an &-escape. You appear to be using a different character set, one that puts c-caron at the same position in the 8-bit range that e-grave occupies in Latin-1.
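To make the mix-up concrete, here is a minimal sketch of one byte read under two character sets. It assumes the "different character set" is ISO-8859-2 (Latin-2), which does put c-caron at 0xE8; the actual character set involved is a guess on my part.

```python
# Assumption: the poster's character set is ISO-8859-2 (Latin-2).
# Nothing in the thread confirms which set is actually in use.
raw = bytes([0xE8])  # a single byte, as it would sit in the stored page

as_latin1 = raw.decode("latin-1")      # how a Latin-1 page reads the byte
as_latin2 = raw.decode("iso-8859-2")   # how a Latin-2 setup reads the same byte

# ascii() keeps the output printable on any terminal.
print(ascii(as_latin1), ascii(as_latin2))

# A numeric &-escape is plain ASCII, so it means the same thing
# under either character set.
escape = "&#{};".format(ord(as_latin2))
print(escape)
```

The byte is identical in both cases; only the interpretation changes, which is why the numeric escape is the safer way to write the character on a Latin-1 page.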
Now, searching gets complicated. You have a lot of choices when it comes to storing the information and searching through it. The Monastery uses a scheme that makes your kind of search rather difficult. I hope Google uses a better scheme, but I don't know.
At the Monastery, we store your nodes in HTML encoded in Latin-1. So, for example, say you want to find nodes that contain the "currency" character. Well, here are three currency characters that all look the same but all have to be searched for differently: ¤, ¤, and ¤. View the HTML source for this page to see how they are different.
For the first one, you'd want to search for ¤ which is the character itself in Latin-1, chr(164). For the second one, you could search for #164; because it is stored as &#164; (6 characters). For the last one, you could search for curren; because it is stored as &curren; (8 characters).
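The three stored forms can be generated mechanically. The following is only a sketch: `stored_forms` is a hypothetical helper name, and it lists the full stored strings, whereas the search terms suggested above drop the leading &.

```python
from html.entities import codepoint2name

def stored_forms(ch):
    """Return the distinct strings one character might be stored as."""
    code = ord(ch)
    forms = [ch, "&#{};".format(code)]   # raw character, numeric escape
    name = codepoint2name.get(code)      # None if HTML 4 has no named escape
    if name is not None:
        forms.append("&{};".format(name))
    return forms

forms = stored_forms(chr(164))  # the currency character
for form in forms:
    print(ascii(form), len(form))
```

A search that wants to catch every spelling would need one query per entry in `forms`.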
Now Google might be smart enough to store the pages in Unicode rather than something like Latin-1. And Google might be smart enough to let you search for ¤ and find it whether it was encoded in any of those three ways. And Google might be smart enough to let you submit queries in Unicode so that you can specify characters that aren't in Latin-1. And Google might let you use escapes to encode characters that you want to search for so you could specify &#164; in your search. But I don't know if Google does any of those things or not.
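For illustration, this is roughly what "smart enough" could look like: decode the escapes in both the stored text and the query before comparing. It is a sketch using Python's `html.unescape`, not a description of what Google or the Monastery actually does; `normalize` and the sample strings are made up for the example.

```python
import html

def normalize(text):
    """Replace numeric and named &-escapes with the characters they stand for."""
    return html.unescape(text)

# The same content stored three different ways (hypothetical samples).
variants = ["price in \u00a4", "price in &#164;", "price in &curren;"]
normalized = [normalize(v) for v in variants]

# The query is normalized the same way, so an escape works as a search term.
query = normalize("&curren;")
hits = [query in text for text in normalized]
print(hits)
```

With both sides normalized, one query matches all three stored forms.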
So, I'm sorry that this probably doesn't answer your main question. But I hope it clears up some points of confusion you appear to have related to that question. It might even give you some ideas on things you can try now.
Update: Another piece of the puzzle is that web pages can be rendered in different character sets (the headers for the web page, which you usually don't see, tell your browser which character set to expect). Since Google has to deal with all of these different character sets, I'd be a bit surprised if they don't cache them in Unicode.
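Here is a small sketch of that header-driven step: read the charset from a Content-Type header and decode the page bytes to Unicode accordingly. The function names are hypothetical, and falling back to ISO-8859-1 when no charset is given is an assumption of the sketch.

```python
from email.message import Message

def charset_from_header(content_type, default="iso-8859-1"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is present.
    return msg.get_content_charset() or default

def to_unicode(body, content_type):
    """Decode raw page bytes using the charset the header declares."""
    return body.decode(charset_from_header(content_type))

same_byte = bytes([0xE8])
a = to_unicode(same_byte, "text/html; charset=ISO-8859-1")
b = to_unicode(same_byte, "text/html; charset=ISO-8859-2")
print(ascii(a), ascii(b))
```

Once every page has been decoded this way, the stored text no longer depends on which character set the original server used.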
But I wouldn't be at all surprised if Google does not go to the extra work of translating &-escapes into native Unicode characters before the page is stored nor of translating &-escapes in search requests. Some quick test searches appear to validate that Google does indeed not do this extra work. So you'll probably have to perform multiple searches if you want to find references that got rendered using &-escapes. (/update)
Update2: Which brings up a question. I know that I can send an HTTP request in Unicode. But how, in Latin-1 HTML, can I specify a link that would result in an HTTP request that contains Unicode characters (especially within a CGI parameter value) that aren't part of Latin-1? Is it impossible?
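One approach that may be worth trying is to percent-encode the UTF-8 bytes of the character, since the resulting link is pure ASCII and so fits in a Latin-1 page. This is only a sketch: the URL and the parameter name `q` are hypothetical, and it assumes the receiving CGI script decodes the parameter bytes as UTF-8, which is not guaranteed.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical script URL and parameter name, for illustration only.
base = "http://example.com/search.pl"

# urlencode() percent-encodes the UTF-8 bytes of each value by default.
query = urlencode({"q": "\u010d"})  # c-caron, which Latin-1 lacks
link = '<a href="{}?{}">search</a>'.format(base, query)
print(link)

# The link is pure ASCII, so it can sit in a Latin-1 page unchanged.
link.encode("ascii")

# Assumption: the receiving side also treats the bytes as UTF-8.
decoded = parse_qs(query)["q"][0]
print(ascii(decoded))
```

Whether the character arrives intact depends entirely on the script at the other end agreeing on UTF-8.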
- tye (but my friends call me "Tye")
Re: (tye)Re: Unicode, HTML, POST
by hossman (Prior) on Apr 13, 2002 at 01:51 UTC