I'm no expert on this, but based on my own limited experience (and watching others getting broader experience with it), I'd say it depends on which (human) languages you're dealing with, how many web sites, and which web sites.
If you're talking about sites/pages that are "mostly English with a few funny characters", the pages will often just use HTML entities (e.g. &eacute; for é, etc).
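For that easy case, a minimal sketch of turning those entities back into real characters with the HTML::Entities module (part of the HTML-Parser distribution; the sample string is just for illustration):

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Entity-encoded text as a "mostly English" page might serve it
my $html = 'caf&eacute; &amp; cr&egrave;me br&ucirc;l&eacute;e';

# decode_entities() replaces named and numeric entities in place
my $text = decode_entities($html);

print $text, "\n";   # café & crème brûlée
```

That one call handles both named entities (&eacute;) and numeric ones (&#233;), which covers most of what such pages throw at you.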
Sometimes, the character encoding is specified somewhere in a MIME header or the HTML header, or maybe even in HTML comments. (This is typical if it's an open-standard character set, or a widely-used commercial one, like UTF-8, ISO-8859-whatever, Big5, Shift-JIS, GBK, etc.)
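A rough sketch of checking those two places, assuming you already have the raw Content-Type header and page body (hard-coded here for illustration): look in the MIME header first, then fall back to a <meta> tag in the HTML itself.

```perl
use strict;
use warnings;

# Stand-ins for a fetched page; in real code these come from your HTTP client
my $content_type = 'text/html; charset=ISO-8859-1';
my $html = '<html><head><meta http-equiv="Content-Type"'
         . ' content="text/html; charset=ISO-8859-1"></head></html>';

# 1) charset declared in the Content-Type (MIME) header
my ($charset) = $content_type =~ /charset=["']?([\w-]+)/i;

# 2) otherwise, charset declared in a <meta> tag in the HTML header
($charset) = $html =~ /<meta[^>]+charset=["']?([\w-]+)/i
    unless $charset;

print "declared charset: ", ($charset // 'unknown'), "\n";
# declared charset: ISO-8859-1
```

Of course, plenty of pages declare nothing at all (or declare the wrong thing), which is where the pain described below begins.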
Other times (especially when the site is presenting stuff in two or more languages on the same page), the character-set info is tucked away in font tags. Worst of all are the South Asian languages (Hindi, Bengali, Tamil, etc), where the font rendering is kinda tough, and various major web sites come up with very different solutions -- i.e. incompatible font encodings -- which means that when you visit one of these sites for the first time, you have to download their font in order to read the text. Converting this stuff to any sort of standard character set is a supreme pain in the a**.
Basically, the answer is: there is no general solution -- but if your task is limited to a few sites/languages/character sets, you can get something to work within those bounds.
Hi Graff,
Thanks for the info. I found out that the sites I grab data from use the ISO-8859 standard, whereas the Java program that I had expected UTF-8 encoding.
Once that was clear, it was a piece of cake: Perl has all the modules to encode/decode to and from whatever format a sane web page will ever use. Hats off to Perl.
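For anyone hitting the same mismatch, the conversion the reply describes is a one-liner with the core Encode module -- a minimal sketch, with the sample bytes made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as an ISO-8859-1 page would serve them: "café"
my $latin1_bytes = "caf\xe9";

# Decode the Latin-1 bytes into Perl's internal string form,
# then re-encode them as UTF-8 for the downstream (Java) consumer
my $text       = decode('iso-8859-1', $latin1_bytes);
my $utf8_bytes = encode('utf-8', $text);

print "utf-8 bytes: ", unpack('H*', $utf8_bytes), "\n";
# utf-8 bytes: 636166c3a9  (the single 0xE9 becomes 0xC3 0xA9)
```

Encode is in the Perl core, so no CPAN install is needed for this round-trip.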
Ranjan