in reply to Re^2: encoding data containing non english characters so as to be decoded by non perl program
in thread encoding data containing non english characters so as to be decoded by non perl program
If you're talking about sites/pages that are "mostly English with a few funny characters", the pages will often just use HTML entities (e.g. &eacute; for é, etc.).
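For the entity case, here is a minimal sketch of turning entities back into real characters and writing them out as UTF-8 so that any program can read them. It is in Python only because the consumer in this thread is a non-Perl program. The helper name `decode_entities` and the sample string are made up for illustration; `html.unescape` is the standard-library call doing the work.

```python
import html

def decode_entities(text: str) -> str:
    """Replace named and numeric HTML entities with the characters they stand for."""
    return html.unescape(text)

# Made-up sample: mostly ASCII with a few named entities and one numeric entity.
sample = "caf&eacute; cr&egrave;me br&ucirc;l&eacute;e &#8364;5"
decoded = decode_entities(sample)

# Encode explicitly before handing the data to another program.
utf8_bytes = decoded.encode("utf-8")
```

Once the text is plain Unicode, the receiving side only needs to know it is getting UTF-8.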
Sometimes, the character encoding is specified somewhere in a MIME header or the HTML header, or maybe even in HTML comments. (This is typical if it's an open-standard character set, or a widely-used commercial one, like utf8, iso8859-whatever, Big5, ShiftJIS, GBK, etc.)
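When the page does declare its character set, picking it up can look roughly like the sketch below. The function names, the 2048-byte scan window, and the `iso-8859-1` fallback are all assumptions for illustration, not a standard. It does not look at HTML comments, and a real crawler would want a proper HTML parser rather than a regex.

```python
import re
from typing import Optional

# Matches charset=... inside a Content-Type value, quoted or not.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
# Covers <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">.
_META_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

def charset_from_content_type(value: str) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type header value, if present."""
    m = _CHARSET_RE.search(value)
    return m.group(1).lower() if m else None

def declared_charset(http_content_type: Optional[str], body: bytes,
                     default: str = "iso-8859-1") -> str:
    """Return the charset from the header, else from a meta tag, else the assumed default."""
    if http_content_type:
        cs = charset_from_content_type(http_content_type)
        if cs:
            return cs
    # Only scan the start of the document; decode leniently since the charset is still unknown.
    head = body[:2048].decode("ascii", errors="replace")
    m = _META_RE.search(head)
    if m:
        return m.group(1).lower()
    return default
```

Usage would be something like `text = body.decode(declared_charset(header, body))`, bearing in mind that pages sometimes declare one charset and actually use another.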
Other times (especially when the site is presenting stuff in two or more languages on the same page), the character-set info is tucked away in font tags. Worst of all are the South Asian languages (Hindi, Bengali, Tamil, etc.) where the font rendering is kinda tough, and various major web sites come up with very different solutions -- i.e. incompatible font encodings -- which means that when you visit one of these sites the first time, you have to download their font in order to read the text. Converting this stuff to any sort of standard character set is a supreme pain in the a**.
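For the font-encoded sites, the usual approach is a hand-built lookup table per font. The sketch below shows only the shape of that idea: the font "ExampleFont" and its three byte-to-character assignments are entirely hypothetical, and a real table has to be worked out from the actual font file. Fonts like this often store glyphs in visual order as well, so a real converter usually needs a reordering pass that this sketch leaves out.

```python
from typing import Dict

# HYPOTHETICAL table for an imaginary site font "ExampleFont":
# byte value used in the page -> the Unicode (Devanagari) character it displays.
EXAMPLE_FONT_TO_UNICODE: Dict[int, str] = {
    0x41: "\u0915",  # DEVANAGARI LETTER KA
    0x42: "\u0916",  # DEVANAGARI LETTER KHA
    0x43: "\u0917",  # DEVANAGARI LETTER GA
}

def font_bytes_to_unicode(raw: bytes, table: Dict[int, str],
                          unknown: str = "\ufffd") -> str:
    """Map each byte through the table; unmapped bytes become U+FFFD so gaps stay visible."""
    return "".join(table.get(b, unknown) for b in raw)

converted = font_bytes_to_unicode(b"ABC", EXAMPLE_FONT_TO_UNICODE)
```

Keeping one table per site/font is what makes the "limited to a few sites" case workable: each table is tedious to build, but the conversion code itself stays trivial.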
Basically, the answer is: there is no general solution -- but if your task is limited to a few sites/languages/character sets, you can get something to work within those bounds.
Re^4: encoding data containing non english characters so as to be decoded by non perl program
by ranjan_jajodia (Monk) on Sep 23, 2004 at 14:59 UTC |