I'm no expert on this, but based on my own limited experience (and watching others getting broader experience with it), I'd say it depends on which (human) languages you're dealing with, how many web sites, and which web sites.

If you're talking about sites/pages that are "mostly English with a few funny characters", the pages will often just use HTML entities (e.g. é for é, etc).

Sometimes, the character encoding is specified somewhere in a MIME header or the HTML header, or maybe even in HTML comments. (This is typical if it's an open-standard character set, or a widely-used commercial one, like utf8, iso8859-whatever, Big5, ShiftJIS, GBK, etc.)

Other times (especially when the site is presenting stuff in two or more lanuages on the same page), the character-set info is tucked away in font tags. Worst of all are the Southeast Asian languages (Hindi, Bengali, Tamil, etc) where the font rendering is kinda tough, and various major web sites come up with very different solutions -- i.e. incompatible font encodings -- which means that when you visit one of these sites the first time, you have to download their font in order to read the text. Converting this stuff to any sort of standard character set is a supreme pain in the a**.

Basically, the answer is: there is no general solution -- but if your task is limited to a few sites/languages/character sets, you can get something to work within those bounds.


In reply to Re^3: encoding data containing non english characters so as to be decoded by non perl program by graff
in thread encoding data containing non english characters so as to be decoded by non perl program by ranjan_jajodia

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.