What do you mean by "parse a utf8-encoded RSS stream"?

If you mean there is wide-character content (utf8-encoded) in your input, and you need to translate that into an "equivalent" single-byte encoding, then you are being too vague about the problem. Can/does the input contain data in multiple languages, and might that force you to choose one or another "iso-8859-*" depending on the language? (There are fifteen different flavors of iso-8859, numbered up to iso-8859-16; it could make a big difference whether you need just one of them or more than one of them for your input.)

Also, depending on how unicode is being used in your source data, you might need something besides iso-8859, if you need to preserve stuff like "specialized" versions of quotation marks, dashes, etc. (that's cp12* territory, e.g. cp1252). Converting these to their plain-ASCII equivalents is easy enough, if appropriate, and just takes a little bit of study on what the data actually contains.

The structure of utf8 is such that it is actually pretty easy to parse using binary methods (testing, masking and shifting specific bits). ASCII characters are just ASCII characters; every non-ASCII (wide) character is two or more consecutive bytes with the high bit set, and the boundaries between consecutive wide characters are unambiguous: the first byte of a wide character starts with as many 1 bits as there are bytes in the character, and each following byte starts with the bits 10. There's a pretty good explanation of this in the "Unicode Encodings" section of the perlunicode man page. The main unicode.org web site is also an excellent resource.
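As a rough illustration of that bit-level approach, here is a minimal sketch that walks a string of raw bytes and recovers the code points with nothing but masks and shifts, so it does not depend on the Encode module. The sub name utf8_bytes_to_codepoints is made up for this example, and the sketch assumes the input is a plain byte string; it checks lead and continuation bytes but does not reject overlong forms or surrogates.

```perl
use strict;

# Hypothetical helper: turn raw utf8 bytes into a list of code points.
sub utf8_bytes_to_codepoints {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;    # one integer (0..255) per byte
    my @cps;
    my $i = 0;
    while ($i < @b) {
        my $lead = $b[$i++];
        my ($cp, $extra);
        # The number of leading 1 bits in the first byte gives the length.
        if    (($lead & 0x80) == 0x00) { ($cp, $extra) = ($lead,        0) }  # 0xxxxxxx: ASCII
        elsif (($lead & 0xE0) == 0xC0) { ($cp, $extra) = ($lead & 0x1F, 1) }  # 110xxxxx: 2 bytes
        elsif (($lead & 0xF0) == 0xE0) { ($cp, $extra) = ($lead & 0x0F, 2) }  # 1110xxxx: 3 bytes
        elsif (($lead & 0xF8) == 0xF0) { ($cp, $extra) = ($lead & 0x07, 3) }  # 11110xxx: 4 bytes
        else {
            die sprintf "unexpected byte 0x%02X at offset %d\n", $lead, $i - 1;
        }
        for (1 .. $extra) {
            die "truncated sequence at end of input\n" if $i >= @b;
            my $cont = $b[$i++];
            # Every continuation byte must look like 10xxxxxx.
            die sprintf "bad continuation byte 0x%02X at offset %d\n", $cont, $i - 1
                unless ($cont & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($cont & 0x3F);   # append its 6 payload bits
        }
        push @cps, $cp;
    }
    return @cps;
}
```

For example, utf8_bytes_to_codepoints("A\xC3\xA9") should hand back 0x41 and 0xE9 (the letter A and e-acute).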

So, if the tools on hand are insufficient to do "real" character encoding conversions, just do some research on the data to figure out what sorts of wide characters you are getting, and map out a hash table to convert those two- or three-byte patterns to whatever single-byte "equivalent" seems appropriate. If the input is likely to introduce "new" utf8 patterns over time, just come up with a method to flag wide characters that are not yet tabulated in your replacement hash, and have a procedure to do something appropriate with that information (e.g. get someone to figure out what the new character should be mapped to and add it to the replacement hash).
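Something along these lines might do for the hash-table approach. It is only a sketch: the %replace entries, the '?' placeholder and the names downgrade and unmapped_report are all invented for illustration, and the real table has to come from looking at your own data. It assumes the input is a raw byte string and that iso-8859-1 plus plain ASCII punctuation is an acceptable target.

```perl
use strict;

# Hypothetical replacement table: raw utf8 byte sequence => single-byte text.
my %replace = (
    "\xC3\xA9"     => "\xE9",   # U+00E9 e-acute -> iso-8859-1 0xE9
    "\xE2\x80\x98" => "'",      # U+2018 left single quote
    "\xE2\x80\x99" => "'",      # U+2019 right single quote
    "\xE2\x80\x9C" => '"',      # U+201C left double quote
    "\xE2\x80\x9D" => '"',      # U+201D right double quote
    "\xE2\x80\x93" => "-",      # U+2013 en dash
    "\xE2\x80\x94" => "--",     # U+2014 em dash
);

my %unmapped;   # byte sequence => number of times seen without a mapping

sub downgrade {
    my ($bytes) = @_;
    # Match one whole wide character: a lead byte plus its continuation bytes.
    $bytes =~ s{
        ( [\xC2-\xDF][\x80-\xBF]
        | [\xE0-\xEF][\x80-\xBF]{2}
        | [\xF0-\xF4][\x80-\xBF]{3} )
    }{
        my $seq = $1;
        exists $replace{$seq} ? $replace{$seq} : do { $unmapped{$seq}++; '?' }
    }gex;
    return $bytes;
}

# One line per untabulated sequence, as hex bytes plus a count,
# so someone can decide what each should map to.
sub unmapped_report {
    return map {
        sprintf "%s seen %d time(s)",
            join(' ', map { sprintf '%02X', $_ } unpack 'C*', $_),
            $unmapped{$_}
    } sort keys %unmapped;
}
```

Run downgrade over each chunk of input, then print or mail the lines from unmapped_report every so often; whatever shows up there is a candidate for a new entry in %replace.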

OTOH, maybe it's sufficient just to store the utf8 data "as-is" in the database -- that is, don't try to "parse" it with the legacy system -- and have some other, more up-to-date system read from the DB in order to do whatever conversion needs to be done (or just use the data as utf8 text). The DB itself should be neutral about the byte values stored in a "varchar" field -- though you may want to define this field as having the "binary" attribute (cf. mysql docs on "binary text fields").


In reply to Re: utf8 and perl 5.6 by graff
in thread utf8 and perl 5.6 by domm
