Given that you are stuck with MS Windows, and given that the non-ASCII data really is single-byte-per-character (and needs to be kept that way), you should be writing your fixed-width file using one of the "legacy" MS character encodings.

Which one you need depends on which language is being used in the data: what are the non-ASCII characters? If they are just the basic Latin-1 set for "Western" languages (French, German, Spanish) then you probably want CP1252. Or it could just be that the non-ASCII characters are those nefarious "smart quotes" and other bothersome punctuation marks being foisted on us all, in which case any of the CP125x code pages (CP1250 through CP1258) will do.

You can use binmode on your output file handle to impose the conversion from perl-internal utf8 to cp1252 (or whatever); this way, as long as every character in the data exists in the chosen code page, no information is lost, each character is written as exactly one byte, and the fixed-width lines get the right byte count:

binmode OUTFH, ":encoding(cp1252)";

(Someday, the boss might get the idea that the downstream process that needs the fixed-width file as input ought to accommodate utf8 data, and at that point you'll need to take out the binmode line, or maybe just change ":encoding(cp1252)" to ":utf8".)
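
To make the byte-count point concrete, here is a minimal sketch. The sample text, the 20-column width, and the in-memory file handle standing in for the real output file are all assumptions made up for illustration; only the binmode call comes from the advice above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample field: UTF-8 bytes as they might arrive from the XML,
# containing an e-acute and a pair of "smart quotes".
my $utf8_bytes = "Caf\xC3\xA9 \xE2\x80\x9Cquoted\xE2\x80\x9D";
my $text       = decode('UTF-8', $utf8_bytes);   # perl-internal character string

# Pad to the (assumed) field width in characters, before any encoding happens.
my $record = sprintf('%-20s', $text);

# An in-memory handle stands in for the real output file.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
binmode $out, ':encoding(cp1252)';               # one byte per character on output
print {$out} $record;
close($out) or die "Cannot close: $!";

my $cp1252_len = length($buffer);                    # bytes actually written
my $utf8_len   = length(encode('UTF-8', $record));   # bytes if written as UTF-8

printf "characters: %d, cp1252 bytes: %d, utf8 bytes: %d\n",
    length($record), $cp1252_len, $utf8_len;
# prints "characters: 20, cp1252 bytes: 20, utf8 bytes: 25"
```

With the cp1252 layer the record occupies exactly as many bytes as it has characters, whereas the same record written as UTF-8 would be five bytes too long for a 20-byte field.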


In reply to Re: UTF-8 Decoding, Wide Characters, and XML::Twig by graff
in thread UTF-8 Decoding, Wide Characters, and XML::Twig by thedoe
