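First, sorry if I used unicode terms in a confusing way earlier. For the purposes of your data, ucs2 and utf16 are two names for the same thing (strictly, utf16 extends ucs2 with surrogate pairs for characters beyond U+FFFF, which don't appear here). This is most likely what is being used in your data, not utf-8 (be thankful).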

The data sample looks intriguing, and we can have some fun with that. Since utf16 (ucs2) codes are normally written in hex notation, let's produce that from the (decimal) numeric HTML entity references -- i.e. convert each "&#(\d+);" to a 4-digit hex number. A command-line script will suffice:

    perl -pe 's/\&\#(\d+);/sprintf("%4.4x ",$1)/ge' < your.post

Here's the first part of what we get:

    0a0d 2f00 2a00 2a00 2a00 2a00 2a00 2a00 2000 0900 0d00 0d00 Marina Motchkina 0a0d 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 0900 3000 3400 2f00 3000 3300 2f00 3000 3200 0d00 0d00 0a0d 0900 5300 5100 4c00 2000 6600 6f00 7200 2000 4400 7800 7800 7800 2000 6300 6f00 6d00 7000 6f00 6e00 6500 6e00 7400 7300 0d00 0d00 ******/ ...

This reveals a few things: (1) whatever you did to create those decimal numbers, it inverted the byte order of the original utf16 data -- we should be seeing "0053" instead of "5300". (2) Except for one oft-occurring value, the high byte is always null, which means that all these characters are really just ASCII, with the null byte added to turn them into utf16 (e.g. "2000", which is really "0020", is a "space"); (3) the one exceptional value, "0a0d", is of course the traditional MS-DOS/Win 2-byte line termination, viewed as a 16-bit value (but byte-swapped like the other codes).
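
To check point (2) by eye, here is a minimal sketch that swaps the two bytes of each 4-digit code back and prints the resulting characters. The function name unswap_codes is made up for illustration, and it assumes every code is exactly four hex digits, as in the dump above:

```perl
use strict;
use warnings;

# Undo the apparent byte swap: "5300" -> 0x0053 -> "S".
# Assumption: each code is exactly four hex digits.
sub unswap_codes {
    my ($dump) = @_;
    my $text = '';
    for my $code ($dump =~ /\b([0-9a-f]{4})\b/gi) {
        my ($first, $second) = $code =~ /^(..)(..)$/;
        $text .= chr hex "$second$first";
    }
    return $text;
}

print unswap_codes("5300 5100 4c00 2000 6600 6f00 7200"), "\n";   # SQL for
```

Don't feed it the "0a0d" value: swapped, that becomes the single code point U+0D0A rather than a CR LF pair, which is exactly why that value stands out.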

There are still some mysteries here, like: how is it that there are some standard ASCII (single-byte) characters mixed in with the utf16 stuff, and how would your script be able to handle both types of character data properly? Note that "true" utf16 data would have "000d 000a" as the line termination (using "logical" byte order), not "\r\n".
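
For comparison, this sketch uses the core Encode module to show what genuinely encoded utf16 looks like for a short string ending in CR LF, in both byte orders. The helper name code_units is my own, and the 16-bit values are printed as they would appear when read in file order (big-endian):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Show the 16-bit code units of a string in the given encoding.
sub code_units {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    # "n*" reads big-endian 16-bit values, i.e. the bytes in file order.
    return join ' ', map { sprintf '%04x', $_ } unpack 'n*', $bytes;
}

print code_units('UTF-16BE', "S\r\n"), "\n";   # 0053 000d 000a
print code_units('UTF-16LE', "S\r\n"), "\n";   # 5300 0d00 0a00
```

The second line resembles your sample ("5300"), so one possible reading is that the file is little-endian and something read it as big-endian. Even so, a clean little-endian line ending would be "0d00 0a00", not the "0a0d" that shows up in your data.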

Any chance you could show a hex dump of an original sample file, before the perl script trashed it?
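
If you don't have a hex dump tool handy, a rough sketch like this would do. The default file name original.sql is just a placeholder, and 64 bytes is an arbitrary amount that should be enough to show the byte order and the first line ending:

```perl
use strict;
use warnings;

# Format raw bytes as rows of 16 hex values, each prefixed by its offset.
sub hex_dump {
    my ($bytes) = @_;
    my @rows;
    for (my $offset = 0; $offset < length($bytes); $offset += 16) {
        my $chunk = substr($bytes, $offset, 16);
        push @rows, sprintf('%08x  %s', $offset,
            join(' ', map { sprintf '%02x', $_ } unpack('C*', $chunk)));
    }
    return join "\n", @rows;
}

# Placeholder file name; pass the real one on the command line.
my $file = @ARGV ? $ARGV[0] : 'original.sql';
if (open my $fh, '<:raw', $file) {
    read($fh, my $head, 64);    # raw bytes, no layers, no translation
    print hex_dump($head), "\n";
}
else {
    warn "cannot open $file: $!\n";
}
```

The ":raw" layer matters here: without it, perl on Windows would translate CR LF on input and hide the very bytes we want to see.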

Moving on to your perl code, here are some initial reactions:

    my $text .= concatenate_string_to_insert(); # documentation header
    %text .= "as\n";

Are you sure that $text contains what you want at this point? (You didn't show what the "concatenate_...()" function does.)

    while (defined($line = <OLD>) ) {
        ...
        chomp $line;
        $line .= <OLD>;

This reliance on whatever "$/" happens to be takes me back to the earlier mystery: the input data seems to have utf16 and 8bit ASCII intermixed, but there must be some sort of record separator (not "\r\n") that delineates the two kinds of data in the stream. Find out what creates that delineation, so your script can use it, and read the file in units that contain just one sort of character data at a time.
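
Once you know the real separator, setting "$/" explicitly is straightforward. The sketch below is only an illustration: it assumes, purely as a placeholder, that the records are utf16 little-endian and end in a 4-byte CR LF, and it reads from an in-memory string rather than your file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the real file: two UTF-16LE lines ending in CR LF.
my $raw = join '', map { "$_\0" } split //, "SQL\r\nfor\r\n";

open my $fh, '<:raw', \$raw or die "cannot open in-memory file: $!";

my @lines;
{
    # Hypothetical separator: CR LF as four UTF-16LE bytes.
    local $/ = "\r\0\n\0";
    while (defined(my $record = <$fh>)) {
        chomp $record;                 # strips the 4-byte separator
        push @lines, decode('UTF-16LE', $record);
    }
}
print "$_\n" for @lines;
```

Using "local" keeps the changed "$/" confined to that block, so any other reads in the script still see the default.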

One last warning/challenge: with ASCII data that is converted to utf16 (by interleaving null bytes), and is then mangled by some unknown problem, the problem can be either unintended byte-swapping (as guessed above) OR unintended loss of a single byte at some point in the stream. Good luck.
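
One way to tell the two apart, sketched here under the assumption that the data really is ASCII widened to 16 bits (exactly one null byte per character): look at whether the nulls sit at even or odd byte offsets. A uniform byte swap puts them all on the "wrong" side but keeps them consistent; a lost byte makes them flip sides partway through. The function name parity_flips is my own invention:

```perl
use strict;
use warnings;

# Return the byte offsets where the null bytes switch between even and
# odd positions. Assumption: ASCII text with one null byte per character.
sub parity_flips {
    my ($bytes) = @_;
    my @flips;
    my $last_parity;
    while ($bytes =~ /\0/g) {
        my $offset = pos($bytes) - 1;
        my $parity = $offset % 2;
        push @flips, $offset
            if defined $last_parity && $parity != $last_parity;
        $last_parity = $parity;
    }
    return @flips;
}

# "S\0Q\0L\0" with the "Q" byte dropped: the nulls move from odd to even.
my @flips = parity_flips("S\0\0L\0");
print "null position flips at offset(s): @flips\n";   # 2
```

No flips at all, with the nulls consistently on the unexpected side, would point to byte-swapping; a flip somewhere in the middle would point to a dropped byte near that offset.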


In reply to Re: Re: Problems with Unicode files generated with SQL Server by graff
in thread Problems with Unicode files generated with SQL Server by Cuchulain
