The data sample looks intriguing, and we can have some fun with that. Since utf16 (ucs2) codes are normally referred to in hex notation, let's produce this from the (decimal) numeric HTML entity references -- i.e. convert "&#(\d+);" to the utf16 notation (4-digit hex numbers). A command-line script will suffice:
    perl -pe 's/\&\#(\d+);/sprintf("%4.4x ",$1)/ge' < your.post

Here's the first part of what we get:
    0a0d 2f00 2a00 2a00 2a00 2a00 2a00 2a00 2000 0900 0d00 0d00 Marina Motchkina
    0a0d 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 0900 3000 3400 2f00 3000 3300 2f00 3000 3200 0d00 0d00
    0a0d 0900 5300 5100 4c00 2000 6600 6f00 7200 2000 4400 7800 7800 7800 2000 6300 6f00 6d00 7000 6f00 6e00 6500 6e00 7400 7300 0d00 0d00 ******/
    ...

This reveals a few things: (1) whatever you did to create those decimal numbers, it inverted the byte order of the original utf16 data -- we should be seeing "0053" instead of "5300". (2) Except for one oft-occurring value, the high byte is always null, which means that all these characters are really just ASCII, with the null byte added to turn them into utf16 (e.g. "2000", which is really "0020", is a "space"); (3) the one exceptional value, "0a0d", is of course the traditional MS-DOS/Win 2-byte line termination, viewed as a 16-bit value (but byte-swapped like the other codes).
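To check the byte-swapping guess in point (1), here is a minimal sketch that swaps the two bytes of a few codes copied from the dump and prints the resulting characters. The helper name swap16 is mine, not anything from your script, and the sample is just the first seven codes after the tab in the third dump line.

```perl
use strict;
use warnings;

# Seven codes copied from the third line of the dump above.
my @codes = qw(5300 5100 4c00 2000 6600 6f00 7200);

# Hypothetical helper: swap the two bytes of one 16-bit value given as hex.
sub swap16 {
    my ($hex) = @_;
    my $n = hex $hex;
    return (($n & 0xff) << 8) | ($n >> 8);
}

# After swapping, each value is a plain ASCII code point.
my $decoded = join '', map { chr swap16($_) } @codes;
print "$decoded\n";    # prints "SQL for"
```

If the swapped values come out as readable ASCII like this, the byte-order explanation is at least consistent with the data.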
There are still some mysteries here, like: how is it that there are some standard ASCII (single-byte) characters mixed in with the utf16 stuff, and how would your script be able to handle both types of character data properly? Note that "true" utf16 data would have "000d 000a" as the line termination (using "logical" byte order), not "\r\n".
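For reference, this is roughly what a "true" utf16 line termination looks like at the byte level. It is only a sketch using the core Encode module (so it assumes perl 5.8 or later); it says nothing about what SQL Server actually wrote into your file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode CR LF as UTF-16 in both byte orders and show the bytes as hex.
my $be = unpack 'H*', encode('UTF-16BE', "\r\n");
my $le = unpack 'H*', encode('UTF-16LE', "\r\n");

print "big-endian:    $be\n";    # 000d000a
print "little-endian: $le\n";    # 0d000a00
```

Either way it takes four bytes, so a two-byte "0d 0a" sitting in the middle of 16-bit data is a sign that something 8-bit got mixed into the stream.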
Any chance you could show a hex dump of an original sample file, before the perl script trashed it?
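If you don't have a hex dump tool handy, a few lines of perl will do. This is just a sketch: the hexdump function and the 256-byte limit are my own choices, and the file name is whatever you pass on the command line.

```perl
use strict;
use warnings;

# Format a byte string as rows of up to 16 two-digit hex values,
# each row prefixed with its offset.
sub hexdump {
    my ($bytes) = @_;
    my @rows;
    for (my $off = 0; $off < length $bytes; $off += 16) {
        my $chunk = substr $bytes, $off, 16;
        my $hex   = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $chunk;
        push @rows, sprintf '%08x  %s', $off, $hex;
    }
    return join "\n", @rows;
}

# Dump the first 256 bytes of the file named on the command line, if any.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh;                 # raw bytes, no newline translation
    read $fh, my $buf, 256;
    close $fh;
    print hexdump($buf), "\n";
}
```

The binmode call matters on Windows, where the default text mode would otherwise alter the "\r\n" bytes we are trying to look at.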
Moving on to your perl code, here are some initial reactions:
    my $text .= concatenate_string_to_insert(); # documentation header
    %text .= "as\n";

Are you sure that $text contains what you want at this point? (You didn't show what the "concatenate_...()" function does.)
    while (defined($line = <OLD>) ) {
        ...
        chomp $line;
        $line .= <OLD>;

This reliance on whatever "$/" happens to be takes me back to the earlier mystery: the input data seems to have utf16 and 8bit ASCII intermixed, but there must be some sort of record separator (not "\r\n") that delineates the two kinds of data in the stream. Find out what creates that delineation, so your script can use it, and read the file in units that contain just one sort of character data at a time.
One last warning/challenge: with ASCII data that is converted to utf16 (by interleaving null bytes), and is then mangled by some unknown problem, the problem can be either unintended byte-swapping (as guessed above) OR unintended loss of a single byte at some point in the stream. Good luck.
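To see why those two failure modes are hard to tell apart, here is a small sketch. It assumes the original data was little-endian utf16 and uses "SQL" as a stand-in sample; both assumptions are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode('UTF-16LE', 'SQL');    # 53 00 51 00 4c 00

# Case 1: intact stream, read with the wrong byte order (big-endian "n").
my @swapped = map { sprintf '%04x', $_ } unpack 'n*', $bytes;

# Case 2: first byte lost, rest read with the right byte order (little-endian "v").
my @shifted = map { sprintf '%04x', $_ } unpack 'v*', substr($bytes, 1);

print "wrong byte order: @swapped\n";    # 5300 5100 4c00
print "one byte lost:    @shifted\n";    # 5100 4c00
```

Both cases produce the same "xx00" pattern; the only giveaway is that the lost-byte case is missing a character and leaves one odd byte dangling at the end.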
In reply to Re: Re: Problems with Unicode files generated with SQL Server
by graff
in thread Problems with Unicode files generated with SQL Server
by Cuchulain