Cuchulain has asked for the wisdom of the Perl Monks concerning the following question:
Re: Problems with Unicode files generated with SQL Server
by wardk (Deacon) on Apr 30, 2002 at 15:00 UTC

If you are using FreeTDS libs, I suspect this would still be valid. Not sure how helpful this is, but perhaps it's a start. Good luck!!
Re: Problems with Unicode files generated with SQL Server
by strat (Canon) on Apr 30, 2002 at 13:50 UTC

Best regards,
Re: Problems with Unicode files generated with SQL Server
by graff (Chancellor) on Apr 30, 2002 at 20:27 UTC

If you are using the term "unicode" to refer to "ucs2" or "utf16" (i.e. full 16-bit unicode encoding, as opposed to utf8, which is a variable-width encoding), then one thing that might be screwing you up is byte order. Does your existing utf16 use network (big-endian) or wintel (little-endian) byte order? (Do the unicode strings start with byte-order marks?)

What do you mean by "garbage"? Is it that your perl script cannot display the unicode strings in any intelligible manner, or is the data getting "updated" or "rearranged" or otherwise "filtered" in some inappropriate way? Do you want to keep everything in unicode, or would you rather convert to ANSI?

If you'd like folks to send you answers rather than questions, give us some snippets of code, and some input and output that illustrate the problem.

UPDATE: Forgive me, that last bit was unfair -- I haven't tried posting unicode data to this site yet, and it may not be all that easy to do. But a sample of code that illustrates what you are trying to do would be very helpful, as well as some additional detail about what the source data looks like (e.g. give us a short string of byte pairs in hex notation), and a better idea of what your script is producing, and what you actually want it to produce.
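A quick way to answer the byte-order question above is to peek at the file's first two bytes. This is a minimal sketch, not code from the thread; the `bom_of` helper name is mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report which UTF-16 byte order, if any,
# a file's leading byte-order mark (BOM) indicates.
sub bom_of {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $bom = '';
    read $fh, $bom, 2;
    close $fh;
    return 'UTF-16LE (wintel)'  if $bom eq "\xFF\xFE";
    return 'UTF-16BE (network)' if $bom eq "\xFE\xFF";
    return 'no BOM';
}
```

If the answer is "no BOM", a hex dump of the first few bytes is the next stop, since byte order then has to be inferred from the data itself.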
Re: Problems with Unicode files generated with SQL Server
by Cuchulain (Initiate) on May 01, 2002 at 09:12 UTC
I'll try a few of those suggestions. I guess the question is kind of academic, because I can script the procedures to ANSI -- but as they were already stored in SourceSafe as unicode, I thought that a Perl solution that handled unicode would be better (save the hassle of organising for all procedures to be checked in, etc). On the version of unicode being used: I do not know -- Books Online just says International (Unicode): "Select this option if the script uses special international characters that are supported only in the Unicode font." Here's the Perl code. When I say garbage I mean something like the following...
graff -- you're right about posting unicode to the site -- it looked nicer than this, but still garbage. Thanks anyway -- I'm off to find out about 'ucs2' and 'utf16' and install the Unicode::String module.
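The poster's actual script was attached as a download link and didn't survive this copy of the thread. For the task being discussed -- reading a SQL Server-generated UTF-16LE scripting file into Perl text -- a minimal sketch might look like this (assumes perl 5.8+ with PerlIO encoding layers; the `read_utf16le` name is illustrative, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: slurp a UTF-16LE file (the flavor SQL Server's
# "International (Unicode)" scripting option writes) into a
# Perl character string, dropping the leading BOM if present.
sub read_utf16le {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-16LE)', $path
        or die "can't open $path: $!";
    local $/;                  # slurp the whole file
    my $text = <$in>;
    close $in;
    $text =~ s/^\x{FEFF}//;    # the LE layer does not strip the BOM
    return $text;
}
```

Once the data is in as characters, it can be written back out through a `>:encoding(...)` layer in whatever encoding the downstream tools expect.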
by graff (Chancellor) on May 02, 2002 at 04:17 UTC
The data sample looks intriguing, and we can have some fun with that. Since utf16 (ucs2) codes are normally referred to in hex notation, let's produce this from the (decimal) numeric HTML entity references -- i.e. convert "&#(\d+);" to the utf16 notation (4-digit hex numbers). A command-line script will suffice. Here's the first part of what we get:

This reveals a few things:

(1) Whatever you did to create those decimal numbers, it inverted the byte order of the original utf16 data -- we should be seeing "0053" instead of "5300".

(2) Except for one oft-occurring value, the high byte is always null, which means that all these characters are really just ASCII, with a null byte added to turn them into utf16 (e.g. "2000", which is really "0020", is a "space").

(3) The one exceptional value, "0a0d", is of course the traditional MS-DOS/Win 2-byte line termination, viewed as a 16-bit value (but byte-swapped like the other codes).

There are still some mysteries here, like: how is it that there are some standard ASCII (single-byte) characters mixed in with the utf16 stuff, and how would your script be able to handle both types of character data properly? Note that "true" utf16 data would have "000d 000a" as the line termination (using "logical" byte order), not "\r\n". Any chance you could show a hex dump of an original sample file, before the perl script trashed it?

Moving on to your perl code, here are some initial reactions: are you sure that $text contains what you want at this point? (You didn't show what the "concatenate_...()" function does.) This reliance on whatever "$/" happens to be takes me back to the earlier mystery: the input data seems to have utf16 and 8-bit ASCII intermixed, but there must be some sort of record separator (not "\r\n") that delineates the two kinds of data in the stream. Find out what creates that delineation, so your script can use it, and read the file in units that contain just one sort of character data at a time.
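The command-line script and its output were download links in the original thread and are not reproduced here. A conversion in the same spirit as the one described -- decimal numeric entities to 4-digit hex code units -- could be sketched like this (function name mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the decimal values out of numeric entity references like
# "&#21248;" and render each one as a 4-digit hex UTF-16 code unit.
sub entities_to_hex {
    my ($html) = @_;
    my @codes = $html =~ /&#(\d+);/g;
    return join ' ', map { sprintf '%04x', $_ } @codes;
}
```

Run over the posted garbage, output like "5300 5100 4c00 ..." instead of "0053 0051 004c ..." is what exposes the byte swap described above.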
One last warning/challenge: with ASCII data that is converted to utf16 (by interleaving null bytes) and then mangled by some unknown problem, the problem can be either unintended byte-swapping (as guessed above) OR the unintended loss of a single byte at some point in the stream. Good luck.
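Both failure modes can be handled in a few lines. This sketch (helper name mine, not from the thread) undoes a pairwise byte swap on a raw UTF-16 byte string and refuses data with an odd byte count, which is the telltale symptom of a dropped byte:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Undo an unintended pairwise byte swap in a raw UTF-16 byte string.
# An odd byte count means a byte was lost upstream -- that would
# silently shift every following code unit, so die rather than guess.
sub swap_byte_pairs {
    my ($bytes) = @_;
    die "odd byte count: a byte was probably dropped upstream\n"
        if length($bytes) % 2;
    $bytes =~ s/(.)(.)/$2$1/gs;   # /s so "." also matches \x0a
    return $bytes;
}
```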