Problems with Unicode files generated with SQL Server

Cuchulain has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Problems with Unicode files generated with SQL Server by wardk (Deacon) on Apr 30, 2002 at 15:00 UTC
I am assuming SQL Server is MSSQL. Just checked my DBI.pm book under DBD:Sybase. on the connect, you can pass "charset", this will auto-convert the character set for you. (unfortunately it doesn't list the available charsets). If you are using FreeTDS libs, I suspect this would still be valid. Not sure how helpful this is, but perhaps it's a start. good luck!!	[reply]
Re: Problems with Unicode files generated with SQL Server by strat (Canon) on Apr 30, 2002 at 13:50 UTC
How do you access the SQL-Server (i guess you're talking about MsSql) and which Version of MsSql is it? Best regards, perl -le "s==F=e=>y~\martinF~stronat~=>s~[^\w]~~g=>chop,print"	[reply]
Re: Problems with Unicode files generated with SQL Server by graff (Chancellor) on Apr 30, 2002 at 20:27 UTC
Look at Unicode::String, to see if this gives you anything you can use. If you are using the term "unicode" to refer to "ucs2" or "utf16" (i.e. full 16-bit unicode encoding, as opposed to utf8, which is variable-width encoding), then one thing that might be screwing you up is byte order. Does your existing utf16 use network (big-endian) or wintel byte order? (Do the unicode strings start with byte-order marks?) What do you mean by "garbage"? Is it that your perl script cannot display the unicode strings in any intelligible manner, or is the data getting "updated" or "rearranged" or otherwise "filtered" in some inappropriate way? Do you want to keep everything in unicode, or would you rather convert to ANSI? If you'd like folks to send you answers rather than questions, give us some snippets of code, and some input and output that illustrate the problem. UPDATE: Forgive me, that last bit was unfair -- I haven't tried posting unicode data to this site yet, and it may not be all that easy to do. But a sample of code that illustrates what you are trying to do would be very helpful, as well as some additional detail about what the source data looks like (e.g. give us a short string of byte pairs in hex notation), and a better idea of what your script is producing, and what you actually want it to produce.	[reply]
Re: Problems with Unicode files generated with SQL Server by Cuchulain (Initiate) on May 01, 2002 at 09:12 UTC
Thanks guys, I'm using MS SQL Server 2000 Enterprise but I'm not connecting with the DBI - but with the MS DMO Library. I'll try a few of those suggestions, I guess the question is kind of acedemic because I can script the procedures to ANSI - but as they were already stored in source-safe as unicode - i thought that a perl solution that handled unicode would be better (save the hassle of organising for all procedures to be checked in, etc) On the version of unicode being used: I do not know - books on-line just say International(Unicode) "Select this option if the script uses special international characters that are supported only in the Unicode font." Here's the PERL code sub insert_the_string { my $old_file = shift; my $new_file = $old_file . "sql2"; my $line = ''; open(OLD, "< $old_file"); open(NEW, "> $new_file"); my $text .= concatenate_string_to_insert(); # documentation header $text .= "as\n"; my $as_RE = "AS"; # replace the first occurance of 'AS' after the + stored procedure definition select(NEW); # new default for print while (defined($line = <OLD>) ){ # chomp $line; if ($line =~ s/$as_RE/$text/) { chomp $line; $line .= <OLD>; redo unless eof(OLD); } print NEW $line; } close(OLD); close(NEW); rename($old_file, "$old_file.orig"); rename($new_file, $old_file); } [download] When I say garbage I mean something like the following... /**** Object: Stored Procedure dbo.a_sp_D_Add Script Date: 02/04 +/2002 17:00:18 **/ ਍⼀⨀⨀⨀⨀⨀⨀  +ऀഀഀ Marina Motchkina ਍          + ऀ　㐀⼀　㌀⼀&#12288 +;㈀ഀഀ ਍ऀ匀儀䰀 昀漀爀& +#8192;䐀砀砀砀 挀漀洀 +瀀漀渀攀渀琀猀ഀ&#3328 +; ***/ ਍ഀഀ CREATE PROCEDURE a_sp_D_Add ਍ऀ䀀吀愀戀氀攀一 +愀洀攀 瘀愀爀挀&#2662 +4;愀爀⠀㌀　⤀ഀഀ ਍ഀഀ as ਍ऀ搀攀挀氀愀爀攀 + 䀀猀琀爀匀儀䰀&#8192 +;瘀愀爀挀栀愀爀⠀&#13 +312;　　　⤀ഀഀ select @strSQL='SELECT ' + @TableName+'. '+ ' FROM ' + @TableName + +' WITH (NOLOCK) WHERE '+ @TableName+'.'+@TableName+'Counter =0' ਍ഀഀ ਍ഀഀ ਍ഀഀ GO ਍匀䔀吀 儀唀伀吀 +䔀䐀开䤀䐀䔀一吀&#186 +88;䘀䤀䔀刀 伀䘀䘀&#8 +192;ഀഀ GO ਍匀䔀吀 䄀一匀䤀 +开一唀䰀䰀匀 伀&#1996 +8; ഀഀ GO ਍ഀഀ [download] graff - you're right about posting unicode to the site - it looked nicer than this - but still garbage. Thanks anyway - I'm off to find out about 'ucs2' & 'utf16' & install the Unicode::String module.	[reply] [d/l] [select]
Re: Re: Problems with Unicode files generated with SQL Server by graff (Chancellor) on May 02, 2002 at 04:17 UTC
First, sorry if I used unicode terms in a confusing way earlier; ucs2 == utf16 (two names for same concept); this is most likely what is being used in your data, not utf-8 (be thankful). The data sample looks intriguing, and we can have some fun with that. Since utf16 (ucs2) codes are normally referred to in hex notation, let's produce this from the (decimal) numeric html entity references -- ie. convert `"&#(\d+);"` to the utf16 notation (4-digit hex numbers). A command-line script will suffice: `perl -pe 's/\&\#(\d+);/sprintf("%4.4x ",$1)/ge' < your.post` [download] Here's the first part of what we get: `0a0d 2f00 2a00 2a00 2a00 2a00 2a00 2a00 2000 0900 0d00 0d00 Marina Motchkina 0a0d 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 0900 3000 3400 +2f00 3000 3300 2f00 3000 3200 0d00 0d00 0a0d 0900 5300 5100 4c00 2000 6600 6f00 7200 2000 4400 7800 7800 7800 +2000 6300 6f00 6d00 7000 6f00 6e00 6500 6e00 7400 7300 0d00 0d00 ******/ ...` [download] This reveals a few things: (1) whatever you did to create those decimal numbers, it inverted the byte order of the original utf16 data -- we should be seeing "0053" instead of "5300". (2) Except for one oft-occurring value, the high byte is always null, which means that all these characters are really just ASCII, with the null byte added to turn them into utf16 (e.g. "2000", which is really "0020", is a "space"); (3) the one exceptional value, "0a0d", is of course the traditional MS-DOS/Win 2-byte line termination, viewed as a 16-bit value (but byte-swapped like the other codes). There are still some mysteries here, like: how is it that there are some standard ASCII (single-byte) characters mixed in with the utf16 stuff, and how would your script be able to handle both types of character data properly? Note that "true" utf16 data would have "000d 000a" as the line termination (using "logical" byte order), not "\r\n". Any chance you could show a hex dump of an original sample file, before the perl script trashed it? Moving on to your perl code, here are some initial reactions: `my $text .= concatenate_string_to_insert(); # documentation header %text .= "as\n";` [download] Are you sure that $text contains what you want at this point? (You didn't show what the "concatenate_...()" function does.) `while (defined($line = <OLD>) ) { ... chomp $line; $line .= <OLD>;` [download] This reliance on whatever "$/" happens to be takes me back to the earlier mystery: the input data seems to have utf16 and 8bit ASCII intermixed, but there must be some sort of record separator (not "\r\n") that delineates the two kinds of data in the stream. Find out what creates that delineation, so your script can use it, and read the file in units that contain just one sort of character data at a time. One last warning/challenge: with ASCII data that is converted to utf16 (by interleaving null bytes), and is then mangled by some unknown problem, the problem can be either unintended byte-swapping (as guessed above) OR unintended loss of a single byte at some point in the stream. Good luck.	[reply] [d/l] [select]