in reply to Problems with Unicode files generated with SQL Server

Thanks guys. I'm using MS SQL Server 2000 Enterprise, but I'm not connecting with DBI - I'm using the MS DMO library.
I'll try a few of those suggestions. I guess the question is somewhat academic, because I can script the procedures to ANSI - but since they were already stored in SourceSafe as Unicode, I thought a Perl solution that handled Unicode would be better (and save the hassle of organising for all the procedures to be checked in again, etc.).
On the version of Unicode being used: I do not know. Books Online just says, under International (Unicode): "Select this option if the script uses special international characters that are supported only in the Unicode font."
Here's the Perl code:
sub insert_the_string {
    my $old_file = shift;
    my $new_file = $old_file . "sql2";
    my $line     = '';
    open(OLD, "< $old_file");
    open(NEW, "> $new_file");
    my $text .= concatenate_string_to_insert();    # documentation header
    $text .= "as\n";
    my $as_RE = "AS";    # replace the first occurrence of 'AS' after the stored procedure definition
    select(NEW);         # new default for print
    while (defined($line = <OLD>)) {
        # chomp $line;
        if ($line =~ s/$as_RE/$text/) {
            chomp $line;
            $line .= <OLD>;
            redo unless eof(OLD);
        }
        print NEW $line;
    }
    close(OLD);
    close(NEW);
    rename($old_file, "$old_file.orig");
    rename($new_file, $old_file);
}
When I say garbage I mean something like the following...
/****** Object: Stored Procedure dbo.a_sp_D_Add Script Date: 02/04 +/2002 17:00:18 ******/ &#2573;&#12032;&#10752;&#10752;&#10752;&#10752;&#10752;&#10752;&#8192; +&#2304;&#3328;&#3328; Marina Motchkina &#2573;&#8192;&#8192;&#8192;&#8192;&#8192;&#8192;&#8192;&#8192;&#8192; +&#8192;&#2304;&#12288;&#13312;&#12032;&#12288;&#13056;&#12032;&#12288 +;&#12800;&#3328;&#3328; &#2573;&#2304;&#21248;&#20736;&#19456;&#8192;&#26112;&#28416;&#29184;& +#8192;&#17408;&#30720;&#30720;&#30720;&#8192;&#25344;&#28416;&#27904; +&#28672;&#28416;&#28160;&#25856;&#28160;&#29696;&#29440;&#3328;&#3328 +; ******/ &#2573;&#3328;&#3328; CREATE PROCEDURE a_sp_D_Add &#2573;&#2304;&#16384;&#21504;&#24832;&#25088;&#27648;&#25856;&#19968; +&#24832;&#27904;&#25856;&#8192;&#30208;&#24832;&#29184;&#25344;&#2662 +4;&#24832;&#29184;&#10240;&#13056;&#12288;&#10496;&#3328;&#3328; &#2573;&#3328;&#3328; as &#2573;&#2304;&#25600;&#25856;&#25344;&#27648;&#24832;&#29184;&#25856; +&#8192;&#16384;&#29440;&#29696;&#29184;&#21248;&#20736;&#19456;&#8192 +;&#30208;&#24832;&#29184;&#25344;&#26624;&#24832;&#29184;&#10240;&#13 +312;&#12288;&#12288;&#12288;&#10496;&#3328;&#3328; select @strSQL='SELECT ' + @TableName+'.* '+ ' FROM ' + @TableName + +' WITH (NOLOCK) WHERE '+ @TableName+'.'+@TableName+'Counter =0' &#2573;&#3328;&#3328; &#2573;&#3328;&#3328; &#2573;&#3328;&#3328; GO &#2573;&#21248;&#17664;&#21504;&#8192;&#20736;&#21760;&#20224;&#21504; +&#17664;&#17408;&#24320;&#18688;&#17408;&#17664;&#19968;&#21504;&#186 +88;&#17920;&#18688;&#17664;&#20992;&#8192;&#20224;&#17920;&#17920;&#8 +192;&#3328;&#3328; GO &#2573;&#21248;&#17664;&#21504;&#8192;&#16640;&#19968;&#21248;&#18688; +&#24320;&#19968;&#21760;&#19456;&#19456;&#21248;&#8192;&#20224;&#1996 +8;&#8192;&#3328;&#3328; GO &#2573;&#3328;&#3328;

graff - you're right about posting Unicode to the site - it looked nicer than this, but still garbage.
Thanks anyway - I'm off to find out about 'ucs2' & 'utf16' & install the Unicode::String module.
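Since the plan is to read UTF-16 (ucs2) data, here is a minimal sketch of decoding and re-encoding it with the Encode module (core in Perl 5.8 and later; Unicode::String was the usual choice before that). The sample bytes are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# SQL Server typically scripts files as UTF-16 little-endian.  Decoding
# the raw bytes up front means later regex work sees real characters,
# not interleaved NUL bytes.
my $raw  = "C\0R\0E\0A\0T\0E\0";       # UTF-16LE bytes for "CREATE"
my $text = decode('UTF-16LE', $raw);   # Perl's internal string "CREATE"
my $back = encode('UTF-16LE', $text);  # round-trip back to raw bytes
```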

Replies are listed 'Best First'.
Re: Re: Problems with Unicode files generated with SQL Server
by graff (Chancellor) on May 02, 2002 at 04:17 UTC
    First, sorry if I used unicode terms in a confusing way earlier; ucs2 and utf16 are, for practical purposes here, two names for the same encoding (UTF-16 extends UCS-2 with surrogate pairs). This is most likely what is being used in your data, not utf-8 (be thankful).

    The data sample looks intriguing, and we can have some fun with it. Since utf16 (ucs2) codes are normally written in hex notation, let's produce that from the (decimal) numeric HTML entity references - i.e. convert "&#(\d+);" into utf16 notation (4-digit hex numbers). A command-line script will suffice:

    perl -pe 's/\&\#(\d+);/sprintf("%4.4x ",$1)/ge' < your.post
    Here's the first part of what we get:
    0a0d 2f00 2a00 2a00 2a00 2a00 2a00 2a00 2000 0900 0d00 0d00 Marina Motchkina
    0a0d 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 0900 3000 3400 2f00 3000 3300 2f00 3000 3200 0d00 0d00
    0a0d 0900 5300 5100 4c00 2000 6600 6f00 7200 2000 4400 7800 7800 7800 2000 6300 6f00 6d00 7000 6f00 6e00 6500 6e00 7400 7300 0d00 0d00 ******/ ...
    This reveals a few things: (1) whatever you did to create those decimal numbers inverted the byte order of the original utf16 data; we should be seeing "0053" instead of "5300". (2) Except for one oft-occurring value, the high byte is always null, which means that all these characters are really just ASCII, with a null byte added to turn them into utf16 (e.g. "2000", which is really "0020", is a space). (3) The one exceptional value, "0a0d", is of course the traditional MS-DOS/Windows 2-byte line termination, viewed as a 16-bit value (but byte-swapped like the other codes).
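    If the byte-swapping guess above is right, the damage is mechanical and reversible: swapping the two bytes of every 16-bit code unit recovers the logical byte order. A sketch using pack/unpack (the sample bytes are hypothetical, chosen to match the "5300" vs "0053" observation):

```perl
use strict;
use warnings;

# Repair byte-swapped utf16: reading 16-bit values as little-endian ('v')
# and writing them back big-endian ('n') swaps the two bytes of every
# code unit, turning e.g. "5300" into "0053" ('S').
sub swap_utf16_bytes {
    my ($buf) = @_;
    return pack('n*', unpack('v*', $buf));
}

my $mangled = "\x53\x00\x51\x00\x4c\x00";   # "SQL", byte-swapped
my $fixed   = swap_utf16_bytes($mangled);   # "\x00\x53\x00\x51\x00\x4c" (UTF-16BE)
```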

    There are still some mysteries here, like: how is it that there are some standard ASCII (single-byte) characters mixed in with the utf16 stuff, and how would your script be able to handle both types of character data properly? Note that "true" utf16 data would have "000d 000a" as the line termination (using "logical" byte order), not "\r\n".

    Any chance you could show a hex dump of an original sample file, before the perl script trashed it?

    Moving on to your perl code, here are some initial reactions:

    my $text .= concatenate_string_to_insert(); # documentation header $text .= "as\n";
    Are you sure that $text contains what you want at this point? (You didn't show what the "concatenate_...()" function does.)
    while (defined($line = <OLD>) ) { ... chomp $line; $line .= <OLD>;
    This reliance on whatever "$/" happens to be takes me back to the earlier mystery: the input data seems to have utf16 and 8bit ASCII intermixed, but there must be some sort of record separator (not "\r\n") that delineates the two kinds of data in the stream. Find out what creates that delineation, so your script can use it, and read the file in units that contain just one sort of character data at a time.
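    Once the real separator is known, reading in whole records is just a matter of localizing "$/". A sketch - the separator "\r\nGO\r\n" here is only a guess at what delimits the batches in a generated SQL script:

```perl
use strict;
use warnings;

# Read a SQL script one batch at a time instead of one "\n"-terminated
# line at a time.  The value of $/ below is a hypothetical separator;
# substitute whatever actually delineates the records in your data.
sub read_batches {
    my ($file) = @_;
    local $/ = "\r\nGO\r\n";    # hypothetical record separator
    open(my $fh, '<', $file) or die "can't open $file: $!";
    my @batches = <$fh>;
    close($fh);
    return @batches;
}
```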

    One last warning/challenge: with ASCII data that is converted to utf16 (by interleaving null bytes), and is then mangled by some unknown problem, the problem can be either unintended byte-swapping (as guessed above) OR unintended loss of a single byte at some point in the stream. Good luck.