Problem with larger files (and s/)

Cloudster has asked for the wisdom of the Perl Monks concerning the following question:

A boon I crave, oh wise monks! Be merciful upon this newb supplicant!

I wrote my first Perl program a couple of weeks ago and am quite happy with it (now I just have to figure out how to link it in to a web page and my PHP chat program, but I’ll figure that out eventually.) My job is database administration and development. Yesterday I sat down to do something that I do several times a week, reformat a script generated by SQL Server 2000’s Enterprise Manager when I realized that this was another opportunity to learn more of Perl and save quite a bit of time in the future, so I set forth and have fallen flat on my face. (I wrote most of this post this morning, I’ve since made most of it work)

Here’s what a SQL Server script looks like, this would be saved in a file:

CREATE TABLE [dbo].[CustomerBalances] (
    [arbh_acct] [int] NOT NULL ,
    [arbh_prop_code] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_
+AS NOT NULL ,
    [arbh_unpd_bal] [decimal](10, 2) NOT NULL ,
    [rCount] [tinyint] NOT NULL ,
    [arbh_ar_cat] [tinyint] NULL 
) ON [PRIMARY]
GO
[download]

What I want to do is remove the square brackets, remove the COLLATE blah blah blah, insert three tabs in front of the data type, and insert a couple of tabs in front of the null option, etcetera. The tabs wouldn’t align perfectly in the end because of variance in the field name length, but that’s ok, TextPad is excellent for making that easy.

I’ve got most of that working. I have two final problems, both of which are beyond my skill. First, I can’t get the NULL/NOT NULL to parse correctly. I end up with NOT\t\tNULL.

Second, and most critical, is file size. If I’m dealing with a dinky little file like the above (300 bytes), it runs just fine. But if I feed it a 9k script file with 200 lines of code, I get a really weird result. The output file displays like it has an additional space between every character, there are lots of non-ASCII values in the file, and nothing has matched and been reformatted.

I have no idea what’s going on with my program. My intent was that a line would be read, tested to see if it contains a space and a data type name, and if it did, replace that space with three tabs. My result file is very badly mangled, and no longer ASCII. Obviously I’m missing something.

This is what the first part of the file looks like when it’s done processing:

CREATE TABLE 搀戀漀 .䌀甀猀琀漀洀攀爀䈀愀氀愀渀挀攀猀 ( ਍ऀ arbh_acct int 一伀吀一唀䰀䰀 Ⰰഀഀ 愀爀戀栀开瀀爀漀瀀开挀漀搀攀瘀愀爀挀栀愀爀 (30) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL , ਍ऀ arbh_unpd_bal decimal⠀㄀　Ⰰ ㈀⤀ 一伀吀一唀䰀䰀 Ⰰഀഀ 爀䌀漀甀渀琀琀椀渀礀椀渀琀 NOT NULL , ਍ऀ arbh_ar_cat tinyint 一伀吀一唀䰀䰀 ഀഀ ) ON 倀刀䤀䴀䄀刀夀 ਍䜀伀ഀഀ ਍䌀刀䔀䄀吀䔀吀䄀䈀䰀䔀 dbo⸀ EmailText ⠀ഀഀ 䴀攀猀猀愀最攀䈀漀搀礀瘀愀爀挀栀愀爀 (255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL , ਍ऀ InsertSequence bigint 䤀䐀䔀一吀䤀吀夀 ⠀㄀Ⰰ ㄀⤀ 一伀吀一唀䰀䰀 ഀഀ ) ON 倀刀䤀䴀䄀刀夀 ਍䜀伀ഀഀ ਍䌀刀䔀䄀吀䔀吀䄀䈀䰀䔀 dbo⸀ ErrorCodes ⠀ഀഀ 䔀爀爀漀爀䌀漀搀攀猀洀愀氀氀椀渀琀 NOT NULL , ਍ऀ ErrorDesc varchar ⠀㐀　⤀ 䌀伀䰀䰀䄀吀䔀匀儀䰀开䰀愀琀椀渀㄀开䜀攀渀攀爀愀氀开䌀倀㄀开䌀䤀开䄀匀一伀吀一唀䰀䰀 Ⰰഀഀ 䔀爀爀漀爀䰀漀挀愀琀椀漀渀挀栀愀爀 (1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ਍⤀ 伀一 PRIMARYഀഀ GO ਍ഀഀ

I’ve looked at the source file, and it’s definitely ASCII text, not Unicode. Here’s my program, I was trying to have all of the data types in one array and work it from that angle, but that’s beyond my skill right now:

print "\nThis program reformats scripts produced by SQL Server 2000 En
+terprise Manager\n";
print "to remove brackets and tab out data types and null settings.\n\
+n";
print "You provide a file name, this program reads it and produces a n
+ew file\n";
print "with a .out extension.\n\n";
print "File name to process? (<enter> to end program.) ";

chomp($sqlfile = <stdin>);

$outfile = $sqlfile . ".out";

$datatypes = " binary/ bigint/ bit/ char/ datetime/ "
    . " decimal/ float/ image/ int/ money/ "
    . " nchar/ ntext/ nvarchar/ numeric/ real/ "
    . " smalldatetime/ smallint/ smallmoney/ sql_variant/ "
    . " sysname/ text/ timestamp/ tinyint/ varbinary/ varchar/ "
    . " uniqueidentifier";

open(IN,  $sqlfile)    || die "cannot open $sqlfile for input: $!";
open(OUT, ">$outfile") || die "cannot open $outfile for output: $!";

while (<IN>) {
        chomp;

    #delete or comment out the two following lines
    #if you need brackets around your object names.
    s/\[//g;
    s/\]//g;
    s/ \(/\(/g;
    s/ \,/\,/g;
    s/COLLATE SQL_Latin1_General_CP1_CI_AS//g;
    s/ON PRIMARY//g;

    s/\sbinary/\t\t\tbinary/g;
    s/\sbigint/\t\t\tbigint/g;
    s/\sbit/\t\t\tbit/g;
    s/\schar/\t\t\tchar/g;
    s/\sdatetime/\t\t\tdatetime/g;
    s/\sdecimal/\t\t\tdecimal/g;
    s/\sfloat/\t\t\tfloat/g;
    s/\simage/\t\t\timage/g;
    s/\sint/\t\t\tint/g;
    s/\smoney/\t\t\tmoney/g;
    s/\snchar/\t\t\tnchar/g;
    s/\sntext/\t\t\tntext/g;
    s/\snvarchar/\t\t\tnvarchar/g;
    s/\snumeric/\t\t\tnumeric/g;
    s/\sreal/\t\t\treal/g;
    s/\ssmalldatetime/\t\t\tsmalldatetime/g;
    s/\ssmallint/\t\t\tsmallint/g;
    s/\ssmallmoney/\t\t\tsmallmoney/g;
    s/\ssql_variant/\t\t\tsql_variant/g;
    s/\ssysname/\t\t\tsysname/g;
    s/\stext/\t\t\ttext/g;
    s/\stimestamp/\t\t\ttimestamp/g;
    s/\stinyint/\t\t\ttinyint/g;
    s/\svarbinary/\t\t\tvarbinary/g;
    s/\svarchar/\t\t\tvarchar/g;
    s/\suniqueidentifier/\t\t\tuniqueidentifier/g;

    s/NOT NULL/\t\tNOT NULL/s;
    s/NULL/\t\tNULL/s;

    print "$_\n";
    print OUT "$_\n";
}

close(OUT) || die "problem closing new $outfile: $!";
close(IN)  || die "problem closing original $sqlfile: $!";
[download]

Any suggestions would be most welcome. Yes, it's a rather brute-force approach, but I'm new to Perl and it does what I want it to (mostly).

Comment on Problem with larger files (and s/) Select or Download Code

Replies are listed 'Best First'.
Re: Problem with larger files (and s/) by jds17 (Pilgrim) on Jun 24, 2008 at 22:44 UTC
Hi, your code did not output strange signs when running it, I would suspect there is something wrong with the input file, but cannot tell without further information. I would like to focus on your other problem (inserting the tabs in some places, not in others). My suggestion would be as follows, for completeness I include big parts of what you already have written. It's late so the code is not completely polished, e.g. the first two substitutions should be written with inline comments and there may be more elegant ways to write some of the regexes, but here it is: print "\nThis program reformats scripts produced by SQL Server 2000 En +terprise Manager\n"; print "to remove brackets and tab out data types and null settings.\n\ +n"; print "You provide a file name, this program reads it and produces a n +ew file\n"; print "with a .out extension.\n\n"; print "File name to process? (<enter> to end program.) "; chomp($sqlfile = <stdin>); $outfile = $sqlfile . ".out"; open(IN, $sqlfile) \|\| die "cannot open $sqlfile for input: $!"; open(OUT, ">$outfile") \|\| die "cannot open $outfile for output: $!"; while (<IN>) { #remove square brackets s/(\[\|\])//g; #remove whitespace before round brackets... s/\s+(($\|$))/$1/g; #...and commas s/\s+,/,/g; #remove some keywords s/COLLATE SQL_Latin1_General_CP1_CI_AS//g; s/ON PRIMARY//g; #remove duplicate whitespace s/\s+(\s)/$1/g; #THE MOST INTERESTING PART: #For lines not starting with non-whitespace (should hopefully be #the case only for the first line, otherwise you have to track #the line number lest you analyze keywords): #replace (single) whitespace character before word by three #tabs in case the following expression is neither "NULL" #nor "NOT NULL" s/\s+(?!(?:NOT )?NULL)([a-zA-Z]\w*)/\t\t\t$1/g if !/^\S/; print; print OUT; } END { close OUT \|\| die "problem closing new $outfile: $!"; close IN \|\| die "problem closing original $sqlfile: $!"; } [download] Some comments: 1. The most interesting part is the negative lookahead used for inserting the three tabs in the places described above. 2. You do not need to chomp the input lines since otherwise you need to add newlines afterwards again. Hope this helps a bit and gave you some new ideas.	[reply] [d/l]
Re^2: Problem with larger files (and s/) by Cloudster (Novice) on Jun 25, 2008 at 14:57 UTC
Thank you! Very interesting code, I look forward to studying and dissecting it.	[reply]
Re: Problem with larger files (and s/) by moritz (Cardinal) on Jun 24, 2008 at 22:12 UTC
`open(IN, $sqlfile) \|\| die "cannot open $sqlfile for input: $!"; open(OUT, ">$outfile") \|\| die "cannot open $outfile for output: $!";` [download] If you are sure that the files are all ASCII, try this instead: `open(IN, '<:encoding(ASCII)', $sqlfile) or die "cannot open $sqlfile f +or input: $!"; open(OUT, '>:encoding(ASCII)', $outfile or die "cannot open file $outf +ile for writing: $!";` [download] Are you really sure that the files are in ASCII? which editor did you open it with? Try to inspect the files with a hex editor/dumper and see if it's really ASCII.	[reply] [d/l] [select]
Re^2: Problem with larger files (and s/) by Cloudster (Novice) on Jun 25, 2008 at 14:51 UTC
I was certain that the file was ASCII. Correction, I was absolutely certain that it was ASCII. It wasn't. There are two ways of generating scripts in SQL Server 2000's Enterprise Manager. The script generator can save them to a file, or preview them and you can copy/paste into whatever you want. This is the route that I normally go. I did not notice that the final option screen let you specify OEM, ANSI or Unicode, and it defaults to Unicode. facepalm bang head against wall rinse, repeat I couldn't find a hex mode in TextPad, turns out that you specify a binary file type in the Open dialog, and then it quite obviously was Unicode.	[reply]
Re: Problem with larger files (and s/) by jethro (Monsignor) on Jun 24, 2008 at 22:40 UTC
The NOT\t\tNULL issue is trivially corrected with `if (not s/NOT NULL/\t\tNOT NULL/s) { s/NULL/\t\tNULL/s; }` [download] (at least I hope so, didn't test it). The following regex works too: `s/(NOT\s+)?NULL/\t\t$1NULL/s` [download] To find the solution to the other problem I would suggest loading that 9k file into an editor (not word, not wordpad, but you probably know that), cut it by half and feeding it to your program again. If the problem is still there when it is 300 bytes long you know it isn't because of size, but probably the wrong file format. If not, put another print statement before all the regexp to see if perl reads the file incorrectly or mangles it in your code (which to me looks perfectly ok).	[reply] [d/l] [select]
Re^2: Problem with larger files (and s/) by Cloudster (Novice) on Jun 25, 2008 at 14:54 UTC
Thank you very much for your trivial correction, it was spot-on. I'm learning Perl via an older copy of O'Reilly's Perl CD collection, and had not yet seen an example of using /s as an if test. Works like a charm. As far as the Unicode problem, see my reply to the previous reply. facepalm again	[reply]
Re^3: Problem with larger files (and s/) by jethro (Monsignor) on Jun 25, 2008 at 16:39 UTC
Sorry, the 'trivially' wasn't meant as in 'trivial problem' but more like 'less sophisticated solution', i.e. without packing it into one regex or using negative lookahead (which I wanted to mention, but I couldn't remember the name, which was ample proof to me that it is a non-trivial solution). ;-)	[reply]
Re^4: Problem with larger files (and s/) by Cloudster (Novice) on Jul 02, 2008 at 16:26 UTC