weston2010 has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and am having a very weird print issue.

The Perl program runs on Windows XP. It first executes a SQL query, then loops through the results and writes to 5 files via 5 subroutines. The 5 files are to be loaded into a database, so the program uses `|` as the delimiter. The weird thing is that sometimes the output is fine and sometimes it is corrupted, e.g. the line feed is missing after some point, or the values from the array are not correct. I am wondering if it has something to do with memory. The output file sizes range from 500MB to 9GB. The program reads the SQL output one record at a time and writes one record at a time too. Here is the complete Perl script.

#!/usr/bin/perl
use DBI;
use DBD::Oracle;

# Constants:
use constant field0 => 0;
use constant field1 => 1;
use constant field2 => 2;
use constant field3 => 3;
use constant field4 => 4;
use constant field5 => 5;
use constant field6 => 6;
use constant field7 => 7;
use constant field8 => 8;
use constant field9 => 9;
use constant field10 => 10;
use constant field11 => 11;
use constant field12 => 12;
use constant field13 => 13;
use constant field14 => 14;
use constant field15 => 15;
use constant field16 => 16;
use constant field17 => 17;
use constant field18 => 18;
use constant field19 => 19;
use constant field20 => 20;
use constant field21 => 21;
use constant field22 => 22;
use constant field23 => 23;
use constant field24 => 24;
use constant field25 => 25;
use constant field26 => 26;
use constant field27 => 27;
use constant field28 => 28;
use constant field29 => 29;
use constant field30 => 30;
use constant field31 => 31;
use constant field32 => 32;
use constant field33 => 33;
use constant field34 => 34;
use constant field35 => 35;
use constant field36 => 36;
use constant field37 => 37;
use constant field38 => 38;
use constant field39 => 39;
use constant field40 => 40;
use constant field41 => 41;

# Capture Directory Path from Environment Variable:
my $DIRECTORY = $ENV{DATADIR};

# Process Counters:
my %fileCntr = ( ccr1 => 0, ccr2 => 0, ccr3 => 0, ccr4 => 0, ccr5 => 0 );

# Process Control Hashes:
my %xref = ();

# Process Control Variables:
my $diag = 0;
my $proc = 0;
my $ndcc = 0;
my $previous = "";

# Claims Extract array:
my @arr = ();
my $hdr = "";

# Accept/Parse DSS Connection String:
$ENV{PSWD} =~ /(.+)\/(.+)\@(.+)/;
my $USER = $1;
my $PASS = $2;
my $CONN = 'DBI:Oracle:' . $3;

# ALTER Date format:
my $ATL = qq(ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD');

# Database Connection:
my $dbh = DBI->connect( $CONN, $USER, $PASS, { RaiseError => 1, AutoCommit => 0 } );
$dbh->do($ATL); # Execute ALTER session.

my $SQL = qq(
SELECT ...
here is a big sql query
);

# Open OUTPUT file for CCR processing:
open OUT1, ">$DIRECTORY/ccr1.dat" or die "Unable to open OUT1 file: $!\n";
open OUT2, ">$DIRECTORY/ccr2.dat" or die "Unable to open OUT2 file: $!\n";
open OUT3, ">$DIRECTORY/ccr3.dat" or die "Unable to open OUT3 file: $!\n";
open OUT4, ">$DIRECTORY/ccr4.dat" or die "Unable to open OUT4 file: $!\n";
open OUT5, ">$DIRECTORY/ccr5.dat" or die "Unable to open OUT5 file: $!\n";

# Redirect STDOUT to log file:
open STDOUT, ">$DIRECTORY/ccr.log" or die "Unable to open LOG file: $!\n";

# Prepare $SQL for execution:
my $sth = $dbh->prepare($SQL);
$sth->execute();

# Produce out files:
{
    local $, = "|";
    local $\ = "\n";
    while (@arr = $sth->fetchrow_array) {
        # Direct Write of CCR1&2 records:
        &BuildCCR12();
        # Write and Wipe CCR3 HASH Table:
        &WriteCCR3() unless ($arr[field0] == $previous);
        &BuildCCR3();
        # Loop processing for CCR4:
        &BuildCCR4();
        # Loop processing for CCR5:
        &BuildCCR5();
    }
}

# Print Record Counts for OUTPUT files:
foreach my $key (keys %fileCntr) {
    print "$key: " . $fileCntr{$key} . "\n";
}

# Terminate DB connection:
$sth->finish();
$dbh->disconnect();

# Close all output files:
close(OUT1);
close(OUT2);
close(OUT3);
close(OUT4);
close(OUT5);

{
    # Reassign Output End-of-record across subroutine block:
    local $\ = "\n";

    sub BuildCCR12 {
        # Write CCR1 Table:
        print OUT1 $arr[field6] . '|' . $arr[field7] . '|' . $arr[field5];
        $fileCntr{ccr1}++;
        # Write CCR2 Table:
        unless ($arr[field17] eq '###########') {
            print OUT2 ++$ndcc . "|" . $arr[field0] . "|" . $arr[field6];
            $fileCntr{ccr2}++;
        }
    }

    sub WriteCCR3 {
        unless ($previous == "") {
            # Produce ccr3 from DISTINCT combo listing:
            foreach $key (keys %xref) {
                print OUT3 $xref{$key};
                $fileCntr{ccr3}++;
            }
            %xref = ();
        }
    }

    sub BuildCCR3 {
        # Spin off relationship:
        for (my $i = field8; $i <= field13; $i++) {
            unless ($arr[$i] == -1) {
                $xref{$arr[field0] . "|" . $arr[$i]} = $arr[field0] . "|" . $arr[$i];
            }
        }
        $previous = $arr[field0];
    }

    sub BuildCCR4 {
        # Spin off relationship:
        for (my $i = field26; $i <= field37; $i++) {
            my $sak = $arr[field0] . $arr[field6] . $arr[field7] . $arr[$i];
            unless (($arr[$i] eq '#######') or ($arr[$i] eq '######')) {
                print OUT4 ++$diag . '|' . $arr[field0] . '|' . $arr[field6] . '|' . $arr[field7] . '|' . $arr[$i];
                $fileCntr{ccr4}++;
            }
        }
    }

    sub BuildCCR5 {
        # Spin off field0/Procedure relationship:
        for (my $i = field20; $i <= field23; $i++) {
            my $sak = $arr[field0] . $arr[field6] . $arr[field7] . $arr[$i];
            unless ($arr[$i] eq '######' or $arr[$i] eq '####') {
                print OUT5 ++$proc . '|' . $arr[field0] . '|' . $arr[field6] . '|' . $arr[field7] . '|' . $arr[$i];
                $fileCntr{ccr5}++;
            }
        }
    }
}

The issue is with the CCR3 output. After some point, the line feed disappears for some reason, and the data gets corrupted as if the line feed ate some of the output. From that point on, the output becomes one continuous line.

3260183|147845
3260183|78246
3260183|13898
3260183|184783
3260183|116315
3260183|184483262216|105843262217|1461703262217|175593262217|1360303262217

Another thing: this program runs for close to 26 hours. While it loops through the SQL results, is there any chance the data can get messed up? Even so, that still would not explain why the line feed suddenly stops working.

Replies are listed 'Best First'.
Re: corrupted print output
by davido (Cardinal) on Sep 30, 2011 at 06:53 UTC

    Well, if there is a suspicion that $\ is a problem (which seems unlikely, but nevertheless...), why not eliminate it as a potential issue by just explicitly using "\n" in your prints?
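A minimal sketch of the "explicit newline" suggestion above. The helper name write_record is hypothetical and not part of the original script; the in-memory filehandle is used only so the example runs without a database or output directory.

```perl
use strict;
use warnings;

# Hypothetical helper: joins the fields with '|' and ends the record with an
# explicit "\n", so the result does not depend on $\ or $, being set elsewhere.
sub write_record {
    my ($fh, @fields) = @_;
    local $\ = undef;    # neutralise any output record separator set by the caller
    local $, = undef;    # neutralise any output field separator set by the caller
    print {$fh} join('|', @fields), "\n"
        or die "write failed: $!";
    return;
}

# Demonstration against an in-memory filehandle.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
write_record($out, 3260183, 147845);
write_record($out, 3260183, 78246);
close $out or die "close failed: $!";

print $buffer;
```

Because the separators are localised inside the helper, the record terminator no longer depends on where a `local $\` happens to be in effect.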

    I looked over your code and found a lot of little red flags here and there: closing filehandles without checking return values (i.e., without checking for errors), and doing a capturing pattern match without a check to ensure it matched. One that I looked at was your handling of $previous. It appears that the variable is supposed to hold a numeric value (you do several comparisons using ==). But early on in the script you assign $previous = "", and later you test unless ($previous == ""). Luckily an empty string will equate to zero in a numeric comparison, so asking if "" == "" yields the same result as asking if "" eq "" (both evaluate to true). But it still made me queasy, so I changed it to 'eq'. It may have been better to initially set $previous to undef, and then test whether it's defined().
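The undef/defined() idea for $previous could look roughly like this. The function name id_changed and the sample ids are made up for illustration. Note also that under `use warnings` a comparison such as `"" == ""` emits an "isn't numeric" warning, which is one more reason to avoid it.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the current record id differs from the
# previous one. $previous stays undef until the first record has been seen,
# so no numeric comparison against an empty string is ever needed.
sub id_changed {
    my ($previous, $current) = @_;
    return 0 unless defined $previous;    # first record: nothing to flush yet
    return $current != $previous ? 1 : 0;
}

# Sample ids standing in for $arr[field0]; the values are illustrative only.
my @ids = (3260183, 3260183, 3262216, 3262217, 3262217);

my $previous;    # undef rather than ""
my @flushed;     # ids whose group would be written out at this point
for my $id (@ids) {
    push @flushed, $previous if id_changed($previous, $id);
    $previous = $id;
}

print "@flushed\n";
```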

    I also made a few other changes: removed variables that were never used, set 'use autodie', eliminated some simple cases of "too much typing", and fixed your constants to be upper-cased for clarity. I reformatted the script to be easier to look at (often as I run through a script fixing its formatting I spot obvious errors that weren't so obvious when obscured by untidy formatting). I removed the useless '&' from your function calls, and eliminated the two assignments to $\. Your code also assigned to '$,', but none of your print statements printed lists (they all used concatenation with '|'), so $, wasn't doing anything for you. Removed.
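To illustrate why $, was doing nothing in the original: the output field separator is only inserted between the items of a list passed to print. A single concatenated string is one item, so there is nothing to separate. The render helper below is hypothetical and exists only to capture the output for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: runs a callback with $, and $\ set the way the original
# script sets them, and returns whatever the callback printed.
sub render {
    my ($code) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    {
        local $, = '|';
        local $\ = "\n";
        $code->($fh);
    }
    close $fh or die "close failed: $!";
    return $buffer;
}

# One concatenated string: $, is never used, the '|' comes from the concatenation.
my $concatenated = render(sub { my $fh = shift; print {$fh} 'a' . '|' . 'b' });

# A two-item list: here $, supplies the '|'.
my $as_list = render(sub { my $fh = shift; print {$fh} 'a', 'b' });

print $concatenated, $as_list;
```

Both calls produce the same record, which is why removing the assignment to $, does not change the output of a script that only prints concatenated strings.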

    I wasn't able to run your script because I don't have the database that you're querying, and even if I did, you truncated the "big SQL query" anyway (and even if you hadn't, I probably wouldn't have run it). I did verify it compiles under strictures though (had to declare $key in one loop to get it to pass strictures).

    On a maintainability note, it's really unfortunate that none of the subs actually pass parameters. Everything is sort of absorbed through broader-scope osmosis. That contributes to hard-to-follow code, and the potential for bugs at a distance. If one function modifies a variable that is from a broader scope, and that variable is then used in another function, there's a side effect that took place, and it's hard to spot where. That sort of issue I didn't touch. I just left it the way it was.
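As a rough sketch of what passing parameters could look like, here is a variant of the BuildCCR4 logic that receives the row, the index range, and the counter explicitly and returns the records instead of printing them. The name build_ccr4_records and the shortened sample row are assumptions for illustration, not the modified script.

```perl
use strict;
use warnings;

# Hypothetical parameter-passing variant of BuildCCR4. Nothing is read from
# or written to variables outside the sub except through its arguments.
sub build_ccr4_records {
    my ($row, $first, $last, $counter_ref) = @_;
    my @records;
    for my $i ($first .. $last) {
        my $value = $row->[$i];
        # Same placeholder values the original code skips.
        next if $value eq '#######' or $value eq '######';
        push @records, join '|', ++$$counter_ref, @{$row}[0, 6, 7], $value;
    }
    return @records;
}

# Shortened sample row (the real row has 42 columns); indices 8 and 9 stand in
# for the field26..field37 range of the original.
my @row  = (3260183, ('x') x 5, 'A', 'B', 'D1', '#######');
my $diag = 0;
my @records = build_ccr4_records(\@row, 8, 9, \$diag);

print "$_\n" for @records;
```

The caller decides which filehandle the returned records go to, so the sub can be tested without any database or output file.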

    Another concern is all these FIELD1, FIELD2 and so on constants. Is it really the case that the fields are so uninteresting that they couldn't be given meaningful identifiers in your list of constants? How could you keep them all straight while writing this? Did you have a handwritten cross-reference chart mapping FIELD27 to the BREAST_SIZE column of your database? (ok, I made that up.) Meaningful names are easier to work with most of the time. The same goes for OUT1, OUT2, OUT3... ccr1, ccr2, ccr3, BuildCCR4, BuildCCR5, BuildCCR12.... That's the clearest use of identifier names possible? If not, think about giving them names that someone six months from now looking at your code would be able to comprehend without the use of that cross-reference chart. (My opinion, which may or may not be worth anything in this case.)

    So here's a somewhat modified version:

    If it still fails, you will know that the problem wasn't related to the use of $\. But maybe you'll be lucky and find that it works better now. If I broke something kindly just keep it to yourself (just kidding :)


    Dave

      Thank you Dave for going through the code in such detail. Based on your suggestion, I have made one more modification, i.e. using 0 instead of "" when initializing $previous, since $arr[FIELD0] is numeric too. I will run the script and hopefully everything will be fine. :)
Re: corrupted print output
by Anonymous Monk on Sep 29, 2011 at 23:34 UTC

    Using use constant fRECORD_ID => 0; can be considered an improvement

    Using use constant field0 => 0; can be considered a mistake
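A short sketch of the named-constant idea. The column names below are purely hypothetical, since the thread never says what field0, field6 and field7 actually hold; only the index values come from the original script.

```perl
use strict;
use warnings;

# Hypothetical column names (assumptions, not the real schema).
use constant {
    RECORD_ID  => 0,
    ACCOUNT_ID => 6,
    SERVICE_ID => 7,
};

# Illustrative row; positions 1..5 are unused in this example.
my @row = (3260183, (undef) x 5, 'A1', 'B2');

# The slice now documents itself instead of relying on a lookup chart.
my $line = join '|', @row[ACCOUNT_ID, SERVICE_ID, RECORD_ID];
print "$line\n";
```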

Re: corrupted print output
by onelesd (Pilgrim) on Sep 29, 2011 at 21:52 UTC

    If you are running on Windows, shouldn't you set $\ = "\r\n"; in your code?

      No. CRLF is converted to LF on read, so CRLF would never match.

      \n is meant to be portable in Perl. Perl should make the right decision about what the underlying byte sequence for \n should be on output. The sticking point for me has been processing text files on one OS that were generated on another OS that uses a different newline sequence.
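A small sketch of that portability point, assuming the :crlf PerlIO layer, which is the default for text handles on Windows builds of Perl. The layer is requested explicitly here so the behaviour can be observed on any platform. The helper name written_bytes is made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes one record with $\ = "\n" through the given
# layer and returns the raw bytes that ended up in the file.
sub written_bytes {
    my ($layer) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp or die "close failed: $!";

    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    {
        local $\ = "\n";
        print {$out} 'a|b';
    }
    close $out or die "close failed: $!";

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in or die "close failed: $!";
    return $bytes;
}

my $text_mode = written_bytes(':crlf');   # "\n" is translated to CR LF on output
my $raw_mode  = written_bytes(':raw');    # "\n" is written as a bare LF

printf "crlf layer: %d bytes, raw layer: %d bytes\n",
    length $text_mode, length $raw_mode;
```

With the :crlf layer in effect, a "\n" supplied through $\ is translated just like a "\n" printed explicitly, so setting $\ = "\r\n" on top of it would yield CR CR LF.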

      I have run into an issue with chomp in this case. String::Util offers a replacement for this case in fullchomp, though it's simple enough to write one yourself.
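Writing such a replacement yourself might look like the following. This is a sketch under the assumption that a line ends in at most one of CR LF, LF, or CR; the name full_chomp is made up here and is not the String::Util implementation.

```perl
use strict;
use warnings;

# Hypothetical chomp replacement: strips one trailing line ending regardless
# of which OS convention produced it, and returns the cleaned copy.
sub full_chomp {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\n|\r)\z//;
    return $line;
}

for my $raw ("unix\n", "windows\r\n", "oldmac\r", "none") {
    printf "[%s]\n", full_chomp($raw);
}
```

Unlike chomp, this returns the cleaned string rather than modifying its argument in place, which keeps it usable inside expressions.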

        But does Perl behave the same way when $\ is set (the default is undef)? I don't have a Windows machine to test on, but I would assume since he told Perl the newline should be a "\n" that Perl would respect his wishes.

        One thing OP: are you viewing your output file in Notepad? If so, try Wordpad and see how it looks.