BigRedEO has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that is ALMOST doing what I want. I had a working program to scrub and format data before it gets loaded into a MySQL table, built against "test" data I was given as a .CSV file, and it worked just fine. Then I got the actual data and discovered there were quite a few error records in it - out of about 65,400 records, 272 were missing most of their data, and most importantly, the three fields used to create the Primary Key were all empty/blank. So I threw in what I figured would be an easy IF/ELSE - if those three fields are blank, write that record to an error file, else scrub the data. It does work in that it separates the error records from my "clean" file, but the error file is all wrong. The .CSV file I'm testing with has over 65,400 lines and the "clean" file is that count minus the 272 error records. But my "error" file ends up with 8 million records in it! So I'm clearly missing something here, and I was hoping another pair of eyes would catch it. (PS - no, I do not use the Text::CSV module. It is not available to me, nor will it be.) Here are sample records from the .CSV file, with the 4th record being one of the "error" records -
650096571,1,1,used as store paint,14,IFC 8012NP,Standalone-9,3596,56,1/31/2015,80813,A97W01251,,1/16/2015,0.25,0.25,,SW,CUSTOM MATCH,TRUE,O,xts,,,,,,,1568,61006,1,FALSE
650368376,1,3,Tinted Wrong Color,16,IFC 8012NP,01DX8015206,,6,1/31/2015,160720,A87W01151,MATCH,1/31/2015,1,1,ENG,CUST,CUSTOM MATCH,TRUE,O,Ci52,,,,,,,1584,137252,1,FALSE
650175433,3,1,not tinted - e.w.,16,COROB MODULA HF,Standalone-7,,2,1/31/2015,95555,B20W02651,,1/29/2015,3,3,,COMP,CUSTOM MATCH,TRUE,P,xts,,,,,,,1627,68092,5,FALSE
650187016,2,1,checked out under cash ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
And here is my script -
#!/usr/bin/perl/
use strict;
use warnings;
use Data::Dumper;
use Time::Piece;

my $filename = 'uncleaned.csv';

# Open input file
open my $FH, $filename or die "Could not read from $filename <$!>, program halting.";

# Open error handling file
open ( my $ERR_FH, '>', "errorFiles.csv" ) or die $!;

# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;

my @data;

# Read the lines one by one.
while($line = <$FH>) {
    chomp($line);

    # Scrub data of characters that cause scripting problems down the line.
    $line =~ s/[\'\\]/ /g;

    # split the fields of each record
    my @fields = split(/,/, $line);

    # Check if the storeNbr field is empty. If so, write record to error file.
    if (!length $fields[28]) {
        print $ERR_FH join (',', @$_), $/ for @data;
    }
    else {
        # Concatenate the first three fields and add to the beginning of each record
        unshift @fields, join '_', @fields[28..30];

        # Format the DATE fields for MySQL
        $_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];

        # Scrub colons from the data
        $line =~ s/:/ /g;

        # If Spectro_Model is "UNKNOWN", change
        if($fields[22] eq "UNKNOWN"){
            $_ = 'UNKNOW' for $fields[22];
        }

        # If tran_date is blank, insert 0000-00-00
        if(!length $fields[10]){
            $_ = '0000-00-00' for $fields[10];
        }

        # If init_tran_date is blank, insert 0000-00-00
        if(!length $fields[14]){
            $_ = '0000-00-00' for $fields[14];
        }

        # If update_tran_date is blank, insert 0000-00-00
        if(!length $fields[24]){
            $_ = '0000-00-00' for $fields[24];
        }

        # If cancel_date is blank, insert 0000-00-00
        if(!length $fields[26]){
            $_ = '0000-00-00' for $fields[26];
        }

        # Format the PROD_NBR field by deleting any leading zeros before decimals.
        $fields[12] =~ s/^\s*0\././;

        # put the records back
        push @data, \@fields;
    }
}

close $FH;
close $ERR_FH;

print "Unsorted:\n", Dumper(@data); #, $/;

# Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
       $a->[0]  cmp $b->[0]
    || $a->[14] cmp $b->[14]
    || $a->[26] cmp $b->[26]
    || $a->[27] cmp $b->[27]
} @data;

open my $OFH, '>', '/path/cleaned.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;

exit;

Replies are listed 'Best First'.
Re: Parse one file, send the records to two different files
by BrowserUk (Patriarch) on May 27, 2016 at 16:29 UTC
    But my "error" file ends up with 8 million records in it! So I've got something I'm missing here and I was hoping another pair of eyes would catch it.

    You're printing the wrong array (@data instead of @fields) to your errors file.

    # split the fields of each record
    my @fields = split(/,/, $line);

    # Check if the storeNbr field is empty. If so, write record to error file.
    if (!length $fields[28]) {
        print $ERR_FH join (',', @$_), $/ for @data;
    }

    @data is where you're accumulating your good records; and every time you find a bad record, you're writing all the good records you've accumulated so far into your errors file, instead of the single bad record.
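
    A minimal sketch of a fix, keeping your variable names: in that branch, print only the record you just split into @fields (or the raw $line), rather than looping over @data:

    if (!length $fields[28]) {
        # write just this one bad record to the error file
        print $ERR_FH join(',', @fields), $/;
    }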


Re: Parse one file, send the records to two different files
by stevieb (Canon) on May 27, 2016 at 16:29 UTC

    Near the end you have push @data, \@fields;, and earlier you do print $ERR_FH join (',', @$_), $/ for @data;, which, from what I can tell, prints every element of @data to the error file each time a bad record turns up. Since @data keeps growing as you go, the output compounds, which is probably why you're seeing 8mil entries.
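
    (Rough arithmetic, assuming the 272 bad records are scattered fairly evenly through the file: each bad record dumps every good record accumulated so far, averaging around half of the ~65,400 lines per dump, so roughly 272 x 32,700, or about 8.9 million printed lines - which lines up with the 8 million you're seeing.)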

    Should that be:

    print $ERR_FH join(',', $_) for @fields;

    ?

      Thank you both - I was simply copying my last PRINT statement from the end of the script when I added the IF/ELSE statement. But now I've run into a new problem - after making that simple change, the script stops at record 140, which is the first "error" record in the file (line 4 of my sample records in my original post), and gives me this message before stopping - Can't use string ("650187016") as an ARRAY ref while "strict refs" in use at AlterData.pl line 39, <$FH> line 140.

        Are you still using @$_ in your join statement? ie.

        print $ERR_FH join(',', @$_) for @fields;
                                ^^^

        If so, remove the @ so that you use the contents of $_ as-is, without attempting to dereference it as an array:

        print $ERR_FH join(',', $_) for @fields;
                                ^^
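
        Note that join(',', $_) over a single field is a no-op, so that line prints the fields run together with no commas and no newline. If the goal is to write each bad record as one comma-separated line, a sketch like this might be closer:

        print $ERR_FH join(',', @fields), $/;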