comment on

I have a program that is ALMOST doing what I want. I had a working program to scrub and format data before it gets loaded into a MySQL table, after working with "test" data that I was given in the form of a .CSV file. Worked just fine. Then I got the actual data and discovered there were quite a few error records in there - out of about 65,400 records, 272 of them were missing most of the data - most importantly, the three fields to create a Primary Key are all empty/blank. So I threw in what I figured would be an easy IF/ELSE - if those three fields were blank, write that record to an error file, else scrub the data. It does work in that it has separated out the error records from my "clean" file, but the error file is all wrong. The .CSV file I'm testing with has over 65,400 lines and the "clean" file is that amount minus 272 records. But my "error" file ends up with 8 million records in it! So I've got something I'm missing here and I was hoping another pair of eyes would catch it. (PS - no, I do not use the Text::CSV module. It is not available to me, nor will it be) Here are sample records from the .CSV file with the 4th record being one of the "error" records -

650096571,1,1,used as store paint,14,IFC 8012NP,Standalone-9,3596,56,1
+/31/2015,80813,A97W01251,,1/16/2015,0.25,0.25,,SW,CUSTOM MATCH,TRUE,O
+,xts,,,,,,,1568,61006,1,FALSE
650368376,1,3,Tinted Wrong Color,16,IFC 8012NP,01DX8015206,,6,1/31/201
+5,160720,A87W01151,MATCH,1/31/2015,1,1,ENG,CUST,CUSTOM MATCH,TRUE,O,C
+i52,,,,,,,1584,137252,1,FALSE
650175433,3,1,not tinted - e.w.,16,COROB MODULA HF,Standalone-7,,2,1/3
+1/2015,95555,B20W02651,,1/29/2015,3,3,,COMP,CUSTOM MATCH,TRUE,P,xts,,
+,,,,,1627,68092,5,FALSE
650187016,2,1,checked out under cash ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
[download]

And here is my script -

#!/usr/bin/perl/

use strict;
use warnings;
use Data::Dumper;
use Time::Piece;

my $filename = 'uncleaned.csv';

# Open input file
open my $FH, $filename
  or die "Could not read from $filename <$!>, program halting.";

# Open error handling file
open ( my $ERR_FH, '>', "errorFiles.csv" ) or die $!;

# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;

my @data;

# Read the lines one by one.
while($line = <$FH>) {

    chomp($line);

# Scrub data of characters that cause scripting problems down the line
+.
    $line =~ s/[\'\\]/ /g;

# split the fields of each record

    my @fields = split(/,/, $line);

# Check if the storeNbr field is empty.  If so, write record to error 
+file.
    if (!length $fields[28]) {
        print $ERR_FH join (',', @$_), $/ for @data;
        }
    else
    {

# Concatenate the first three fields and add to the beginning of each 
+record
    unshift @fields, join '_', @fields[28..30];

# Format the DATE fields for MySQL
    $_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];

# Scrub colons from the data
    $line =~ s/:/ /g;

# If Spectro_Model is "UNKNOWN", change
    if($fields[22] eq "UNKNOWN"){
        $_ = 'UNKNOW' for $fields[22];
        }

# If tran_date is blank, insert 0000-00-00
    if(!length $fields[10]){
        $_ = '0000-00-00' for $fields[10];
        }

# If init_tran_date is blank, insert 0000-00-00
    if(!length $fields[14]){
        $_ = '0000-00-00' for $fields[14];
        }

# If update_tran_date is blank, insert 0000-00-00
    if(!length $fields[24]){
        $_ = '0000-00-00' for $fields[24];
        }

# If cancel_date is blank, insert 0000-00-00
    if(!length $fields[26]){
        $_ = '0000-00-00' for $fields[26];
        }

# Format the PROD_NBR field by deleting any leading zeros before decim
+als.
    $fields[12] =~ s/^\s*0\././;

# put the records back
    push @data, \@fields;
}
}

close $FH;
close $ERR_FH;

print "Unsorted:\n", Dumper(@data); #, $/;

#Sort the clean files on Primary Key, initTranDate, updateTranDate, an
+d updateTranTime
@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[14] cmp $b->[14] ||
    $a->[26] cmp $b->[26] ||
    $a->[27] cmp $b-> [27]
} @data;

open my $OFH, '>', '/path/cleaned.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;

exit;
[download]

In reply to Parse one file, send the records to two different files by BigRedEO

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.