Data type validation using regular expressions

Category:	Miscellaneous
Author/Contact Info	www.dwoptimize.com
Description:	This is the first Perl script I ever wrote. Any input to make it better, or do differently (using some existing model, for example) highly appreciated. The simple perl script proof of concept demonstrated here, performs data type validation on the provided data file. The script creates an output file with bad data attributes substituted by default values. The data type specification and default values are read from a specification file. Usage example: Consider sales_payment.dat file: `A\|10.50\|CC\|2006/12/05\|10:05:15 2\|12A\|Cash\|2006/12/05\|10:12:18 3\|100\|12 Un\|2006/12/05\|10:15:23 4\|.85\|A1\|2006/12/05\|10:18:00 5\|-100\|B2\|2006/12/05\|10:20:00 6\|\|C\|2006/12/05\|10:22:00 7\|100\|\|2006/12/05\|10:26:00 8\|200\|D\|2006/02/31\|10:32:00 9\|2006/02/31\|10:33:00 10\|400\|E\|2006/03/40\|30:35:00 11\|400\|F\|1234\|10:41:AA 10\|300\|G\|2006/02/31\|10:05:15` [download] A specification file sales_payment.spec is created; the file contains metadata - data attribute name, attribute data type defined using regular expressions, and default data value that is used when the data file contains bad data - separated by commas (','): `transaction_number,^\d+$,-1 total_basket_amount,^[-+]?[0-9]*\.?[0-9]+$,0 payment_type,^\w$,_Unknown date,(19\|20)\d\d[/](0[1-9]\|1[012])[/](0[1-9]\|[12][0-9]\|3[01]),1900/01/ +01 time,^([0-1][0-9]\|[2][0-3]):([0-5][0-9]):([0-5][0-9])$,00:00:00` [download] When we run data validation: `unix> perl validate_data_type.pl "sales_payment.spec" "sales_payment.d +at" "sales_payment_out.dat" "sales_payment.log" 50` [download] ...the output data file is created: `-1\|10.50\|CC\|2006/12/05\|10:05:15 2\|0\|Cash\|2006/12/05\|10:12:18 3\|100\|_Unknown\|2006/12/05\|10:15:23 4\|.85\|A1\|2006/12/05\|10:18:00 5\|-100\|B2\|2006/12/05\|10:20:00 6\|0\|C\|2006/12/05\|10:22:00 7\|100\|_Unknown\|2006/12/05\|10:26:00 8\|200\|D\|2006/02/31\|10:32:00 10\|400\|E\|1900/01/01\|00:00:00 11\|400\|F\|1900/01/01\|00:00:00 10\|300\|G\|2006/02/31\|10:05:15` [download] ...along with a log file: Spec File> sales_payment.spec Data In File> sales_payment.dat Data Out File> sales_payment_out.dat Log File> sales_payment.log Max errors: 50 Error 1. Data type error on line: 1, attribute: 1 (transaction_number) Error 2. Data type error on line: 2, attribute: 2 (total_basket_amount +) Error 3. Data type error on line: 3, attribute: 3 (payment_type) Error 4. Data type error on line: 6, attribute: 2 (total_basket_amount +) Error 5. Data type error on line: 7, attribute: 3 (payment_type) Error 6. On the data line: 9, # attributes: 3, do not match # attribut +es in the file specification: 5 Error 7. Data type error on line: 10, attribute: 4 (date) Error 8. Data type error on line: 10, attribute: 5 (time) Error 9. Data type error on line: 11, attribute: 4 (date) Error 10. Data type error on line: 11, attribute: 5 (time) Process completed with: 10 errors Formatted documentation available at: <a href=http://www.dwoptimize.co +m/2007/05/data-type-validation-using-regular.html>www.dwoptimize.com< +/a> [download]
#!/usr/local/bin/perl -w # # Version 0.1 # http://www.dwoptimize.com/2007/05/data-type-validation-using-regular +.html # jag.singh@dwoptimize.com # # 1. Read specification file that defines the data file layout: # 1.1. attribute name # 1.2. attribute data type using regular expressions (http://www.perl. +com/pub/a/2000/11/begperl3.html) # 1.3. default data value # # 2. Validate data file for data type; replace file attribute data wit +h default value # if data type does not match specification # # 3. Create output data file with bad data values replaced by the defa +ult values # # 4. Create log file with results of data validation # # 5. Abort data validation process, if total number of errors reach ma +x_errors # ($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = +@ARGV; # read command line parameters open (log_file, ">$log_file") or die "Can not open file $log_file, $!" +; open (spec_file, "$spec_file") or die "Can not open file $spec_file, $ +!"; open (data_in_file, "$data_in_file") or die "Can not open file $data_i +n_file, $!"; open (data_out_file, ">$data_out_file") or die "Can not open file $dat +a_out_file, $!"; print log_file "Spec File> ", $spec_file, "\n", "Data In File> ", $dat +a_in_file, "\n", "Data Out File> ", $data_out_file, "\n", "Log File> ", $log_file, "\ +n", "Max errors: ", $max_errors, "\n"; # foreach $spec_line () { # Read full data file specification into memor +y from the spec file, # this will be used for "lookup" chomp ($spec_line); # remove the newline from $spec_line. @spec_one_attribute = split(/\,/, $spec_line); # the spec file is ', +' delimited # @spec_one_attribute contains: attribute name, attribute data typ +e (regular expression), and default value push (@spec_all_attributes, [@spec_one_attribute]); # @spec_all_attr +ibutes contain the full data file specification } # $line_number = 1; $total_errors = 0; DATALINE: foreach $data_in_file () { # read data file, line by line chomp ($data_in_file); # remove the newline @data_in_attributes = split (/\\|/, $data_in_file); # the data file i +s '\|' delimited if ($#data_in_attributes != $#spec_all_attributes) { # number of attributes on the data line do not match with the spec +ification $total_errors++; print log_file "Error ", $total_errors, ". On the data line: ", $l +ine_number, ", # attributes: ", $#data_in_attributes + 1, ", do not match # attributes in the file specification: ", $#spe +c_all_attributes + 1, "\n"; last DATALINE if ($total_errors >= $max_errors); # terminate if to +o many errors next; # skip data attribute type validation } $attribute_position = 0; @data_out_attributes = (); foreach $attribute (@data_in_attributes) { if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) + { # validate data attribute type by performing # lookup for the regular expression from the spec memory structu +re push (@data_out_attributes, $attribute); # Correct data type, th +e output value is same as input value } else { push (@data_out_attributes, $spec_all_attributes[$attribute_posi +tion][2]); # Bad data type, use default provided in the spec for output v +alue $total_errors++; print log_file "Error ", $total_errors, ". Data type error on li +ne: ", $line_number, ", attribute: ", $attribute_position + 1, " (", $spec_all_attr +ibutes[$attribute_position][0], ")\n"; } last DATALINE if ($total_errors >= $max_errors); # terminate if to +o many errors $attribute_position++; } print data_out_file join ("\|", @data_out_attributes), "\n"; # the da +ta out file is '\|' delimited } continue { # update line number counter even if the data attribute t +ype validation is skipped $line_number++; } # if ($total_errors >= $max_errors) { print log_file "Max error count reached: ", $total_errors, ", proces +s terminated\n"; } else { print log_file "Process completed with: ", $total_errors, " errors\n +"; } # End

Back to Code Catacombs