sourcecode
Anonymous Monk
<code>
#!/usr/local/bin/perl -w
#
# Version 0.1
# http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html
# jag.singh@dwoptimize.com
#
# 1. Read specification file that defines the data file layout:
# 1.1. attribute name
# 1.2. attribute data type using regular expressions (http://www.perl.com/pub/a/2000/11/begperl3.html)
# 1.3. default data value
#
# 2. Validate data file for data type; replace file attribute data with default value
# if data type does not match specification
#
# 3. Create output data file with bad data values replaced by the default values
#
# 4. Create log file with results of data validation
#
# 5. Abort data validation process, if total number of errors reach max_errors
#
($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV; # read command line parameters
open (log_file, ">$log_file") or die "Can not open file $log_file, $!";
open (spec_file, "$spec_file") or die "Can not open file $spec_file, $!";
open (data_in_file, "$data_in_file") or die "Can not open file $data_in_file, $!";
open (data_out_file, ">$data_out_file") or die "Can not open file $data_out_file, $!";
print log_file "Spec File> ", $spec_file, "\n", "Data In File> ", $data_in_file, "\n",
"Data Out File> ", $data_out_file, "\n", "Log File> ", $log_file, "\n", "Max errors: ", $max_errors, "\n";
#
foreach $spec_line () { # Read full data file specification into memory from the spec file,
# this will be used for "lookup"
chomp ($spec_line); # remove the newline from $spec_line.
@spec_one_attribute = split(/\,/, $spec_line); # the spec file is ',' delimited
# @spec_one_attribute contains: attribute name, attribute data type (regular expression), and default value
push (@spec_all_attributes, [@spec_one_attribute]); # @spec_all_attributes contain the full data file specification
}
#
$line_number = 1; $total_errors = 0;
DATALINE: foreach $data_in_file () { # read data file, line by line
chomp ($data_in_file); # remove the newline
@data_in_attributes = split (/\|/, $data_in_file); # the data file is '|' delimited
if ($#data_in_attributes != $#spec_all_attributes) {
# number of attributes on the data line do not match with the specification
$total_errors++;
print log_file "Error ", $total_errors, ". On the data line: ", $line_number,
", # attributes: ", $#data_in_attributes + 1,
", do not match # attributes in the file specification: ", $#spec_all_attributes + 1, "\n";
last DATALINE if ($total_errors >= $max_errors); # terminate if too many errors
next; # skip data attribute type validation
}
$attribute_position = 0; @data_out_attributes = ();
foreach $attribute (@data_in_attributes) {
if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) { # validate data attribute type by performing
# lookup for the regular expression from the spec memory structure
push (@data_out_attributes, $attribute); # Correct data type, the output value is same as input value
} else {
push (@data_out_attributes, $spec_all_attributes[$attribute_position][2]);
# Bad data type, use default provided in the spec for output value
$total_errors++;
print log_file "Error ", $total_errors, ". Data type error on line: ", $line_number,
", attribute: ", $attribute_position + 1, " (", $spec_all_attributes[$attribute_position][0], ")\n";
}
last DATALINE if ($total_errors >= $max_errors); # terminate if too many errors
$attribute_position++;
}
print data_out_file join ("|", @data_out_attributes), "\n"; # the data out file is '|' delimited
} continue { # update line number counter even if the data attribute type validation is skipped
$line_number++;
}
#
if ($total_errors >= $max_errors) {
print log_file "Max error count reached: ", $total_errors, ", process terminated\n";
} else {
print log_file "Process completed with: ", $total_errors, " errors\n";
}
# End
</code>
This is the first Perl script I ever wrote. Any input to make it better, or do differently (using some existing model, for example) highly appreciated.
<p>
The simple perl script proof of concept demonstrated here, performs data type validation on the provided data file. The script creates an output file with bad data attributes substituted by default values. The data type specification and default values are read from a specification file.
<p>
Usage example:
<p>
Consider sales_payment.dat file:
<c>
A|10.50|CC|2006/12/05|10:05:15
2|12A|Cash|2006/12/05|10:12:18
3|100|12 Un|2006/12/05|10:15:23
4|.85|A1|2006/12/05|10:18:00
5|-100|B2|2006/12/05|10:20:00
6||C|2006/12/05|10:22:00
7|100||2006/12/05|10:26:00
8|200|D|2006/02/31|10:32:00
9|2006/02/31|10:33:00
10|400|E|2006/03/40|30:35:00
11|400|F|1234|10:41:AA
10|300|G|2006/02/31|10:05:15
</c>
A specification file sales_payment.spec is created; the file contains metadata - data attribute name, attribute data type defined using regular expressions, and default data value that is used when the data file contains bad data - separated by commas (','):
<c>
transaction_number,^\d+$,-1
total_basket_amount,^[-+]?[0-9]*\.?[0-9]+$,0
payment_type,^\w$,_Unknown
date,(19|20)\d\d[/](0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01]),1900/01/01
time,^([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-5][0-9])$,00:00:00
</c>
When we run data validation:
<c>
unix> perl validate_data_type.pl "sales_payment.spec" "sales_payment.dat" "sales_payment_out.dat" "sales_payment.log" 50
</c>
...the output data file is created:
<c>
-1|10.50|CC|2006/12/05|10:05:15
2|0|Cash|2006/12/05|10:12:18
3|100|_Unknown|2006/12/05|10:15:23
4|.85|A1|2006/12/05|10:18:00
5|-100|B2|2006/12/05|10:20:00
6|0|C|2006/12/05|10:22:00
7|100|_Unknown|2006/12/05|10:26:00
8|200|D|2006/02/31|10:32:00
10|400|E|1900/01/01|00:00:00
11|400|F|1900/01/01|00:00:00
10|300|G|2006/02/31|10:05:15
</c>
...along with a log file:
<c>
Spec File> sales_payment.spec
Data In File> sales_payment.dat
Data Out File> sales_payment_out.dat
Log File> sales_payment.log
Max errors: 50
Error 1. Data type error on line: 1, attribute: 1 (transaction_number)
Error 2. Data type error on line: 2, attribute: 2 (total_basket_amount)
Error 3. Data type error on line: 3, attribute: 3 (payment_type)
Error 4. Data type error on line: 6, attribute: 2 (total_basket_amount)
Error 5. Data type error on line: 7, attribute: 3 (payment_type)
Error 6. On the data line: 9, # attributes: 3, do not match # attributes in the file specification: 5
Error 7. Data type error on line: 10, attribute: 4 (date)
Error 8. Data type error on line: 10, attribute: 5 (time)
Error 9. Data type error on line: 11, attribute: 4 (date)
Error 10. Data type error on line: 11, attribute: 5 (time)
Process completed with: 10 errors
Formatted documentation available at: <a href=http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html>www.dwoptimize.com</a>
</c>
Miscellaneous
<a href=http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html>www.dwoptimize.com</a>
-7