PerlMonks  

Data type validation using regular expressions

by Anonymous Monk
on May 26, 2007 at 20:17 UTC ( [id://617692]=sourcecode )
Category: Miscellaneous
Author/Contact Info www.dwoptimize.com
Description: This is the first Perl script I ever wrote. Any input on how to make it better, or do it differently (using some existing model, for example), is highly appreciated.

The simple Perl proof-of-concept script demonstrated here performs data type validation on a provided data file. It creates an output file in which bad attribute values are replaced by default values. The data type specifications and default values are read from a specification file.

Usage example:

Consider the sales_payment.dat file:

A|10.50|CC|2006/12/05|10:05:15
2|12A|Cash|2006/12/05|10:12:18
3|100|12 Un|2006/12/05|10:15:23
4|.85|A1|2006/12/05|10:18:00
5|-100|B2|2006/12/05|10:20:00
6||C|2006/12/05|10:22:00
7|100||2006/12/05|10:26:00
8|200|D|2006/02/31|10:32:00
9|2006/02/31|10:33:00
10|400|E|2006/03/40|30:35:00
11|400|F|1234|10:41:AA
10|300|G|2006/02/31|10:05:15

A specification file sales_payment.spec is created; the file contains metadata - data attribute name, attribute data type defined using regular expressions, and default data value that is used when the data file contains bad data - separated by commas (','):

transaction_number,^\d+$,-1
total_basket_amount,^[-+]?[0-9]*\.?[0-9]+$,0
payment_type,^\w+$,_Unknown
date,(19|20)\d\d[/](0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01]),1900/01/01
time,^([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-5][0-9])$,00:00:00

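A minimal sketch of how one line of such a spec could be parsed, assuming exactly three comma-separated parts and a regular expression that contains no comma itself. The parse_spec_line helper is hypothetical and is not part of the script posted below; it passes a limit of 3 to split so that a default value may contain commas, and it precompiles the pattern with qr//.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one spec line into { name, regex, default }.
sub parse_spec_line {
    my ($line) = @_;
    chomp $line;
    # A limit of 3 keeps any commas in the default value intact.
    my ($name, $pattern, $default) = split /,/, $line, 3;
    die "Malformed spec line: $line\n" unless defined $default;
    return { name => $name, regex => qr/$pattern/, default => $default };
}

my @spec = map { parse_spec_line($_) } (
    'transaction_number,^\d+$,-1',
    'time,^([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-5][0-9])$,00:00:00',
);

for my $attr (@spec) {
    print "$attr->{name} defaults to $attr->{default}\n";
}
```

Dying on a malformed line is a design choice for this sketch; the posted script does not check the spec file at all.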
When we run data validation:
unix> perl validate_data_type.pl "sales_payment.spec" "sales_payment.dat" "sales_payment_out.dat" "sales_payment.log" 50
...the output data file is created:

-1|10.50|CC|2006/12/05|10:05:15
2|0|Cash|2006/12/05|10:12:18
3|100|_Unknown|2006/12/05|10:15:23
4|.85|A1|2006/12/05|10:18:00
5|-100|B2|2006/12/05|10:20:00
6|0|C|2006/12/05|10:22:00
7|100|_Unknown|2006/12/05|10:26:00
8|200|D|2006/02/31|10:32:00
10|400|E|1900/01/01|00:00:00
11|400|F|1900/01/01|00:00:00
10|300|G|2006/02/31|10:05:15

...along with a log file:

Spec File> sales_payment.spec
Data In File> sales_payment.dat
Data Out File> sales_payment_out.dat
Log File> sales_payment.log
Max errors: 50
Error 1. Data type error on line: 1, attribute: 1 (transaction_number)
Error 2. Data type error on line: 2, attribute: 2 (total_basket_amount)
Error 3. Data type error on line: 3, attribute: 3 (payment_type)
Error 4. Data type error on line: 6, attribute: 2 (total_basket_amount)
Error 5. Data type error on line: 7, attribute: 3 (payment_type)
Error 6. On the data line: 9, # attributes: 3, do not match # attributes in the file specification: 5
Error 7. Data type error on line: 10, attribute: 4 (date)
Error 8. Data type error on line: 10, attribute: 5 (time)
Error 9. Data type error on line: 11, attribute: 4 (date)
Error 10. Data type error on line: 11, attribute: 5 (time)
Process completed with: 10 errors

Formatted documentation available at: http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html

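The substitution rule illustrated by the files above can be sketched as a standalone function. This is only an illustration under stated assumptions: validate_line is a hypothetical helper, the spec is hard-coded instead of being read from a file, and payment_type is assumed to accept one or more word characters (^\w+$), which is what the sample output implies.

```perl
use strict;
use warnings;

# Each entry: [ attribute name, compiled regex, default value ].
# Patterns follow the sample spec; payment_type is assumed to be ^\w+$.
my @spec = (
    [ 'transaction_number',  qr/^\d+$/,                  '-1' ],
    [ 'total_basket_amount', qr/^[-+]?[0-9]*\.?[0-9]+$/, '0' ],
    [ 'payment_type',        qr/^\w+$/,                  '_Unknown' ],
    [ 'date', qr{(19|20)\d\d[/](0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01])}, '1900/01/01' ],
    [ 'time', qr/^([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-5][0-9])$/, '00:00:00' ],
);

# Hypothetical helper: returns the cleaned line followed by the names of the
# attributes that failed, or an empty list when the field count is wrong.
sub validate_line {
    my ($line) = @_;
    # A limit of -1 keeps trailing empty fields (plain split would drop them).
    my @fields = split /\|/, $line, -1;
    return if @fields != @spec;

    my (@out, @bad);
    for my $i (0 .. $#spec) {
        my ($name, $regex, $default) = @{ $spec[$i] };
        if ($fields[$i] =~ $regex) {
            push @out, $fields[$i];      # value is valid, keep it
        } else {
            push @out, $default;         # value is invalid, use the default
            push @bad, $name;
        }
    }
    return (join('|', @out), @bad);
}

my ($clean, @bad) = validate_line('2|12A|Cash|2006/12/05|10:12:18');
print "$clean (bad: @bad)\n";
```

For the second sample line this prints 2|0|Cash|2006/12/05|10:12:18 (bad: total_basket_amount), matching the output file and log shown above.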
#!/usr/local/bin/perl -w
#
# Version 0.1
# http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html
# jag.singh@dwoptimize.com
#
# 1. Read specification file that defines the data file layout:
# 1.1. attribute name
# 1.2. attribute data type using regular expressions (http://www.perl.com/pub/a/2000/11/begperl3.html)
# 1.3. default data value
#
# 2. Validate data file for data type; replace file attribute data with default value
#    if data type does not match specification
#
# 3. Create output data file with bad data values replaced by the default values
#
# 4. Create log file with results of data validation
#
# 5. Abort data validation process, if total number of errors reach max_errors
#
($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV; # read command line parameters
open (log_file, ">$log_file") or die "Can not open file $log_file, $!";
open (spec_file, "$spec_file") or die "Can not open file $spec_file, $!";
open (data_in_file, "$data_in_file") or die "Can not open file $data_in_file, $!";
open (data_out_file, ">$data_out_file") or die "Can not open file $data_out_file, $!";
print log_file "Spec File> ", $spec_file, "\n", "Data In File> ", $data_in_file, "\n",
  "Data Out File> ", $data_out_file, "\n", "Log File> ", $log_file, "\n", "Max errors: ", $max_errors, "\n";
#
foreach $spec_line (<spec_file>) { # Read full data file specification into memory from the spec file,
  # this will be used for "lookup"
  chomp ($spec_line); # remove the newline from $spec_line.
  @spec_one_attribute = split(/\,/, $spec_line); # the spec file is ',' delimited
    # @spec_one_attribute contains: attribute name, attribute data type (regular expression), and default value
  push (@spec_all_attributes, [@spec_one_attribute]); # @spec_all_attributes contain the full data file specification
}
#
$line_number = 1; $total_errors = 0;
DATALINE: foreach $data_in_file (<data_in_file>) { # read data file, line by line
  chomp ($data_in_file); # remove the newline
  @data_in_attributes = split (/\|/, $data_in_file); # the data file is '|' delimited
  if ($#data_in_attributes != $#spec_all_attributes) {
    # number of attributes on the data line do not match with the specification
    $total_errors++;
    print log_file "Error ", $total_errors, ". On the data line: ", $line_number,
      ", # attributes: ", $#data_in_attributes + 1,
      ", do not match # attributes in the file specification: ", $#spec_all_attributes + 1, "\n";
    last DATALINE if ($total_errors >= $max_errors); # terminate if too many errors
    next; # skip data attribute type validation
  }
  $attribute_position = 0; @data_out_attributes = ();
  foreach $attribute (@data_in_attributes) {
    if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) { # validate data attribute type by performing
      # lookup for the regular expression from the spec memory structure
      push (@data_out_attributes, $attribute); # Correct data type, the output value is same as input value
    } else {
      push (@data_out_attributes, $spec_all_attributes[$attribute_position][2]);
        # Bad data type, use default provided in the spec for output value
      $total_errors++;
      print log_file "Error ", $total_errors, ". Data type error on line: ", $line_number,
        ", attribute: ", $attribute_position + 1, " (", $spec_all_attributes[$attribute_position][0], ")\n";
    }
    last DATALINE if ($total_errors >= $max_errors); # terminate if too many errors
    $attribute_position++;
  }
  print data_out_file join ("|", @data_out_attributes), "\n"; # the data out file is '|' delimited
} continue { # update line number counter even if the data attribute type validation is skipped
  $line_number++;
}
#
if ($total_errors >= $max_errors) {
  print log_file "Max error count reached: ", $total_errors, ", process terminated\n";
} else {
  print log_file "Process completed with: ", $total_errors, " errors\n";
}
# End
Replies are listed 'Best First'.
Re: Data type validation using regular expressions
by liverpole (Monsignor) on May 27, 2007 at 14:00 UTC
        "Any input to make it better, or do differently..."

    First of all, when I run your program, I get errors:

    syntax error at validate.pl line 29, near "() "
    syntax error at validate.pl line 35, near "}"
    Execution of validate.pl aborted due to compilation errors.

    But I also recommend you get in the habit of using strict in your programs (you are already using -w, to turn on warnings, which is good):

    #!/usr/local/bin/perl -w
    use strict;
    use warnings; # Already on with -w, but doesn't hurt to be explicit
    #
    # Version 0.1
    #

    At which point you will get a couple dozen warnings:

    Variable "$log_file" is not imported at validate.pl line 25.
    Variable "$spec_file" is not imported at validate.pl line 26.
    Variable "$data_in_file" is not imported at validate.pl line 27.
    Variable "$data_out_file" is not imported at validate.pl line 28.
    Variable "$spec_file" is not imported at validate.pl line 29.
    Variable "$data_in_file" is not imported at validate.pl line 29.
    Variable "$data_out_file" is not imported at validate.pl line 30.
    Variable "$log_file" is not imported at validate.pl line 30.
    Global symbol "$spec_file" requires explicit package name at validate.pl line 24.
    Global symbol "$data_in_file" requires explicit package name at validate.pl line 24.
    Global symbol "$data_out_file" requires explicit package name at validate.pl line 24.
    Global symbol "$log_file" requires explicit package name at validate.pl line 24.
    Global symbol "$max_errors" requires explicit package name at validate.pl line 24.
    Global symbol "$log_file" requires explicit package name at validate.pl line 25.
    Global symbol "$log_file" requires explicit package name at validate.pl line 25.
    Global symbol "$spec_file" requires explicit package name at validate.pl line 26.
    Global symbol "$spec_file" requires explicit package name at validate.pl line 26.
    Global symbol "$data_in_file" requires explicit package name at validate.pl line 27.
    Global symbol "$data_in_file" requires explicit package name at validate.pl line 27.
    Global symbol "$data_out_file" requires explicit package name at validate.pl line 28.
    Global symbol "$data_out_file" requires explicit package name at validate.pl line 28.
    Global symbol "$spec_file" requires explicit package name at validate.pl line 29.
    Global symbol "$data_in_file" requires explicit package name at validate.pl line 29.
    Global symbol "$data_out_file" requires explicit package name at validate.pl line 30.
    Global symbol "$log_file" requires explicit package name at validate.pl line 30.
    Global symbol "$max_errors" requires explicit package name at validate.pl line 30.
    Global symbol "$spec_line" requires explicit package name at validate.pl line 32.
    syntax error at validate.pl line 32, near "() "
    validate.pl has too many errors.

    It appears that almost all of the errors in your program are caused by not declaring your variables.  In many cases this is easily fixed by using my to declare the variable.  For example:

    my ($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV;

    which will get rid of many of the errors.

    In a few cases, you may have to declare the offending variable globally, in order to have it remain in scope everywhere it's used.  One such example is @spec_all_attributes; the first time you use it is within a foreach loop, so you should declare @spec_all_attributes before that loop.
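A minimal sketch of that scoping point, using two hard-coded spec lines as a stand-in for the spec file (an assumption made so the example runs on its own). The array that must outlive the loop is declared before it, while the per-line array is declared inside it.

```perl
use strict;
use warnings;

# Declared before the loop, so it is still in scope afterwards.
my @spec_all_attributes;

for my $spec_line ('transaction_number,^\d+$,-1', 'payment_type,^\w+$,_Unknown') {
    # Declared inside the loop: a fresh array on every iteration,
    # so storing a reference to it is safe.
    my @spec_one_attribute = split /,/, $spec_line;
    push @spec_all_attributes, \@spec_one_attribute;
}

print scalar(@spec_all_attributes), " attributes loaded\n";
print "First attribute: $spec_all_attributes[0][0]\n";
```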

    When you open a file, you can avoid the warnings by doing this:

    # Note that you don't have to put quotes "..." around $spec_file
    open (my $spec_fh, $spec_file) or die "Can not open file $spec_file, $!";

    It's considered good practice to use the 3-argument form of open -- for example:

    open (my $log_fh, ">", $log_file) or die "Can not open file $log_file, $!";
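A minimal runnable sketch of the 3-argument form with lexical filehandles. The temporary file comes from the core File::Temp module so the example runs anywhere; the handle names ($out_fh, $in_fh) and the sample record are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file so the example is runnable anywhere.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: mode and file name are separate arguments, the handle is a lexical.
open(my $out_fh, '>', $path) or die "Cannot open $path for writing: $!";
print {$out_fh} "1|10.50|CC\n";
close($out_fh) or die "Cannot close $path: $!";

# Read it back line by line with while, which does not slurp the whole file.
open(my $in_fh, '<', $path) or die "Cannot open $path for reading: $!";
my @lines;
while (my $line = <$in_fh>) {
    chomp $line;
    push @lines, $line;
}
close $in_fh;

print "Read ", scalar(@lines), " line(s): $lines[0]\n";
```

Keeping the mode separate from the file name means a name that happens to start with ">" or "|" cannot change how the file is opened.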

    Additionally, you might consider printing a usage message if the number of command-line arguments isn't what's expected.  This isn't always just for others who use your program; you may come back to it months or years later, and wonder what the calling syntax was supposed to be.  For example, I'd be inclined to do something like the following:

    my $syntax = "
        syntax:  $0 <specfile> <data_in> <data_out> <logfile> <max errors>

        The purpose of this program is ...
    ";

    (my $spec_file = shift) or die $syntax;
    (my $data_in_file = shift) or die $syntax;
    (my $data_out_file = shift) or die $syntax;
    # etc...

    One final comment on the style -- it's usually considered unnecessary "noise" in a program to use comments which are obvious.  The classic example is:

    $i++; # Increment $i (duh!)

    So you may want to lighten up a little on comments which don't add anything. Your comments also make the code very hard to read (at least for me), so you may want to rethink your commenting style.  Try putting comments on their own lines (rather than making lines that already exceed 80 characters even longer), and add some whitespace where it helps readability.

    Thus, I'd suggest, instead of:

    foreach my $attribute (@data_in_attributes) {
      if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) { # validate data attribute type by performing
        # lookup for the regular expression from the spec memory structure
        push (@data_out_attributes, $attribute); # Correct data type, the output value is same as input value
      } else {
        push (@data_out_attributes, $spec_all_attributes[$attribute_position][2]);
          # Bad data type, use default provided in the spec for output value
        $total_errors++;
        print log_file "Error ", $total_errors, ". Data type error on line: ", $line_number,
          ", attribute: ", $attribute_position + 1, " (", $spec_all_attributes[$attribute_position][0], ")\n";
      }
      last DATALINE if ($total_errors >= $max_errors); # terminate if too many errors
      $attribute_position++;
    }

    that something like the following may be a lot easier to read:

    foreach my $attribute (@data_in_attributes) {

        # validate data attribute type by performing
        # lookup for the regular expression from the spec memory structure
        if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) {
            # Correct data type, the output value is same as input value
            push @data_out_attributes, $attribute;
        } else {
            # Bad data type, use default provided in the spec for output value
            push @data_out_attributes, $spec_all_attributes[$attribute_position][2];
            $total_errors++;
            print log_file "Error ", $total_errors,
                           ". Data type error on line: ", $line_number,
                           ", attribute: ", $attribute_position + 1,
                           " (", $spec_all_attributes[$attribute_position][0], ")\n";
        }

        # terminate if too many errors (<-- but perhaps this is obvious??)
        last DATALINE if ($total_errors >= $max_errors);
        $attribute_position++;
    }

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      (my $spec_file = shift) or die $syntax;
      (my $data_in_file = shift) or die $syntax;
      (my $data_out_file = shift) or die $syntax;
      # etc...

      Yuk!

      die $syntax unless @ARGV == 5;
      my ($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV;

      or

      my ($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV;
      die $syntax unless defined $max_errors;

      Do not repeat yourself!

      use warnings; # Already on with -w, but doesn't hurt to be explicit
      How is -w less explicit than use warnings;, as far as enabling warnings is concerned?

      Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

        Maybe it's just me; I feel like it's more explicit when I see it together with strict, spelled out:
        use strict;
        use warnings;

        But I confess that I still don't trust -w on Windows (even though I now know it works perfectly well), because Windows ignores the first part of the shebang line.  To test this, you can do:

        #!/usr/path/which/does/not/exist/perl

        and it'll still run Perl correctly.  Granted this isn't a reason to stop using -w, which still does work as I mentioned, but it did make me suspicious of the whole top line for quite a while.
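As a side note on the question above, one practical difference is that use warnings is lexically scoped, whereas -w switches warnings on globally. A minimal sketch of the lexical behaviour follows; the variable names and the $SIG{__WARN__} handler that counts warnings are only there for illustration.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be counted.
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

my $undef;
{
    no warnings 'uninitialized';    # silenced only inside this block
    my $quiet = "value: " . $undef;
}
my $loud = "value: " . $undef;      # warns: use of uninitialized value

print "Warnings caught: ", scalar(@caught), "\n";
```

Only the concatenation outside the inner block is reported, so a single warning is caught.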


        s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      I did figure out why you were getting those syntax errors if you copied the code from dwoptimize: the blogger was eating some of the < and > characters, even with the HTML "code" tag defined. Fixed now.
Re: Data type validation using regular expressions
by Anonymous Monk on May 27, 2007 at 18:30 UTC
    Thanks for all the feedback provided. I have made several of the recommended changes. Please review.
    liverpole: I am surprised that it would not compile for you; I did test it before posting.
    Thanks again to everyone, for your time:
    #!/usr/local/bin/perl -w
    use strict;
    #
    # http://www.dwoptimize.com/2007/05/data-type-validation-using-regular.html
    # jag.singh@dwoptimize.com
    # Version 0.2
    # feedback incorporated from http://www.perlmonks.org/?node_id=617692
    #
    my $syntax = ">
    > Syntax: $0 <spec_file> <data_in_file> <data_out_file> <log_file> <max_errors>
    >
    > This program:
    >
    > 1. Reads specification file that defines the data file layout:
    >    1.1. attribute name
    >    1.2. attribute data type using regular expressions
    >         (http://www.perl.com/pub/a/2000/11/begperl3.html)
    >    1.3. default data value
    >
    > 2. Validates input data file for data type
    >
    > 3. Creates output data file, with bad attribute data values that do not match
    >    specification replaced by the default values
    >
    > 4. Creates a log file describing data validation errors
    >
    > 5. Aborts data validation process, if total number of errors reach max_errors
    >
    > Terminating... ";
    #
    my ($spec_file, $data_in_file, $data_out_file, $log_file, $max_errors) = @ARGV;
    die $syntax, "not all the required command line parameters are provided"
      unless defined $max_errors;
    open (log_file, ">", $log_file)
      or die $syntax, "cannot open ", $log_file, ". ", $!;
    open (spec_file, $spec_file)
      or die $syntax, "cannot open ", $spec_file, ". ", $!;
    open (data_in_file, $data_in_file)
      or die $syntax, "cannot open ", $data_in_file, ". ", $!;
    open (data_out_file, ">", $data_out_file)
      or die $syntax, "cannot open ", $data_out_file, ". ", $!;
    print log_file "Spec File> ", $spec_file, "\n",
      "Data In File> ", $data_in_file, "\n",
      "Data Out File> ", $data_out_file, "\n",
      "Log File> ", $log_file, "\n",
      "Max errors: ", $max_errors, "\n";
    #
    my $spec_line; my @spec_one_attribute; my @spec_all_attributes;
    foreach $spec_line (<spec_file>) {
      # Read full data file specification into memory structure from the spec file
      # which will be used for "lookup" during data validation later
      chomp ($spec_line); # remove newline
      @spec_one_attribute = split(/\,/, $spec_line);
        # the spec file is ',' delimited
        # @spec_one_attribute contains: attribute name,
        # attribute data type (regular expression), and default value
      push (@spec_all_attributes, [@spec_one_attribute]);
        # @spec_all_attributes contain the full data file specification
    }
    #
    my $data_in_line; my @data_in_attributes;
    my $line_number = 1; my $total_errors = 0;
    DATALINE: foreach $data_in_line (<data_in_file>) {
      # read data file, line by line
      chomp ($data_in_line); # remove newline
      @data_in_attributes = split (/\|/, $data_in_line);
        # the data file is '|' delimited
      if ($#data_in_attributes != $#spec_all_attributes) {
        # number of attributes on the data line do not match with the specification
        $total_errors++;
        print log_file "Error ", $total_errors, ". On the data line: ", $line_number,
          ", # attributes: ", $#data_in_attributes + 1,
          ", do not match # attributes in the file specification: ",
          $#spec_all_attributes + 1, "\n";
        last DATALINE if ($total_errors >= $max_errors);
        next; # skip data attribute type validation
      }
      my $attribute; my $attribute_position = 0; my @data_out_attributes = ();
      foreach $attribute (@data_in_attributes) {
        if ($attribute =~ m/$spec_all_attributes[$attribute_position][1]/) {
          # validate data attribute type by performing
          # lookup for the regular expression from the spec memory structure
          push (@data_out_attributes, $attribute);
            # Correct data type, the output value is same as input value
        } else {
          push (@data_out_attributes, $spec_all_attributes[$attribute_position][2]);
            # Bad data type, use default provided in the spec for output value
          $total_errors++;
          print log_file "Error ", $total_errors, ". Data type error on line: ", $line_number,
            ", attribute: ", $attribute_position + 1,
            " (", $spec_all_attributes[$attribute_position][0], ")\n";
        }
        last DATALINE if ($total_errors >= $max_errors);
        $attribute_position++;
      }
      print data_out_file join ("|", @data_out_attributes), "\n";
        # the data out file is '|' delimited
    } continue {
      # update line number counter even if the data attribute type
      # validation is skipped
      $line_number++;
    }
    #
    if ($total_errors >= $max_errors) {
      print log_file "Max error count reached: ", $total_errors, ", process terminated\n";
    } else {
      print log_file "Process completed with: ", $total_errors, " errors\n";
    }
    # End
