Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

Particular thanks to tachyon, ikegami and Dave for assisting in getting the code together (see below).

My code is matching and processing data successfully except the following record:
1010950 01.07.200400.00.00008 M3 DN Y 2001.11.200400052.85 00044.95 5001.07.2004Aboriginal or Torres Strait Islander health 5001.07.2004service provided to a person by an eligible 5001.07.2004Aboriginal health worker
The data definition is (with data from the problem record):
Row 1 record10 = integer (length 2) (10). primary key = integer (length 5).(10950) sub item number = integer (length 3).( ) start date = dd.mm.yyyy.(01.07.2004) end date = dd.mm.yyyy.(00.00.0000) category = alphanumeric (length 3).(8 ) group = alphanumeric (length 3).(M3 ) subgroup = alphanumeric (length 3).( ) item type = text (options: S or N).(D) fee type = text (options: N or D).(N) provider type = alphanumeric (length 3).( ) new item = text (options: Y or N).( ) item change = text (options: Y or N).( ) procedure change = text (options: Y or N).( ) description change = text (options: Y or N).( ) fee change = text (options: Y or N).(Y) Row 2 record20 = integer (length 2).(20) start date = dd.mm.yyyy.(01.11.2004) fee = decimal (nnnnn.nn).(00052.85) benefit1 = decimal (nnnnn.nn).( ) benefit2 = decimal (nnnnn.nn).(00044.95) Row 5 record50 = integer (length 2).(50) start date = dd.mm.yyyy.(01.07.2004) description = alphanumeric (length 80).(Aboriginal or Torres Strait Is +lander health service)

I suspect the problem could be matching the group data.

Thanks in advance for any comments or suggestions.

The Code:
#!/usr/bin/perl -w use strict; local $/ = ''; my $date = qr/\d\d\.\d\d.\d\d\d\d/; my $fee = qr/\d\d\d\d\d\.\d\d/; my $digit3 = qr/\d\d\d|\d\s\s|\d\d\s|\s\s\d|\W{3}/; my $alpha3 = qr/\w\w\w|\w\w\s|\w\s\s|\s\w\s|\s\s\w|\s\w\w|\W{3}/; my $one = qr/ ^ \d\d (\d{5}) ($digit3) ($date) ($date) ($alpha3) ($alpha3) ($alpha3) ([SN]|\W{1}) ([ND]|\W{1}) ($alpha3) ([YN]|\W{1}) ([YN]|\W{1}) ([YN]|\W{1}) ([YN]|\W{1}) ([YN]|\W{1}) $ /x; my $two = qr/ ^ \d\d #record20 = integer (length 2).(20) ($date) #start date = dd.mm.yyyy.(01.11.1996) ($fee) #fee = decimal (nnnnn.nn).(00098.05) ($fee) #benefit1 = decimal (nnnnn.nn).(00073.55) ($fee) #benefit2 = decimal (nnnnn.nn).(00083.35) $ /x; my $gen = qr/ ^ \d\d ($date) (.*) $ /x; while(my $rec = <DATA>) { my $pkey, my @out = (); for my $line( split /[\n\r]/, $rec ) { my $type = substr $line,0,2; if ( $type == 10 ) { my @data = $line =~ m/$one/; die error( $line, $rec, $one ) unless @data; $pkey = $data[0]; $out[$type] = [$pkey, $type, @data]; } elsif ( $type == 20 ) { my @data = $line =~ m/$two/; die error( $line, $rec, $two ) unless @data; $out[$type] = [$pkey, $type, @data]; } elsif ( $line =~ m/$gen/ ) { my @data = ( $1, $2 ); if ( defined $out[$type] ) { $out[$type][-1] .= ' ' . $2; # continuing string } else { $out[$type] = [$pkey, $type, @data]; } } else { die error( $line, $rec, $gen ); } } format_output( [@out[10,20,30,40,50]], ",", 'quote' ); } sub format_output { my ($aryref, $sep, $quote ) = @_; for my $ref(@$aryref) { @$ref = map{ s/"/\\"/; qq!"$_"! } @$ref if $quote; print join $sep, @$ref; print "\n"; } } sub error { my ( $line, $rec, $re ) = @_; return "Invalid Line:\n$line\n\nRE:\n$re\n\nRecord:\n$rec\n"; } __DATA__ 1010950 01.07.200400.00.00008 M3 DN Y 2001.11.200400052.85 00044.95 5001.07.2004Aboriginal or Torres Strait Islander health 5001.07.2004service

Janitored by Arunbear - added readmore tags, as per Monastery guidelines

Replies are listed 'Best First'.
Re: Can't match record
by tachyon (Chancellor) on Oct 29, 2004 at 05:10 UTC
    Did it occur to you that the data is invalid? The original regexen were designed to both gather and validate the data. If you just want to gather the data without validating it just replace every \w or \d or with '.' This will match happily at the expense of most of the validation.
    my $date = qr/\d\d\.\d\d.\d\d\d\d/; my $fee = qr/\d\d\d\d\d\.\d\d| {8}/; my $digit3 = qr/[\d ]{3}/; my $alpha3 = qr/[\w ]{3}/; my $one = qr/ ^ \d\d (\d{5}) ($digit3) ($date) ($date) ($alpha3) ($alpha3) ($alpha3) (.) (.) ($alpha3) (.) (.) (.) (.) (.) # $ # the first line is an invalid length as well, it has trailin +g spaces /x;

    cheers

    tachyon

      tachyon, Didn't detect the trailing spaces, thanks for pointing this out.
      Your suggestion works well. Thanks for this and your previous advice. Looking forward to learning more about regular expressions.