kerrya has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, This is a re-submitted request (Extracting fields) with more detail provided.

I need to re-format the following data (Data View 1), provided as multi-line records, to single line
records with a primary key (Data View 2).

There are five record types (10,20,30,40,50) that are associated to one primary key.

The data definitions are (with example data from primary key 00001):

Row 1
record10 = integer (length 2) (10). primary key = integer (length 5).(00001) sub item number = integer (length 3).(225) start date = dd.mm.yyyy.(01.11.1996) end date = dd.mm.yyyy.(00.00.0000) category = alphanumeric (length 3).(BVF) group = alphanumeric (length 3).(AAA) subgroup = alphanumeric (length 3).(A65) item type = text (options: S or N).(S) fee type = text (options: N or D).(N) provider type = alphanumeric (length 3).(009) new item = text (options: Y or N).(Y) item change = text (options: Y or N).(N) procedure change = text (options: Y or N).(Y) description change = text (options: Y or N).(N) fee change = text (options: Y or N).(Y)
Row 2
record20 = integer (length 2).(20) start date = dd.mm.yyyy.(01.11.1996) fee = decimal (nnnnn.nn).(00098.05) benefit1 = decimal (nnnnn.nn).(00073.55) benefit2 = decimal (nnnnn.nn).(00083.35)
Row 3
record30 = integer (length 2).(30). start date = dd.mm.yyyy.(14.11.1996) description = alphanumeric (length 80).(This derived fee is for profes +sional attendances for GP and Specialist.)
Row 4
record40 = integer (length 2).(40) start date = dd.mm.yyyy.(23.12.1996) indicator = (Anaes.)
Row 5
record50 = integer (length 2).(50) start date = dd.mm.yyyy.(01.11.1997) description = alphanumeric (length 80).(Professional attendance being +an attendance at other than consulting rooms, by a general practition +er on not more than 1 patient.
DATA VIEW 1
100000122501.11.199600.00.0000BVFAAAA65SN009YNYNY 2001.11.199600098.0500073.5500083.35 3014.11.1996This derived fee is for professional attendances for GP an +d Specialist. 4023.12.1996(Anaes.) 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient. 100000222601.11.199600.00.0000BDGAABA66SN010YNYNY 2001.11.199600098.0500073.5500083.35 3014.11.1996This derived fee is for professional attendances by GP onl +y. 4023.12.1996(Anaes.) 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner only on not more than 1 patient.
DATA VIEW 2
00001,10,225,01.11.1996,00.00.0000,BVF,AAA,A65,S,N,009,Y,N,Y,N,Y 00001,20,01.11.1996,00098.05,00073.55,00083.35 00001,30,14.11.1996,This derived fee is for professional attendances f +or GP and Specialist. 00001,40,23.12.1996,(Anaes.) 00001,50,01.11.1997,Professional attendance being an attendance at oth +er than consulting rooms, by a general practitioner on not more than +1 patient. 00002,10,226,01.11.1996,00.00.0000,BDG,AAB,A66,S,N,010,Y,N,Y,N,Y 00002,20,01.11.1996,00098.05,00073.55,00083.35 00002,30,14.11.1996,This derived fee is for professional attendances b +y GP only. 00002,40,23.12.1996,(Anaes.) 00002,50,01.11.1997,Professional attendance being an attendance at oth +er than consulting rooms, by a general practitioner only on not more +than 1 patient.

I would be very grateful for your suggestions.

Thanks in advance.

Janitored by Arunbear - added code tags, as per Monastery guidelines

Replies are listed 'Best First'.
Re: Re-formatting multi-line records
by ikegami (Patriarch) on Oct 25, 2004 at 00:05 UTC

    We've already shown you how to split these lines into fields. If you have problems with that, ask about that and post the code you've written so far. Please don't simply ask us to do your (presumably) for-pay work for you.

    Here's a snippet of the solution I wrote:

    s/^(\d\d)//; my $record_type = $1; ... s/^(\d+{5})//; $key = $1; ... my @fields = /^ (\d\d\.\d\d\.\d\d\d\d) # start date (\d\d\.\d\d\.\d\d\d\d) # end date (.{3}) # category (.{3}) # group ... /x; ...

      I agree with ikegami, by asking questions like this without any effort on your own side, you are not doing yourself a favor.

      What you really need to do now is to read a Perl book, and understand the basics. Focus on simple regexp, split() and substr().

      But you at least did one thing right, you picked Perl. Perl is perfect for this type of task. Or maybe someone picked it for you, but still a good choice.

        Thanks pg I am new to Perl and was pointed in Perl's direction to do the task. I am not a programmer which doesn't help. I have tried using split and substr, but the data position is not consistent for all records which doesn't help. Rather than waste your time I will do the fix manually in MS Excel.
      Thanks Ikegami. I will post my attempts in the future in addition to providing a description. I am a novice so I felt my code would not be useful. I appreciate the feedback from you (and all Monks). will be making a donation.
Re: Re-formatting multi-line records
by tachyon (Chancellor) on Oct 25, 2004 at 01:04 UTC

    It is a pretty easy task in perl. As you say you are not a programmer I will leave you to decipher this.

    local $/ = ''; my $date = qr/\d\d\.\d\d.\d\d\d\d/; my $fee = qr/\d\d\d\d\d\.\d\d/; my $one = qr/ ^ \d\d #record10 = integer (length 2) (10). (\d{5}) #primary key = integer (length 5).(00001) (\d{3}) #sub item number = integer (length 3).(225) ($date) #start date = dd.mm.yyyy.(01.11.1996) ($date) #end date = dd.mm.yyyy.(00.00.0000) (\w{3}) #category = alphanumeric (length 3).(BVF) (\w{3}) #group = alphanumeric (length 3).(AAA) (\w{3}) #subgroup = alphanumeric (length 3).(A65) ([SN]) #item type = text (options: S or N).(S) ([ND]) #fee type = text (options: N or D).(N) (\w{3}) #provider type = alphanumeric (length 3).(009) ([YN]) #new item = text (options: Y or N).(Y) ([YN]) #item change = text (options: Y or N).(N) ([YN]) #procedure change = text (options: Y or N).(Y) ([YN]) #description change = text (options: Y or N).(N) ([YN]) #fee change = text (options: Y or N).(Y) $ /x; my $two = qr/ ^ \d\d #record20 = integer (length 2).(20) ($date) #start date = dd.mm.yyyy.(01.11.1996) ($fee) #fee = decimal (nnnnn.nn).(00098.05) ($fee) #benefit1 = decimal (nnnnn.nn).(00073.55) ($fee) #benefit2 = decimal (nnnnn.nn).(00083.35) $ /x; my $gen = qr/ ^ \d\d ($date) (.*) $ /x; while(my $rec = <DATA>) { my $pkey, my @out = (); for my $line( split /[\n\r]/, $rec ) { my $type = substr $line,0,2; if ( $type == 10 ) { my @data = $line =~ m/$one/; $pkey = $data[0]; @out[$type] = join ',', $pkey, $type, @data; } elsif ( $type == 20 ) { my @data = $line =~ m/$two/; @out[$type] = join ',', $pkey, $type, @data; } elsif ( $line =~ m/$gen/ ) { my @data = ( $1, $2 ); if ( defined $out[$type] ) { $out[$type] .= ' ' . $2; # continuing string } else { $out[$type] = join ',', $pkey, $type, @data; } } else { die "Invalid record $rec\n\n$line\n" } } print join "\n", @out[10,20,30,40,50]; print "\n"; } __DATA__ 100000122501.11.199600.00.0000BVFAAAA65SN009YNYNY 2001.11.199600098.0500073.5500083.35 3014.11.1996This derived fee is for professional attendances for GP an +d Specialist. 4023.12.1996(Anaes.) 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient. 100000222601.11.199600.00.0000BDGAABA66SN010YNYNY 2001.11.199600098.0500073.5500083.35 3014.11.1996This derived fee is for professional attendances by GP onl +y. 4023.12.1996(Anaes.) 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner only on not more than 1 patient.

    cheers

    tachyon

      Many thanks tachyon. I have purchased the complete perl reference and will go through your solution. In future I will post up my attempted program. Must admit I was too embaressed to post my first attempt.

        Good for you. Looks like you are working on Australian HIC data to me? Please note that you should inculde some data validation for the line types 10 and 20 - it is possible that the respective REs will not match so @data will be empty. I just assume it worked for the sake of brevity. Something as simple as die "Match failed:\n$line\n" unless @data might do.

        cheers

        tachyon

Re: Re-formatting multi-line records
by jZed (Prior) on Oct 24, 2004 at 23:01 UTC
    Please wrap your code in <code></code> tags, otherwise your square brackets get interpreted as PM links.