in reply to Extracting fields

Are you evaling the output from Data::Dumper into a datastructure again so that it takes on its original form, or are you attempting to parse it as a single string?

What else will always occur in sequence with '20' and '50' so that you can detect the difference between the '20' in 'Y2001.11', and '11.2004'? In other words, what makes that first 20 special, and the second 20 not? Is there some pattern you could match? Are there a set number of space-delimited fields perhaps? There has to be some form of regularity, either in field width, field delimiters, text patterns, etc. If there's nothing regular about the data, you've got a problem.

The problem that we have is that we have no idea what criteria to use, since we know nothing about the data except for one single isolated example.


Dave

Replies are listed 'Best First'.
Re^2: Extracting fields
by kerrya (Novice) on Oct 22, 2004 at 05:48 UTC
    Apologies Dave, will be more specific.

    I am trying to re-format a medical file that describes 39 000 procedures. Each procedure has sub record types categorised as 10, 20, 30, 40 and 50.

    The problem is that rather than having all information relating to a procedure being available in one line, the procedure information has been separated into multiple lines (as below). The first row contains the record 10 number and the procedure number (00001), combined as 1000001 with record 10 information following. The next row has the record 20 number and record 20 information etc.

    1000001 01.11.199600.00.00001 A1 1 SN Y
    2001.11.200400098.0500073.5500083.35
    5001.11.1997Professional attendance being an attendance at
    5001.11.1997other than consulting rooms, by a general
    5001.11.1997practitioner on not more than 1 patient

    I have managed to get all the information related to a procedure onto one line using:

    my $text = do {local $/;<DATA>};
    $text =~ s/\n(?!\d{7})//g; # remove newline if no new record
    my %records = map {split /\s+/, $_, 2 } split /\n/, $text;
    print Dumper (\%records);

    Which gives:

    '1000001' => '01.11.199600.00.00001 A1 1 SN Y2001.11.200400098.0500073.5500083.355001.11.1997Professional attendance being an attendance at5001.11.1997other than consulting rooms, by a general5001.11.1997practitioner on not more than 1 patient',

    I am now trying the start of each sub record into it's own field.
    The position of sub group data is not consistent between procedure records. Also, the sub records (20-50) are not represented in all procedures.

      You didn't clearly express what you wanted as your final structure, so I hope this helps

      my %records; my $record; while (<DATA>) { chomp; my $line = $_; s/^(\d\d)//; my $subrecord_type = $1; if ($subrecord_type == 10) { s/^(\d+)\s+//; my $procedure_num = $1; $records{$procedure_num} = $record = []; } push(@$record, $line); } require Data::Dumper; print(Data::Dumper::Dumper(\%records)); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient output ====== $VAR1 = { '00001' => [ '1000001 01.11.199600.00.00001 A1 1 SN Y', '2001.11.200400098.0500073.5500083.35', '5001.11.1997Professional attendance being an a +ttendance at', '5001.11.1997other than consulting rooms, by a +general', '5001.11.1997practitioner on not more than 1 pa +tient', ] };

      === or maybe ===

      my %records; my $record; while (<DATA>) { chomp; my $line = $_; s/^(\d\d)//; my $subrecord_type = $1; if ($subrecord_type == 10) { s/^(\d+)\s+//; my $procedure_num = $1; $records{$procedure_num} = $record = {}; s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date1 = $1; s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date2 = $1; my ( $unknown1, $unknown2, $unknown3, $unknown4, $unknown5, ) = split(/\s+/, $_); %$record = ( date1 => $date1, date2 => $date2, unknown1 => $unknown1, unknown2 => $unknown2, unknown3 => $unknown3, unknown4 => $unknown4, unknown5 => $unknown5, 20 => [], 30 => [], 40 => [], 50 => [], ); next; } if ($subrecord_type == 20) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; my ( $unknown1, $unknown2, $unknown3, $unknown4, ) = split(/\./, $_); push(@{$record->{20}}, { date => $date, unknown1 => $unknown1, unknown2 => $unknown2, unknown3 => $unknown3, unknown4 => $unknown4, }); next; } if ($subrecord_type == 30) { # ... next; } if ($subrecord_type == 40) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; push(@{$record->{40}}, { date => $date, text => $_, }); next; } if ($subrecord_type == 50) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; push(@{$record->{50}}, { date => $date, text => $_, }); next; } } require Data::Dumper; print(Data::Dumper::Dumper(\%records)); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient output ====== $VAR1 = { '00001' => { 'date1' => '01.11.1996', 'date2' => '00.00.0000', 'unknown1' => '1' 'unknown2' => 'A1', 'unknown3' => '1', 'unknown4' => 'SN', 'unknown5' => 'Y', '20' => [ { 'date' => '01.11.2004', 'unknown1' => '00098' 'unknown2' => '0500073', 'unknown3' => '5500083', 'unknown4' => '35', } ], '30' => [], '40' => [], '50' => [ { 'date' => '01.11.1997' 'text' => 'Professional attendance +being an attendance at', }, { 'date' => '01.11.1997' 'text' => 'other than consulting ro +oms, by a general', }, { 'date' => '01.11.1997' 'text' => 'practitioner on not more + than 1 patient', } ], } };

      TMTOWTDI, but I would approach this problem using your original data set, before you stripped out the newlines. Based on the code you supplied to do that, it appears each block of procedure records begins with 7 consecutive digits, and the lines in between all begin with 4 digits. If that's the case, you can read in the data one line at a time and process each according to the line type (10, 20, etc), using the lines that begin with 7 digits to indicate the beginning of a new procedure record.

      I'm not sure how you need each line processed or what the desired format is of the parsed data, but here's one way to do it:

      use strict; use warnings; use Data::Dumper; my ( $recordkey, %data ); while( my $line = <DATA> ) { chomp $line; if( $line =~ m/^(\d{7})/ ) { # process the first line of a procedure record (type 10) $recordkey = $1; $data{$recordkey}{10} = $line; } elsif( $line =~ m/^(\d{2})\d{2}\./ ) { # process types 20, 30, 40, 50 push( @{ $data{$recordkey}{$1} }, $line ); } else { # unrecognized line! } } print Dumper( \%data ); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient 1000002 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient 1000003 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient OUTPUT $VAR1 = { '1000001' => { '50' => [ '5001.11.1997Professional attendanc +e being an attendance at', '5001.11.1997other than consulting +rooms, by a general', '5001.11.1997practitioner on not mo +re than 1 patient' ], '10' => '1000001 01.11.199600.00.00001 A1 1 S +N Y', '20' => [ '2001.11.200400098.0500073.5500083. +35' ] }, '1000002' => { '50' => [ '5001.11.1997Professional attendanc +e being an attendance at', '5001.11.1997other than consulting +rooms, by a general', '5001.11.1997practitioner on not mo +re than 1 patient' ], '10' => '1000002 01.11.199600.00.00001 A1 1 S +N Y', '20' => [ '2001.11.200400098.0500073.5500083. +35' ] }, etc...