kerrya has asked for the wisdom of the Perl Monks concerning the following question:

I need to separate the following output from Data Dumper into fields:

'1047633' => '01.12.199100.00.00003 T8 15 SN Y2001.11.200400095.8000071.8500081.454001.11.1994(Anaes.)5001.12.1991Metatarsal, 1 of, treatment of fracture of',

The start of each field is highlighted in bold.

As the location of each field marker changes with each record I have been unable to get substr to work.

Would be grateful for any help.

Replies are listed 'Best First'.
Re: Extracting fields
by davido (Cardinal) on Oct 22, 2004 at 05:07 UTC

    Are you evaling the output from Data::Dumper into a datastructure again so that it takes on its original form, or are you attempting to parse it as a single string?

    What else will always occur in sequence with '20' and '50' so that you can detect the difference between the '20' in 'Y2001.11', and '11.2004'? In other words, what makes that first 20 special, and the second 20 not? Is there some pattern you could match? Are there a set number of space-delimited fields perhaps? There has to be some form of regularity, either in field width, field delimiters, text patterns, etc. If there's nothing regular about the data, you've got a problem.

    The problem that we have is that we have no idea what criteria to use, since we know nothing about the data except for one single isolated example.


    Dave

      Apologies Dave, will be more specific.

      I am trying to re-format a medical file that describes 39 000 procedures. Each procedure has sub record types categorised as 10, 20, 30, 40 and 50.

      The problem is that rather than having all information relating to a procedure being available in one line, the procedure information has been separated into multiple lines (as below). The first row contains the record 10 number and the procedure number (00001), combined as 1000001 with record 10 information following. The next row has the record 20 number and record 20 information etc.

      1000001 01.11.199600.00.00001 A1 1 SN Y
      2001.11.200400098.0500073.5500083.35
      5001.11.1997Professional attendance being an attendance at
      5001.11.1997other than consulting rooms, by a general
      5001.11.1997practitioner on not more than 1 patient

      I have managed to get all the information related to a procedure onto one line using:

      my $text = do {local $/;<DATA>};
      $text =~ s/\n(?!\d{7})//g; # remove newline if no new record
      my %records = map {split /\s+/, $_, 2 } split /\n/, $text;
      print Dumper (\%records);

      Which gives:

      '1000001' => '01.11.199600.00.00001 A1 1 SN Y2001.11.200400098.0500073.5500083.355001.11.1997Professional attendance being an attendance at5001.11.1997other than consulting rooms, by a general5001.11.1997practitioner on not more than 1 patient',

      I am now trying the start of each sub record into it's own field.
      The position of sub group data is not consistent between procedure records. Also, the sub records (20-50) are not represented in all procedures.

        You didn't clearly express what you wanted as your final structure, so I hope this helps

        my %records; my $record; while (<DATA>) { chomp; my $line = $_; s/^(\d\d)//; my $subrecord_type = $1; if ($subrecord_type == 10) { s/^(\d+)\s+//; my $procedure_num = $1; $records{$procedure_num} = $record = []; } push(@$record, $line); } require Data::Dumper; print(Data::Dumper::Dumper(\%records)); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient output ====== $VAR1 = { '00001' => [ '1000001 01.11.199600.00.00001 A1 1 SN Y', '2001.11.200400098.0500073.5500083.35', '5001.11.1997Professional attendance being an a +ttendance at', '5001.11.1997other than consulting rooms, by a +general', '5001.11.1997practitioner on not more than 1 pa +tient', ] };

        === or maybe ===

        my %records; my $record; while (<DATA>) { chomp; my $line = $_; s/^(\d\d)//; my $subrecord_type = $1; if ($subrecord_type == 10) { s/^(\d+)\s+//; my $procedure_num = $1; $records{$procedure_num} = $record = {}; s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date1 = $1; s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date2 = $1; my ( $unknown1, $unknown2, $unknown3, $unknown4, $unknown5, ) = split(/\s+/, $_); %$record = ( date1 => $date1, date2 => $date2, unknown1 => $unknown1, unknown2 => $unknown2, unknown3 => $unknown3, unknown4 => $unknown4, unknown5 => $unknown5, 20 => [], 30 => [], 40 => [], 50 => [], ); next; } if ($subrecord_type == 20) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; my ( $unknown1, $unknown2, $unknown3, $unknown4, ) = split(/\./, $_); push(@{$record->{20}}, { date => $date, unknown1 => $unknown1, unknown2 => $unknown2, unknown3 => $unknown3, unknown4 => $unknown4, }); next; } if ($subrecord_type == 30) { # ... next; } if ($subrecord_type == 40) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; push(@{$record->{40}}, { date => $date, text => $_, }); next; } if ($subrecord_type == 50) { s/^(\d\d\.\d\d\.\d\d\d\d)//; my $date = $1; push(@{$record->{50}}, { date => $date, text => $_, }); next; } } require Data::Dumper; print(Data::Dumper::Dumper(\%records)); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient output ====== $VAR1 = { '00001' => { 'date1' => '01.11.1996', 'date2' => '00.00.0000', 'unknown1' => '1' 'unknown2' => 'A1', 'unknown3' => '1', 'unknown4' => 'SN', 'unknown5' => 'Y', '20' => [ { 'date' => '01.11.2004', 'unknown1' => '00098' 'unknown2' => '0500073', 'unknown3' => '5500083', 'unknown4' => '35', } ], '30' => [], '40' => [], '50' => [ { 'date' => '01.11.1997' 'text' => 'Professional attendance +being an attendance at', }, { 'date' => '01.11.1997' 'text' => 'other than consulting ro +oms, by a general', }, { 'date' => '01.11.1997' 'text' => 'practitioner on not more + than 1 patient', } ], } };

        TMTOWTDI, but I would approach this problem using your original data set, before you stripped out the newlines. Based on the code you supplied to do that, it appears each block of procedure records begins with 7 consecutive digits, and the lines in between all begin with 4 digits. If that's the case, you can read in the data one line at a time and process each according to the line type (10, 20, etc), using the lines that begin with 7 digits to indicate the beginning of a new procedure record.

        I'm not sure how you need each line processed or what the desired format is of the parsed data, but here's one way to do it:

        use strict; use warnings; use Data::Dumper; my ( $recordkey, %data ); while( my $line = <DATA> ) { chomp $line; if( $line =~ m/^(\d{7})/ ) { # process the first line of a procedure record (type 10) $recordkey = $1; $data{$recordkey}{10} = $line; } elsif( $line =~ m/^(\d{2})\d{2}\./ ) { # process types 20, 30, 40, 50 push( @{ $data{$recordkey}{$1} }, $line ); } else { # unrecognized line! } } print Dumper( \%data ); __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient 1000002 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient 1000003 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient OUTPUT $VAR1 = { '1000001' => { '50' => [ '5001.11.1997Professional attendanc +e being an attendance at', '5001.11.1997other than consulting +rooms, by a general', '5001.11.1997practitioner on not mo +re than 1 patient' ], '10' => '1000001 01.11.199600.00.00001 A1 1 S +N Y', '20' => [ '2001.11.200400098.0500073.5500083. +35' ] }, '1000002' => { '50' => [ '5001.11.1997Professional attendanc +e being an attendance at', '5001.11.1997other than consulting +rooms, by a general', '5001.11.1997practitioner on not mo +re than 1 patient' ], '10' => '1000002 01.11.199600.00.00001 A1 1 S +N Y', '20' => [ '2001.11.200400098.0500073.5500083. +35' ] }, etc...
Re: Extracting fields
by BrowserUk (Patriarch) on Oct 22, 2004 at 05:18 UTC

    One sample is not enough, but maybe something like this?

    #! perl -slw use strict; my %data = ( 1047633 => '01.12.199100.00.00003 T8 15 SN Y2001.11.200400095.8000071.8500081.454 +001.11.1994(Anaes.)5001.12.1991Metatarsal, 1 of, treatment of fractur +e of' ); my $re_date = qr[\d{2}\.\d{2}\.\d{4}]; my $re_float= qr[\d{5}\.\d{2}]; for my $key ( keys %data ) { my @fields = $data{ $key } =~ m[ ( 20 $re_date $re_float{3} ) ( 40 $re_date \( [^)]+ \) ) ( 50 $re_date .* $ ) ]x; print "'$_'" for @fields; } __END__ P:\test>401374 '2001.11.200400095.8000071.8500081.45' '4001.11.1994(Anaes.)' '5001.12.1991Metatarsal, 1 of, treatment of fracture of'

    To explan the regex:

    m[ ## Capture, starting with '20', one date, and 3x %8.2 floats ( 20 $re_date $re_float{3} ) ## Capture, starting with '40', one date, '(', non-')' to the ')' ( 40 $re_date \( [^)]+ \) ) ## Capture, '50', a date, everything to the end of line. ( 50 $re_date .* $ ) ]x; ## ignore whitespace.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Thanks for your suggestion.

      I have included 3 records below. Please see my response to Dave's feedback for a more complete description of the problem.

      Appreciate your help.

      '1056070' => '01.05.200000.00.00005 I2 SN Y2001.11 +.200400113.1500084.9000096.205001.05.2000Computed tomography - scan o +f facial bones,5001.05.2000paranasal sinuses or both, with scan of br +ain,5001.05.2000without intravenous contrast medium (R) (NK)5001.05.2 +000(Anaes.)', '1032042' => '01.12.199100.00.00003 T8 2 SN Y2001.11 +.200401097.2000822.904001.11.1995(Anaes.)5001.12.1991Rectum and anus, + abdominoperineal resection of,5001.12.1991combined synchronous opera +tion, abdominal5001.12.1991resection5001.12.1991(Assist.)', '1021432' => '01.11.200100.00.00003 T1011 SN Y2001.11 +.200400084.2500063.2000071.655001.11.2001Initiation of management of +anaesthesia for5001.11.2001repair of arteriovenous fistula of knee or +5001.11.2001popliteal area5001.11.2001(005)',

        I agree with bobf, stripping the record sperarators and then trying to put them back is the wrong way to approach the problem.

        Much better would be setting $/="\n10"; to slurp the combined records in one block at a time retaining the newlines. These then give you a very easy way to further break up the combined records.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Extracting fields
by ihb (Deacon) on Oct 22, 2004 at 05:10 UTC

    Any roboust solution would require a lot more knowledge of the rules that the string conforms too. Can you express, in English, which elements that the string can be made up of? Most likely you'll describe a pattern ("skip the first four words, ignore the first char, take the next two, ..."), and then it's just a matter of translating it into regexes, which we can help you with.

    Expressing the problem in plain English can be quite good.

    Related documents:
    perlrequick - Perl regular expressions quick start
    perlretut - Perl regular expressions tutorial
    perlre - Perl regular expressions
    perlreref - Perl regular expressions reference

    ihb

    See perltoc if you don't know which perldoc to read!
    Read argumentation in its context!

Re: Extracting fields
by TedPride (Priest) on Oct 22, 2004 at 16:38 UTC
    Try the following.
    while (<DATA>) { chomp; $rec = substr($_,0,2); $_ = substr($_,2); if ($rec == 10) { $id = substr($_,0,5); $_ = substr($_,6); } $records{$id}{$rec}{'date'} = substr($_,0,10); if ($rec != 50 || !exists($records{$id}{50}{'data'})) { $records{$id}{$rec}{'data'} = substr($_,10); } else { $records{$id}{50}{'data'} .= ' ' . substr($_,10); } } for (sort keys %records) { $id = $_; print "$id\n"; for (sort keys (%{$records{$id}})) { print "$_ : ".$records{$id}{$_}{'date'}." : ".$records{$id}{$_ +}{'data'}."\n"; } } __DATA__ 1000001 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient 1000002 01.11.199600.00.00001 A1 1 SN Y 2001.11.200400098.0500073.5500083.35 5001.11.1997Professional attendance being an attendance at 5001.11.1997other than consulting rooms, by a general 5001.11.1997practitioner on not more than 1 patient