Extracting fields

kerrya has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Extracting fields
by davido (Cardinal) on Oct 22, 2004 at 05:07 UTC

Are you evaling the output from Data::Dumper into a datastructure again so that it takes on its original form, or are you attempting to parse it as a single string?

What else will always occur in sequence with '20' and '50' so that you can detect the difference between the '20' in 'Y2001.11', and '11.2004'? In other words, what makes that first 20 special, and the second 20 not? Is there some pattern you could match? Are there a set number of space-delimited fields perhaps? There has to be some form of regularity, either in field width, field delimiters, text patterns, etc. If there's nothing regular about the data, you've got a problem.

The problem that we have is that we have no idea what criteria to use, since we know nothing about the data except for one single isolated example.

Dave

[reply]

Re^2: Extracting fields

by kerrya (Novice) on Oct 22, 2004 at 05:48 UTC

[reply]

Re^3: Extracting fields

by ikegami (Patriarch) on Oct 22, 2004 at 06:45 UTC

You didn't clearly express what you wanted as your final structure, so I hope this helps

my %records;
my $record;

while (<DATA>) {
   chomp;

   my $line = $_;
   s/^(\d\d)//;
   my $subrecord_type = $1;

   if ($subrecord_type == 10) {
      s/^(\d+)\s+//;
      my $procedure_num = $1;
      $records{$procedure_num} = $record = [];
   }

   push(@$record, $line);
}

require Data::Dumper;
print(Data::Dumper::Dumper(\%records));

__DATA__
1000001 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient


output
======

$VAR1 = {
          '00001' => [
                       '1000001 01.11.199600.00.00001 A1 1 SN Y',
                       '2001.11.200400098.0500073.5500083.35',
                       '5001.11.1997Professional attendance being an a
+ttendance at',
                       '5001.11.1997other than consulting rooms, by a 
+general',
                       '5001.11.1997practitioner on not more than 1 pa
+tient',
                     ]
        };
[download]

=== or maybe ===

my %records;
my $record;

while (<DATA>) {
   chomp;

   my $line = $_;
   s/^(\d\d)//;
   my $subrecord_type = $1;

   if ($subrecord_type == 10) {
      s/^(\d+)\s+//;
      my $procedure_num = $1;
      $records{$procedure_num} = $record = {};

      s/^(\d\d\.\d\d\.\d\d\d\d)//;
      my $date1 = $1;

      s/^(\d\d\.\d\d\.\d\d\d\d)//;
      my $date2 = $1;

      my (
         $unknown1,
         $unknown2,
         $unknown3,
         $unknown4,
         $unknown5,
      ) = split(/\s+/, $_);

      %$record = (
         date1    => $date1,
         date2    => $date2,
         unknown1 => $unknown1,
         unknown2 => $unknown2,
         unknown3 => $unknown3,
         unknown4 => $unknown4,
         unknown5 => $unknown5,
         20       => [],
         30       => [],
         40       => [],
         50       => [],
      );

      next;
   }

   if ($subrecord_type == 20) {
      s/^(\d\d\.\d\d\.\d\d\d\d)//;
      my $date = $1;

      my (
         $unknown1,
         $unknown2,
         $unknown3,
         $unknown4,
      ) = split(/\./, $_);

      push(@{$record->{20}}, {
         date     => $date,
         unknown1 => $unknown1,
         unknown2 => $unknown2,
         unknown3 => $unknown3,
         unknown4 => $unknown4,
      });

      next;
   }

   if ($subrecord_type == 30) {
      # ...
      next;
   }

   if ($subrecord_type == 40) {
      s/^(\d\d\.\d\d\.\d\d\d\d)//;
      my $date = $1;

      push(@{$record->{40}}, {
         date => $date,
         text => $_,
      });

      next;
   }

   if ($subrecord_type == 50) {
      s/^(\d\d\.\d\d\.\d\d\d\d)//;
      my $date = $1;

      push(@{$record->{50}}, {
         date => $date,
         text => $_,
      });

      next;
   }
}

require Data::Dumper;
print(Data::Dumper::Dumper(\%records));

__DATA__
1000001 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient


output
======

$VAR1 = {
          '00001' => {
                       'date1' => '01.11.1996',
                       'date2' => '00.00.0000',
                       'unknown1' => '1'
                       'unknown2' => 'A1',
                       'unknown3' => '1',
                       'unknown4' => 'SN',
                       'unknown5' => 'Y',
                       '20' => [
                                 {
                                   'date' => '01.11.2004',
                                   'unknown1' => '00098'
                                   'unknown2' => '0500073',
                                   'unknown3' => '5500083',
                                   'unknown4' => '35',
                                 }
                               ],
                       '30' => [],
                       '40' => [],
                       '50' => [
                                 {
                                   'date' => '01.11.1997'
                                   'text' => 'Professional attendance 
+being an attendance at',
                                 },
                                 {
                                   'date' => '01.11.1997'
                                   'text' => 'other than consulting ro
+oms, by a general',
                                 },
                                 {
                                   'date' => '01.11.1997'
                                   'text' => 'practitioner on not more
+ than 1 patient',
                                 }
                               ],
                     }
        };
[download]

[reply]
[d/l]
[select]

Re^3: Extracting fields

by bobf (Monsignor) on Oct 22, 2004 at 06:23 UTC

TMTOWTDI, but I would approach this problem using your original data set, before you stripped out the newlines. Based on the code you supplied to do that, it appears each block of procedure records begins with 7 consecutive digits, and the lines in between all begin with 4 digits. If that's the case, you can read in the data one line at a time and process each according to the line type (10, 20, etc), using the lines that begin with 7 digits to indicate the beginning of a new procedure record.

I'm not sure how you need each line processed or what the desired format is of the parsed data, but here's one way to do it:

use strict;
use warnings;

use Data::Dumper;

my ( $recordkey, %data );

while( my $line = <DATA> )
{
    chomp $line;

    if( $line =~ m/^(\d{7})/ )
    {
        # process the first line of a procedure record (type 10)
        $recordkey = $1;
        $data{$recordkey}{10} = $line;
    }
    elsif( $line =~ m/^(\d{2})\d{2}\./ )
    {
        # process types 20, 30, 40, 50
        push( @{ $data{$recordkey}{$1} }, $line );
    }
    else
    {
        # unrecognized line!
    }
}

print Dumper( \%data );

__DATA__
1000001 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
1000002 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
1000003 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient

OUTPUT

$VAR1 = {
          '1000001' => {
                         '50' => [
                                   '5001.11.1997Professional attendanc
+e being an attendance at',
                                   '5001.11.1997other than consulting 
+rooms, by a general',
                                   '5001.11.1997practitioner on not mo
+re than 1 patient'
                                 ],
                         '10' => '1000001 01.11.199600.00.00001 A1 1 S
+N Y',
                         '20' => [
                                   '2001.11.200400098.0500073.5500083.
+35'
                                 ]
                       },
          '1000002' => {
                         '50' => [
                                   '5001.11.1997Professional attendanc
+e being an attendance at',
                                   '5001.11.1997other than consulting 
+rooms, by a general',
                                   '5001.11.1997practitioner on not mo
+re than 1 patient'
                                 ],
                         '10' => '1000002 01.11.199600.00.00001 A1 1 S
+N Y',
                         '20' => [
                                   '2001.11.200400098.0500073.5500083.
+35'
                                 ]
                       },
etc...
[download]

[reply]
[d/l]

Re: Extracting fields
by BrowserUk (Patriarch) on Oct 22, 2004 at 05:18 UTC

One sample is not enough, but maybe something like this?

#! perl -slw
use strict;

my %data = (
    1047633 =>
'01.12.199100.00.00003 T8 15 SN Y2001.11.200400095.8000071.8500081.454
+001.11.1994(Anaes.)5001.12.1991Metatarsal, 1 of, treatment of fractur
+e of'
);

my $re_date = qr[\d{2}\.\d{2}\.\d{4}];
my $re_float= qr[\d{5}\.\d{2}];

for my $key ( keys %data ) {
    my @fields = $data{ $key } =~ m[
        ( 20 $re_date $re_float{3} )
        ( 40 $re_date \( [^)]+ \) )
        ( 50 $re_date .* $ )
    ]x;

    print "'$_'" for @fields;
}

__END__
P:\test>401374
'2001.11.200400095.8000071.8500081.45'
'4001.11.1994(Anaes.)'
'5001.12.1991Metatarsal, 1 of, treatment of fracture of'
[download]

To explan the regex:

m[
  ## Capture, starting with '20', one date, and 3x %8.2 floats
  ( 20 $re_date $re_float{3} )

  ## Capture, starting with '40', one date, '(', non-')' to the ')'
  ( 40 $re_date \( [^)]+ \) )

  ## Capture, '50', a date, everything to the end of line.
  ( 50 $re_date .* $ )
]x; ## ignore whitespace.
[download]

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

[reply]
[d/l]
[select]

Re^2: Extracting fields

by kerrya (Novice) on Oct 22, 2004 at 06:05 UTC


          '1056070' => '01.05.200000.00.00005  I2    SN       Y2001.11
+.200400113.1500084.9000096.205001.05.2000Computed tomography - scan o
+f facial bones,5001.05.2000paranasal sinuses or both, with scan of br
+ain,5001.05.2000without intravenous contrast medium (R) (NK)5001.05.2
+000(Anaes.)',
          '1032042' => '01.12.199100.00.00003  T8 2  SN       Y2001.11
+.200401097.2000822.904001.11.1995(Anaes.)5001.12.1991Rectum and anus,
+ abdominoperineal resection of,5001.12.1991combined synchronous opera
+tion, abdominal5001.12.1991resection5001.12.1991(Assist.)',
          '1021432' => '01.11.200100.00.00003  T1011 SN       Y2001.11
+.200400084.2500063.2000071.655001.11.2001Initiation of management of 
+anaesthesia for5001.11.2001repair of arteriovenous fistula of knee or
+5001.11.2001popliteal area5001.11.2001(005)',
[download]

[reply]
[d/l]

Re^3: Extracting fields

by BrowserUk (Patriarch) on Oct 22, 2004 at 06:47 UTC

I agree with bobf, stripping the record sperarators and then trying to put them back is the wrong way to approach the problem.

Much better would be setting $/="\n10"; to slurp the combined records in one block at a time retaining the newlines. These then give you a very easy way to further break up the combined records.

Examine what is said, not who speaks.

[reply]
[d/l]
[select]

Re: Extracting fields
by ihb (Deacon) on Oct 22, 2004 at 05:10 UTC

Any roboust solution would require a lot more knowledge of the rules that the string conforms too. Can you express, in English, which elements that the string can be made up of? Most likely you'll describe a pattern ("skip the first four words, ignore the first char, take the next two, ..."), and then it's just a matter of translating it into regexes, which we can help you with.

Expressing the problem in plain English can be quite good.

Related documents:

• perlrequick - Perl regular expressions quick start

• perlretut - Perl regular expressions tutorial

• perlre - Perl regular expressions

• perlreref - Perl regular expressions reference

ihb

See perltoc if you don't know which perldoc to read!
Read argumentation in its context!

[reply]
[d/l]

Re: Extracting fields
by TedPride (Priest) on Oct 22, 2004 at 16:38 UTC

while (<DATA>) {
    chomp;
    $rec = substr($_,0,2); $_ = substr($_,2);
    if ($rec == 10) {
        $id = substr($_,0,5); $_ = substr($_,6);
    }
    $records{$id}{$rec}{'date'} = substr($_,0,10);
    if ($rec != 50 || !exists($records{$id}{50}{'data'})) {
        $records{$id}{$rec}{'data'} = substr($_,10);
    }
    else {
        $records{$id}{50}{'data'} .= ' ' . substr($_,10);
    }
}
for (sort keys %records) {
    $id = $_;
    print "$id\n";
    for (sort keys (%{$records{$id}})) {
        print "$_ : ".$records{$id}{$_}{'date'}." : ".$records{$id}{$_
+}{'data'}."\n";
    }
}

__DATA__
1000001 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
1000002 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
[download]

[reply]
[d/l]