TMTOWTDI, but I would approach this problem using your original data set, before you stripped out the newlines. Based on the code you supplied to do that, it appears that each procedure record begins with a line starting with 7 consecutive digits, and the lines in between all begin with 4 digits. If that's the case, you can read in the data one line at a time and process each line according to its type (10, 20, etc.), using the lines that begin with 7 digits to mark the beginning of a new procedure record.

I'm not sure how you need each line processed or what format you want the parsed data in, but here's one way to do it:

use strict;
use warnings;
use Data::Dumper;

my ( $recordkey, %data );
while( my $line = <DATA> )
{
    chomp $line;
    if( $line =~ m/^(\d{7})/ )
    {
        # process the first line of a procedure record (type 10)
        $recordkey = $1;
        $data{$recordkey}{10} = $line;
    }
    elsif( $line =~ m/^(\d{2})\d{2}\./ )
    {
        # process types 20, 30, 40, 50
        push( @{ $data{$recordkey}{$1} }, $line );
    }
    else
    {
        # unrecognized line!
    }
}

print Dumper( \%data );

__DATA__
1000001 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
1000002 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient
1000003 01.11.199600.00.00001 A1 1 SN Y
2001.11.200400098.0500073.5500083.35
5001.11.1997Professional attendance being an attendance at
5001.11.1997other than consulting rooms, by a general
5001.11.1997practitioner on not more than 1 patient

OUTPUT

$VAR1 = {
    '1000001' => {
        '50' => [
            '5001.11.1997Professional attendance being an attendance at',
            '5001.11.1997other than consulting rooms, by a general',
            '5001.11.1997practitioner on not more than 1 patient'
        ],
        '10' => '1000001 01.11.199600.00.00001 A1 1 SN Y',
        '20' => [
            '2001.11.200400098.0500073.5500083.35'
        ]
    },
    '1000002' => {
        '50' => [
            '5001.11.1997Professional attendance being an attendance at',
            '5001.11.1997other than consulting rooms, by a general',
            '5001.11.1997practitioner on not more than 1 patient'
        ],
        '10' => '1000002 01.11.199600.00.00001 A1 1 SN Y',
        '20' => [
            '2001.11.200400098.0500073.5500083.35'
        ]
    },
    etc...
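Once the lines are grouped by record and type, each type can be post-processed on its own. As a rough sketch only: the helper below (description_for is a made-up name, not something from your code) joins the text of a record's type-50 lines into a single description. It assumes every type-50 line is the 2-digit type, then a dd.mm.yyyy date, then free text, which is what the sample data looks like; adjust the pattern if your real data differs.

```perl
use strict;
use warnings;

# Hypothetical helper: build one description string from the type-50
# lines of a single record (one value of %data in the code above).
# Assumption: each type-50 line is "50" + dd.mm.yyyy date + free text.
sub description_for {
    my ($record) = @_;
    my @parts;
    for my $line ( @{ $record->{50} || [] } ) {
        # drop the 2-digit type and the 10-character date, keep the text
        if ( $line =~ m/^50\d{2}\.\d{2}\.\d{4}(.*)$/ ) {
            push @parts, $1;
        }
    }
    return join ' ', @parts;
}

# A sample record shaped like one entry of %data
my %record = (
    10 => '1000001 01.11.199600.00.00001 A1 1 SN Y',
    20 => [ '2001.11.200400098.0500073.5500083.35' ],
    50 => [
        '5001.11.1997Professional attendance being an attendance at',
        '5001.11.1997other than consulting rooms, by a general',
        '5001.11.1997practitioner on not more than 1 patient',
    ],
);

print description_for( \%record ), "\n";
```

The same idea should carry over to the other line types (for example, pulling the date or the numeric fields out of the type-20 lines), but the field layout there is something you would need to confirm against your format.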

In reply to Re^3: Extracting fields by bobf
in thread Extracting fields by kerrya
