Re: file parsing help

You might try it this way:

#! perl -slw
use strict;

$/ = "\n1SYSTEM"; ## read one page per record (pages start with "1SYSTEM")

my @fields;
while( my $page = <DATA> ) {
    ## Extract date
    $page =~ m[ACTUALS THRU (\d\d)/(\d\d)] 
        and my $period = "${1}20${2}" 
        or die "Couldn't get date";

    ## Extract body and split into lines 
    ## discarding total(*) lines 
    $page =~ m[COST\n(.+)]s 
        and my @lines = grep{!m[^0.\*] } split "\n", $1 
        or last;

    for my $line ( @lines ) {
        my @temp;
        ## Split the line into fields. Skip the last line
        eval{ 
           @temp = unpack 
               'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', 
               $line; 
        } or last;
 
        ## Fill in the missing fields from previous line
        $temp[ $_ ] =~ m[^\s+$] and $temp[ $_ ] = $fields[ $_ ] for 0, 1, 2, 3;

        ## output formatted appropriately
        print join '|', $period, @temp;

        ## Save fields for in-filling.
        @fields = @temp;
    }
}
__DATA__

With your input pasted into the DATA section, this produces:

C:\test>583749
082006|F1150|ABC|KELLY J.       |AAF113|FJO1A|FTO5A|284.0|    1.63| 6,688.22|735.0|4.23 |7,296.52
082006|F1150|ABC|KELLY J.       |AAF113|FJO1A|FTO5D| 38.0|     .22|   893.91| 90.0| .52 |2,128.73
082006|F1150|ABC|KELLY J.       |AAF113|FJO1A|FTW5T|  6.0|     .03|   135.07|  6.0| .03 |  135.07
082006|F1150|CDE|DEBORAH M.     |AAF103|FJB1A|FTB5A|  3.0|     .02|   107.83|  3.0| .02 |  107.83
082006|F1150|CDE|DEBORAH M.     |AAF103|FJB1A|FTB5B|     |        |         | 21.5| .14 |  881.81
082006|F1150|CDE|DEBORAH M.     |AAF103|FJB1A|FTB5D|     |        |         |  5.5| .03 |  194.37
082006|F1150|CDE|DEBORAH M.     |AAF103|FJB1A|FTB5G|  5.5|     .03|   192.11| 22.0| .11 |  790.06
082006|F1150|CDE|DEBORAH M.     |AAF103|FJB1A|FTW5U|     |        |         |  1.0| .01 |   41.20
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTG5C|     |        |         | 17.0| .11 |  700.26
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTG5E| 15.5|     .09|   557.19| 15.5| .09 |  557.19
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTW5A|  1.0|        |    35.95|  1.0|     |   35.95
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTW5G|     |        |         |  1.5| .01 |   61.79
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTW5H|     |        |         |  1.0| .01 |   41.20
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTW5T|  1.0|        |    35.95|  3.0| .01 |  118.34
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1N|FTW5U|     |        |         |  5.0| .03 |  205.96
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1Q|FTG5C|     |        |         |  2.0| .01 |   70.69
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG1V|FTG5E| 64.0|     .33| 2,140.75| 64.0| .33 |2,140.75
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG2A|FTG5C|     |        |         |  2.0| .01 |   70.69
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG2A|FTW5E|     |        |         |  1.0| .01 |   41.20
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG2A|FTW5J|     |        |         |  9.0| .05 |  370.75
082006|F1150|CDE|DEBORAH M.     |AAF103|FJG2A|FTW5T|  5.5|     .03|   197.72|  5.5| .03 |  197.72
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTO5D|219.0|    1.14| 7,587.85|432.0|2.34 |5,578.73
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5E|     |        |         |  1.0| .01 |   41.20
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5G|  1.0|        |    35.95|  1.0|     |   35.95
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5T|     |        |         | 65.5| .37 |2,507.55
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5U|     |        |         |  3.0| .02 |  106.00
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5V| 34.5|     .19| 1,203.74| 84.5| .49 |3,103.17
082006|F1150|CDE|DEBORAH M.     |AAF103|FJO1A|FTW5W|  2.0|     .01|    66.30|  6.0| .04 |  219.51
082006|F1150|HIF|CRAIG          |AAF040|FJB1A|FTB5B|145.0|     .82| 5,390.09|536.0|3.05 |9,574.79
082006|F1150|CMV|MARGARET S     |AAF070|FJB1A|FTB5B|     |        |         |138.0| .86 |4,259.44
082006|F1150|CMV|MARGARET S     |AAF070|FJG1N|FTG5E|     |        |         |  7.0| .04 |  191.76
082006|F1150|CMV|MARGARET S     |AAF070|FJG1N|FTW5G|     |        |         |  1.0|     |   27.38
082006|F1150|CMV|MARGARET S     |AAF070|FJG1N|FTW5V|     |        |         |  1.0|     |   27.38
082006|F1150|CMV|MARGARET S     |AAF070|FJG1Q|FTG5E|     |        |         |  2.0| .01 |   54.78
082006|F1150|CMV|MARGARET S     |AAF070|FJG1Q|FTG5F|     |        |         |  4.0| .02 |  109.56
082006|F1150|CMV|MARGARET S     |AAF070|FJG1Q|FTW5B|     |        |         |  1.0| .01 |   31.48
082006|F1150|CMV|MARGARET S     |AAF070|FJG1Q|FTW5G|     |        |         |  9.0| .05 |  279.29
082006|F1150|CMV|MARGARET S     |AAF070|FJG1Q|FTW5V|     |        |         |  6.0| .03 |  180.76
082006|F1150|PWC|CARL H.        |AAF049|FJG1B|FTW5F|120.0|     .71| 4,226.34|324.0|1.86 |0,868.58
082006|F1150|LWR|KIM            |AAF104|FJO1A|FTO5C|     |        |         | 11.0| .06 |  422.18
082006|F1150|LWR|KIM            |AAF104|FJO1A|FTO5D| 33.0|     .19| 1,363.92|127.5| .73 |4,887.53
082006|F1150|LWR|KIM            |AAF104|FJO1A|FTW5E|  5.0|     .03|   254.81|  9.0| .05 |  403.18
082006|F1150|LWR|KIM            |AAF104|FJO1A|FTW5G|     |        |         |  1.0| .01 |   37.08

which isn't formatted exactly as you asked, but you can adjust that to suit your preference/requirements.

(Also, what happened to the KIM lines in your "desired output"?)

The main trick here is to separate the pages into the header and body, so that you can split out the fixed format lines. That allows you to process them using the right tool for the job, unpack.
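
For anyone who wants to see the page-splitting step on its own, here is a minimal sketch. The two-page report text is made up for illustration (only the "1SYSTEM", "ACTUALS THRU" and "COST" markers come from the code above), and the `@pages` structure is my own invention, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical two-page report; the row contents are placeholders.
my $report = join "\n",
    '1SYSTEM REPORT   ACTUALS THRU 08/06',
    ' HOURS   COST',
    ' row one',
    ' row two',
    '1SYSTEM REPORT   ACTUALS THRU 09/06',
    ' HOURS   COST',
    ' row three',
    '';

my @pages;
{
    # Each read now returns one page: everything up to the next page marker.
    local $/ = "\n1SYSTEM";
    open my $fh, '<', \$report or die "Cannot open in-memory file: $!";

    while ( my $page = <$fh> ) {
        # Header: pull the period out of the page heading.
        my ( $mm, $yy ) = $page =~ m[ACTUALS THRU (\d\d)/(\d\d)]
            or die "Couldn't get date";

        # Body: everything after the last column heading.
        my ($body) = $page =~ m[COST\n(.+)]s or next;

        # The separator leaves a bare "1SYSTEM" fragment at the end; drop it.
        my @lines = grep { !/^1SYSTEM$/ } split /\n/, $body;

        push @pages, { period => "${mm}20${yy}", lines => \@lines };
    }
}

printf "%s has %d body line(s)\n", $_->{period}, scalar @{ $_->{lines} }
    for @pages;
```

Once each page is reduced to a period plus a list of body lines, the fixed-format lines can be handed to unpack one at a time.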

The other simplification is to treat the fields as an array rather than as named entities, which makes the in-filling of blank fields a simple loop over the first four indices.
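
A stripped-down sketch of that in-fill loop, assuming the rows have already been split into fields. The sample rows and the names `@rows`, `@prev` and `@filled` are made up for illustration, and I test with `/^\s*$/` so that empty strings count as blank too.

```perl
use strict;
use warnings;

# Hypothetical rows; blank leading fields mean "same as the row above".
my @rows = (
    [ 'F1150', 'ABC', 'KELLY J.',   'AAF113', 'FTO5A' ],
    [ '     ', '   ', '        ',   '      ', 'FTO5D' ],
    [ '     ', 'CDE', 'DEBORAH M.', 'AAF103', 'FTB5A' ],
);

my @prev;      # fields of the previous (already filled) row
my @filled;    # output records

for my $row (@rows) {
    my @cur = @$row;

    # Only the first four columns are ever inherited.
    for my $i ( 0 .. 3 ) {
        $cur[$i] = $prev[$i]
            if $cur[$i] =~ /^\s*$/ and defined $prev[$i];
    }

    push @filled, join '|', @cur;
    @prev = @cur;    # remember for the next row
}

print "$_\n" for @filled;
```

Because the previous row is saved after it has been filled, a value carries down through any number of consecutive blank rows.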

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.


Replies are listed 'Best First'.
Re^2: file parsing help by ctaustin (Sexton) on Nov 13, 2006 at 21:30 UTC
Well, that is definitely a sexier approach. It seems much more flexible and more robust than mine. As for the Kim records, that was a cut-and-paste mistake. I appreciate the code; I will take it, tweak it a bit and see what happens. What's going on here? `@temp = unpack 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', $line;`
Re^3: file parsing help by BrowserUk (Patriarch) on Nov 13, 2006 at 22:25 UTC
Take a look at the docs for unpack (also pack, as it carries more information).

Briefly, the format template `'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9'` consists of 2 types of format specifier:

'x' & 'xN', which skip forward over 1 or more characters. Used here to skip over the inter-column whitespace.

'aN', which 'captures' N characters. This is used to extract the fixed format fields.

The results are assigned into the array `@temp`.

Note that the length specifiers I've used are quickly approximated from the example posted; you will want to review the values carefully in the light of your full data.

Unlike a regex capture, what is in the bytes captured is irrelevant; it is based entirely upon the character positions (like substr).

I've attempted to show how the parsing works below, but the stupid wrap 'feature' of PM means it doesn't really work.

 < 0 >  <1>  <      2      >  < 3  >  < 4 >  < 5 >    < 6 >
xaaaaaxxaaaxxaaaaaaaaaaaaaaaxxaaaaaaxxaaaaaxxaaaaaxxxxaaaaaxxxxxxxxx
0       CDE, DEBORAH M.       AAF103  FJB1A  FTB5A      3.0

<  7   >    <   8   >     < 9 >               <10 >    <  11   >
aaaaaaaaxxxxaaaaaaaaaxxxxxaaaaaxxxxxxxxxxxxxxxaaaaaxxxxaaaaaaaaa
     .02       107.83       3.0                .02       107.83
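
To try the 'x' / 'aN' mechanics in isolation, here is a small sketch on a made-up 28-character line (one carriage-control character followed by three columns). The sample line and variable names are assumptions for illustration. It also shows 'A', which strips trailing spaces, and the fact that 'x' dies when it runs off the end of the string, which appears to be what the `eval { ... } or last` in the script relies on to skip the short final line.

```perl
use strict;
use warnings;

# Made-up line: control char, 5-char code, 3-char code, 15-char name.
my $line = '0F1150  CDE  DEBORAH M.     ';

# Whitespace inside a template is ignored, so it can be spaced for reading.
# x = skip 1 char, xN = skip N chars, aN = take N chars exactly as they are.
my @f = unpack 'x a5 x2 a3 x2 a15', $line;
print join( '|', @f ), "\n";

# 'A' behaves like 'a' but strips trailing spaces from each field.
my @trimmed = unpack 'x A5 x2 A3 x2 A15', $line;
print join( '|', @trimmed ), "\n";

# Skipping past the end of the string is fatal, so guard it with eval.
my $ok = eval { my @short = unpack 'x40 a5', $line; 1 };
print "short line rejected\n" unless $ok;
```

If you would rather not trim in a second pass, swapping 'a' for 'A' in the long template is one option, though the in-fill test for blank fields would then need to match empty strings rather than runs of spaces.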
Re^4: file parsing help by ctaustin (Sexton) on Nov 13, 2006 at 22:32 UTC
Cool approach, I will look into it further and try to fine tune it to the actual data set. Thanks for the pointers.
Re^3: file parsing help by planetscape (Chancellor) on Nov 13, 2006 at 22:53 UTC
In addition to BrowserUk's suggestion re: pack and unpack, I also recommend Pack/Unpack Tutorial (aka How the System Stores Data) and perlpacktut. HTH, planetscape