in reply to file parsing help

You might try it this way:

    #! perl -slw
    use strict;

    $/ = "\n1SYSTEM"; ## record separator: read one page at a time

    my @fields;
    while( my $page = <DATA> ) {
        ## Extract date
        $page =~ m[ACTUALS THRU (\d\d)/(\d\d)]
            and my $period = "${1}20${2}"
            or die "Couldn't get date";

        ## Extract body and split into lines
        ## discarding total(*) lines
        $page =~ m[COST\n(.+)]s
            and my @lines = grep{ !m[^0.\*] } split "\n", $1
            or last;

        for my $line ( @lines ) {
            my @temp;
            ## Split the line into fields. Skip the last line
            eval{
                @temp = unpack 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', $line;
            } or last;

            ## Fill in the missing fields from previous line
            $temp[ $_ ] =~ m[^\s+$] and $temp[ $_ ] = $fields[ $_ ] for 0, 1, 2, 3;

            ## output formatted appropriately
            print join '|', $period, @temp;

            ## Save fields for in-filling.
            @fields = @temp;
        }
    }
    __DATA__

With your input pasted into the DATA section, this produces:

    C:\test>583749
    082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTO5A|284.0| 1.63| 6,688.22|735.0|4.23 |7,296.52
    082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTO5D| 38.0| .22| 893.91| 90.0| .52 |2,128.73
    082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTW5T| 6.0| .03| 135.07| 6.0| .03 | 135.07
    082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5A| 3.0| .02| 107.83| 3.0| .02 | 107.83
    082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5B| | | | 21.5| .14 | 881.81
    082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5D| | | | 5.5| .03 | 194.37
    082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5G| 5.5| .03| 192.11| 22.0| .11 | 790.06
    082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTW5U| | | | 1.0| .01 | 41.20
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTG5C| | | | 17.0| .11 | 700.26
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTG5E| 15.5| .09| 557.19| 15.5| .09 | 557.19
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5A| 1.0| | 35.95| 1.0| | 35.95
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5G| | | | 1.5| .01 | 61.79
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5H| | | | 1.0| .01 | 41.20
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5T| 1.0| | 35.95| 3.0| .01 | 118.34
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5U| | | | 5.0| .03 | 205.96
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1Q|FTG5C| | | | 2.0| .01 | 70.69
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG1V|FTG5E| 64.0| .33| 2,140.75| 64.0| .33 |2,140.75
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTG5C| | | | 2.0| .01 | 70.69
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5E| | | | 1.0| .01 | 41.20
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5J| | | | 9.0| .05 | 370.75
    082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5T| 5.5| .03| 197.72| 5.5| .03 | 197.72
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTO5D|219.0| 1.14| 7,587.85|432.0|2.34 |5,578.73
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5E| | | | 1.0| .01 | 41.20
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5G| 1.0| | 35.95| 1.0| | 35.95
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5T| | | | 65.5| .37 |2,507.55
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5U| | | | 3.0| .02 | 106.00
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5V| 34.5| .19| 1,203.74| 84.5| .49 |3,103.17
    082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5W| 2.0| .01| 66.30| 6.0| .04 | 219.51
    082006|F1150|HIF|CRAIG |AAF040|FJB1A|FTB5B|145.0| .82| 5,390.09|536.0|3.05 |9,574.79
    082006|F1150|CMV|MARGARET S |AAF070|FJB1A|FTB5B| | | |138.0| .86 |4,259.44
    082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTG5E| | | | 7.0| .04 | 191.76
    082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTW5G| | | | 1.0| | 27.38
    082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTW5V| | | | 1.0| | 27.38
    082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTG5E| | | | 2.0| .01 | 54.78
    082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTG5F| | | | 4.0| .02 | 109.56
    082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5B| | | | 1.0| .01 | 31.48
    082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5G| | | | 9.0| .05 | 279.29
    082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5V| | | | 6.0| .03 | 180.76
    082006|F1150|PWC|CARL H. |AAF049|FJG1B|FTW5F|120.0| .71| 4,226.34|324.0|1.86 |0,868.58
    082006|F1150|LWR|KIM |AAF104|FJO1A|FTO5C| | | | 11.0| .06 | 422.18
    082006|F1150|LWR|KIM |AAF104|FJO1A|FTO5D| 33.0| .19| 1,363.92|127.5| .73 |4,887.53
    082006|F1150|LWR|KIM |AAF104|FJO1A|FTW5E| 5.0| .03| 254.81| 9.0| .05 | 403.18
    082006|F1150|LWR|KIM |AAF104|FJO1A|FTW5G| | | | 1.0| .01 | 37.08

which isn't formatted exactly as you asked, but you can adjust that to suit your preferences or requirements.

(Also, what happened to the KIM lines in your "desired output"?)

The main trick here is to separate the pages into the header and body, so that you can split out the fixed format lines. That allows you to process them using the right tool for the job, unpack.
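
As a minimal sketch of that header/body split on its own: the sample page text and the `split_page` name below are invented for illustration, and only the two regexes follow the code above.

```perl
use strict;
use warnings;

## Hypothetical helper: split one report page into its period and body lines.
sub split_page {
    my( $page ) = @_;

    ## Header: pull month and two-digit year out of "ACTUALS THRU mm/yy"
    my( $mm, $yy ) = $page =~ m[ACTUALS THRU (\d\d)/(\d\d)]
        or return;

    ## Body: everything after the heading line that ends in "COST"
    my( $body ) = $page =~ m[COST\n(.+)]s
        or return;

    return ( "${mm}20${yy}", split "\n", $body );
}

## Invented sample page, much shorter than a real report page
my $page = "1SYSTEM REPORT  ACTUALS THRU 08/06\n HOURS RATE COST\n line one\n line two\n";

my( $period, @lines ) = split_page( $page );
print "$period: ", scalar( @lines ), " body lines\n";
```

Once the body lines are isolated like this, each one can be handed to unpack without the header text getting in the way.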

The other simplification is to treat the fields as an array rather than as named entities, which makes the substitution process a simple loop.
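
Here is a small sketch of that in-filling loop in isolation. The rows and the `fill_down` name are made up for the example, and it tests for blank fields with `\s*` rather than the `\s+` used above, so empty strings are filled too.

```perl
use strict;
use warnings;

## Hypothetical rows: leading fields are blank when they repeat the row above.
my @rows = (
    [ 'F1150', 'ABC', 'KELLY J.', 'FTO5A' ],
    [ '     ', '   ', '        ', 'FTO5D' ],
    [ '     ', 'CDE', 'DEBORAH ', 'FTB5A' ],
);

## Copy each blank field among the first $count down from the previous row.
sub fill_down {
    my( $count, @in ) = @_;
    my( @prev, @out );
    for my $row ( @in ) {
        my @cur = @$row;
        for my $i ( 0 .. $count - 1 ) {
            $cur[ $i ] = $prev[ $i ]
                if $cur[ $i ] =~ m[^\s*$] and defined $prev[ $i ];
        }
        push @out, [ @cur ];
        @prev = @cur;   ## the filled row becomes the source for the next one
    }
    return @out;
}

print join( '|', @$_ ), "\n" for fill_down( 3, @rows );
```

Because the fields are addressed by index, changing which columns get filled is just a matter of changing the count (or the index list), with no per-field variables to maintain.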


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: file parsing help
by ctaustin (Sexton) on Nov 13, 2006 at 21:30 UTC
    Well, that is definitely a sexier approach. It seems much more flexible and robust than mine.
    As for the Kim records, that was a cut and paste mistake.

    I appreciate the code, I will take it and tweak it a bit and see what happens.

    What's going on here?
    @temp = unpack 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', $line;

      Take a look at the docs for unpack (also pack as it carries more information).

      Briefly, the format template 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9' consists of two types of format specifier.

      1. 'x' & 'xN', which skip forward over one or more characters.

        Used here to skip over the inter-column whitespace.

      2. 'aN', which 'captures' N characters.

        This is used to extract the fixed format fields.

      The results are assigned into the array @temp. Note that the length specifiers I've used are quickly approximated from the example posted; you will want to review the values carefully in the light of your full data.
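
A cut-down sketch of the two specifiers at work may help. The sample line, its column widths and the `split_line` name are assumptions for the example, not your real layout; the spaces inside the template are only there for readability, since unpack ignores whitespace between codes.

```perl
use strict;
use warnings;

## Made-up fixed-width line: 1 control character, a 5-character code,
## 2 spaces, a 3-character code, 2 spaces, then a 10-character name.
my $line = '0F1150  ABC  KELLY J.  ';

## x skips 1 character, x2 skips 2; aN captures N characters as they are.
sub split_line {
    return unpack 'x a5 x2 a3 x2 a10', $_[0];
}

print "[$_]\n" for split_line( $line );
```

Note that 'a' keeps any trailing padding in the captured field. If you would rather have it removed, 'A' strips trailing whitespace on unpack, but check that against your data before switching.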

      Unlike a regex capture, what is in the bytes captured is irrelevant; the extraction is based entirely upon the character positions (like substr). I've attempted to show how the parsing works below, but the stupid wrap 'feature' of PM means it doesn't really work.

       < 0 >  <1>  <      2      >  < 3  >  < 4 >  < 5 >    < 6 >
      xaaaaaxxaaaxxaaaaaaaaaaaaaaaxxaaaaaaxxaaaaaxxaaaaaxxxxaaaaaxxxxxxxxx
      0       CDE, DEBORAH M.       AAF103  FJB1A  FTB5A      3.0

      <  7   >    <   8   >     < 9 >               <10 >    <  11   >
      aaaaaaaaxxxxaaaaaaaaaxxxxxaaaaaxxxxxxxxxxxxxxxaaaaaxxxxaaaaaaaaa
           .02       107.83       3.0                .02       107.83
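
The comparison with substr can be checked directly. This is only a sketch: the sample line, the offsets and the `by_unpack`/`by_substr` names are invented for illustration.

```perl
use strict;
use warnings;

## Made-up line; the offsets below are chosen to match it.
my $line = '0F1150  ABC  KELLY J.  ';

## unpack walks the string by position...
sub by_unpack { return unpack 'x a5 x2 a3', $_[0] }

## ...which gives the same result as substr at the equivalent offsets.
sub by_substr { return ( substr( $_[0], 1, 5 ), substr( $_[0], 8, 3 ) ) }

print 'unpack: ', join( '|', by_unpack( $line ) ), "\n";
print 'substr: ', join( '|', by_substr( $line ) ), "\n";

## The content does not matter: punctuation is captured like anything else.
print join( '|', unpack 'a2 x a2', '1,#!?' ), "\n";
```

The unpack form scales better as the number of fields grows, because the whole layout lives in one template string instead of a list of offset/length pairs.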

        Cool approach, I will look into it further and try to fine tune it to the actual data set. Thanks for the pointers.