ctaustin has asked for the wisdom of the Perl Monks concerning the following question:

I have a mainframe generated flat file that I need to parse for loading into a database via SQL*Loader. The original file looks like:

1SYSTEM: TMS1 STATE OF DISMA +Y DATE: 10/23/06 REPORT: B155A761 DEPARTMENT OF C +ONFUSION TIME: 15:05 EMPL ID: AAF*** ACTUAL FTE'S/TOTAL COST +S BY EMPL NAME PAGE: 1 SPR IDX: F**** FOR PERIOD 07/06 T +HRU 08/06 JOB/TASK: F****/F**** 0PROGRAM (F00) DEPARTMENT OF CONFUSION + ACTUALS THRU 08/06 0SUPER EMPL ---- MONTHS 07/0 +6 THRU 08/06 ------ ---------- BIENNIUM TO DATE ----------- INDEX NAME ID JOB TASK REG OT ST +AFF MO COST REG OT STAFF MO COST 0F1150 ABC, KELLY J. AAF113 FJO1A FTO5A 284.0 + 1.63 6,688.22 735.0 4.23 17,296.52 FJO1A FTO5D 38.0 + .22 893.91 90.0 .52 2,128.73 FJO1A FTW5T 6.0 + .03 135.07 6.0 .03 135.07 0 * ABC, KELLY J. AAF113 328.0 + 1.88 7,717.20 831.0 4.78 19,560.32 0 CDE, DEBORAH M. AAF103 FJB1A FTB5A 3.0 + .02 107.83 3.0 .02 107.83 FJB1A FTB5B + 21.5 .14 881.81 FJB1A FTB5D + 5.5 .03 194.37 FJB1A FTB5G 5.5 + .03 192.11 22.0 .11 790.06 FJB1A FTW5U + 1.0 .01 41.20 FJG1N FTG5C + 17.0 .11 700.26 FJG1N FTG5E 15.5 + .09 557.19 15.5 .09 557.19 FJG1N FTW5A 1.0 + 35.95 1.0 35.95 FJG1N FTW5G + 1.5 .01 61.79 FJG1N FTW5H + 1.0 .01 41.20 FJG1N FTW5T 1.0 + 35.95 3.0 .01 118.34 FJG1N FTW5U + 5.0 .03 205.96 FJG1Q FTG5C + 2.0 .01 70.69 FJG1V FTG5E 64.0 + .33 2,140.75 64.0 .33 2,140.75 FJG2A FTG5C + 2.0 .01 70.69 FJG2A FTW5E + 1.0 .01 41.20 FJG2A FTW5J + 9.0 .05 370.75 FJG2A FTW5T 5.5 + .03 197.72 5.5 .03 197.72 FJO1A FTO5D 219.0 + 1.14 7,587.85 432.0 2.34 15,578.73 FJO1A FTW5E + 1.0 .01 41.20 FJO1A FTW5G 1.0 + 35.95 1.0 35.95 FJO1A FTW5T + 65.5 .37 2,507.55 FJO1A FTW5U + 3.0 .02 106.00 FJO1A FTW5V 34.5 + .19 1,203.74 84.5 .49 3,103.17 FJO1A FTW5W 2.0 + .01 66.30 6.0 .04 219.51 0 * CDE, DEBORAH M. AAF103 352.0 + 1.84 12,161.34 773.5 4.28 28,219.87 0 HIF, CRAIG AAF040 FJB1A FTB5B 145.0 + .82 5,390.09 536.0 3.05 19,574.79 0 CMV, MARGARET S AAF070 FJB1A FTB5B + 138.0 .86 4,259.44 FJG1N FTG5E + 7.0 .04 191.76 FJG1N FTW5G + 1.0 27.38 FJG1N FTW5V + 1.0 27.38 FJG1Q FTG5E + 2.0 .01 54.78 FJG1Q FTG5F + 4.0 .02 109.56 FJG1Q FTW5B + 1.0 .01 31.48 1SYSTEM: TMS1 STATE OF DISMA +Y DATE: 10/23/06 REPORT: B155A761 DEPARTMENT OF C +ONFUSION TIME: 15:05 EMPL ID: AAF*** ACTUAL FTE'S/TOTAL COST +S BY EMPL NAME PAGE: 2 SPR IDX: F**** FOR PERIOD 07/06 T +HRU 08/06 JOB/TASK: F****/F**** 0PROGRAM (F00) DEPARTMENT OF CONFUSION + ACTUALS THRU 08/06 0SUPER EMPL ---- MONTHS 07/0 +6 THRU 08/06 ------ ---------- BIENNIUM TO DATE ----------- INDEX NAME ID JOB TASK REG OT ST +AFF MO COST REG OT STAFF MO COST 0F1150 CMV, MARGARET S AAF070 FJG1Q FTW5G + 9.0 .05 279.29 FJG1Q FTW5V + 6.0 .03 180.76 0 * CMV, MARGARET S AAF070 + 169.0 1.02 5,161.83 0 PWC, CARL H. AAF049 FJG1B FTW5F 120.0 + .71 4,226.34 324.0 1.86 10,868.58 0 LWR, KIM AAF104 FJO1A FTO5C + 11.0 .06 422.18 FJO1A FTO5D 33.0 + .19 1,363.92 127.5 .73 4,887.53 FJO1A FTW5E 5.0 + .03 254.81 9.0 .05 403.18 FJO1A FTW5G + 1.0 .01 37.08 </pre> I am looking to make the output file look like the following, which fi +lls in the data holes. <pre> 082006 F1150 ABC, KELLY J. AAF113 FJO1A FTO5A 284.0 + 1.63 6,688.22 735.0 4.23 17,296.52 082006 F1150 ABC, KELLY J. AAF113 FJO1A FTO5D 38.0 + .22 893.91 90.0 .52 2,128.73 082006 F1150 ABC, KELLY J. AAF113 FJO1A FTW5T 6.0 + .03 135.07 6.0 .03 135.07 082006 CDE, DEBORAH M. AAF103 FJB1A FTB5A 3.0 + .02 107.83 3.0 .02 107.83 082006 F1150 CDE, DEBORAH M. AAF103 FJB1A FTB5B + 21.5 .14 881.81 082006 F1150 CDE, DEBORAH M. AAF103 FJB1A FTB5D + 5.5 .03 194.37 082006 F1150 CDE, DEBORAH M. AAF103 FJB1A FTB5G 5.5 + .03 192.11 22.0 .11 790.06 082006 F1150 CDE, DEBORAH M. AAF103 FJB1A FTW5U + 1.0 .01 41.20 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTG5C + 17.0 .11 700.26 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTG5E 15.5 + .09 557.19 15.5 .09 557.19 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTW5A 1.0 + 35.95 1.0 35.95 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTW5G + 1.5 .01 61.79 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTW5H + 1.0 .01 41.20 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTW5T 1.0 + 35.95 3.0 .01 118.34 082006 F1150 CDE, DEBORAH M. AAF103 FJG1N FTW5U + 5.0 .03 205.96 082006 F1150 CDE, DEBORAH M. AAF103 FJG1Q FTG5C + 2.0 .01 70.69 082006 F1150 CDE, DEBORAH M. AAF103 FJG1V FTG5E 64.0 + .33 2,140.75 64.0 .33 2,140.75 082006 F1150 CDE, DEBORAH M. AAF103 FJG2A FTG5C + 2.0 .01 70.69 082006 F1150 CDE, DEBORAH M. AAF103 FJG2A FTW5E + 1.0 .01 41.20 082006 F1150 CDE, DEBORAH M. AAF103 FJG2A FTW5J + 9.0 .05 370.75 082006 F1150 CDE, DEBORAH M. AAF103 FJG2A FTW5T 5.5 + .03 197.72 5.5 .03 197.72 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTO5D 219.0 + 1.14 7,587.85 432.0 2.34 15,578.73 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5E + 1.0 .01 41.20 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5G 1.0 + 35.95 1.0 35.95 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5T + 65.5 .37 2,507.55 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5U + 3.0 .02 106.00 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5V 34.5 + .19 1,203.74 84.5 .49 3,103.17 082006 F1150 CDE, DEBORAH M. AAF103 FJO1A FTW5W 2.0 + .01 66.30 6.0 .04 219.51 082006 HIF, CRAIG AAF040 FJB1A FTB5B 145.0 + .82 5,390.09 536.0 3.05 19,574.79 082006 CMV, MARGARET S AAF070 FJB1A FTB5B + 138.0 .86 4,259.44 082006 F1150 CMV, MARGARET S AAF070 FJG1N FTG5E + 7.0 .04 191.76 082006 F1150 CMV, MARGARET S AAF070 FJG1N FTW5G + 1.0 27.38

The problem with this output is that I am missing some data, there are holes in the second field and in the fourth field sometimes as well.
I think I have narrowed it down to the instances when the name changes, although I don't know why.

Really I would like to have this be a pipe delimited file since it is easier to load, however I found that the blank fields were being skipped
which made the numbers shift over into the wrong position.

Here is how I am getting the output I have now.

open (IFILE,$pfile)||&log_it("Unable To Open Input File $pfile $!"); while (my $line=<IFILE>) { $cnt = $cnt+1; if (substr($line,0,7) eq "1SYSTEM") { $header=1; } if ($header <=8) { #skip the 8 lines of the header $header = $header +1; } else { if ($line=~/\x00/||$line=~/\x1A/||$line=~/\x2A/) { #skip the lines that have the hex value of 00, 1A #and skip the summary lines - marked with a * $x=x; } #these lines will have usable data else { #take the first chunck of the line, which is the S +IC, username and ID #break it up into its components by making it pipe + | delimited $sub_line=substr($line,0,38); $sub_line=~s/\s+\s+/\|/g; #get the positions of the pipes $a=index($sub_line,'|',1); $b=index($sub_line,'|',8); $c=index($sub_line,'|',20); #assign the values my $index=substr($sub_line,0,$a); my $username=substr($sub_line,$a+1,$b-($a+1)); my $id=substr($sub_line,$b+1,$c-($b+1)); my $new_index; #get rid of leading 0 in index value $index=~s/^0//; #print "start '$index' ($prev_user,$username): "; # if the index code is 0 or blank, then use # the last known SIC value, padded to 7 charac +ters if ($index eq '' ) { $index=sprintf("%-*s",7,$prev_index); #print "Null Index -> $index\n"; } else { $index=sprintf("%-*s",7,$index); } # if the username is blank then use the last # known username value and pad it to 22 charac +ters # and use the last known user ID padded to 8 c +haracters if (trim($username) eq '') { $username=sprintf("%-*s",22,$prev_user); if ($id eq "") { $id=sprintf("%-*s",8,$prev_id); } else { $id=sprintf("%-*s",8,$prev_id); } # remove the initial white space from the "bla +nk" lines # then reconstruct the output line $line=ltrim($line); $line=~s/^0//; $prev_user=$username; $prev_index=$prev_index; $prev_id=$id; print OFILE $period." ".$index.$username.$id.$ +line } #if the index and username were filled in, the + assign these values #to the previous variables and print the recor +d else { $line=~s/^0//; $prev_user=$username; $prev_index=$index; #$prev_index=~s/^0//; $prev_id=$id; print OFILE $period." ".$line } } } } #end while loop

I am guessing that there is something small that I need to fix. Then again, I am also guessing
that there is a much easier way to achieve this. Any suggestions would be appreciated.

Edited (davorg): replaced pre tags with code tags

Replies are listed 'Best First'.
Re: file parsing help
by BrowserUk (Patriarch) on Nov 13, 2006 at 18:12 UTC

    You might try it this way:

    #! perl -slw use strict; $/ = "\n1SYSTEM"; ## para mode my @fields; while( my $page = <DATA> ) { ## Extract date $page =~ m[ACTUALS THRU (\d\d)/(\d\d)] and my $period = "${1}20${2}" or die "Couldn't get date"; ## Extract body and split into lines ## discarding total(*) lines $page =~ m[COST\n(.+)]s and my @lines = grep{!m[^0.\*] } split "\n", $1 or last; for my $line ( @lines ) { my @temp; ## Split the line into fields. Skip the last line eval{ @temp = unpack 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', $line; } or last; ## Fill in th missing fields from previous line $temp[ $_ ] =~ m[^\s+$] and $temp[ $_ ] = $fields[ $_ ] for 0, + 1, 2, 3; ## output formatted appropriately print join '|', $period, @temp; ## Save fields for in-filling. @fields = @temp; } } __DATA__

    With Your input pasted into the DATA section, this produces:

    C:\test>583749 082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTO5A|284.0| 1.63| 6, +688.22|735.0|4.23 |7,296.52 082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTO5D| 38.0| .22| +893.91| 90.0| .52 |2,128.73 082006|F1150|ABC|KELLY J. |AAF113|FJO1A|FTW5T| 6.0| .03| +135.07| 6.0| .03 | 135.07 082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5A| 3.0| .02| +107.83| 3.0| .02 | 107.83 082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5B| | | + | 21.5| .14 | 881.81 082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5D| | | + | 5.5| .03 | 194.37 082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTB5G| 5.5| .03| +192.11| 22.0| .11 | 790.06 082006|F1150|CDE|DEBORAH M. |AAF103|FJB1A|FTW5U| | | + | 1.0| .01 | 41.20 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTG5C| | | + | 17.0| .11 | 700.26 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTG5E| 15.5| .09| +557.19| 15.5| .09 | 557.19 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5A| 1.0| | + 35.95| 1.0| | 35.95 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5G| | | + | 1.5| .01 | 61.79 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5H| | | + | 1.0| .01 | 41.20 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5T| 1.0| | + 35.95| 3.0| .01 | 118.34 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1N|FTW5U| | | + | 5.0| .03 | 205.96 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1Q|FTG5C| | | + | 2.0| .01 | 70.69 082006|F1150|CDE|DEBORAH M. |AAF103|FJG1V|FTG5E| 64.0| .33| 2, +140.75| 64.0| .33 |2,140.75 082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTG5C| | | + | 2.0| .01 | 70.69 082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5E| | | + | 1.0| .01 | 41.20 082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5J| | | + | 9.0| .05 | 370.75 082006|F1150|CDE|DEBORAH M. |AAF103|FJG2A|FTW5T| 5.5| .03| +197.72| 5.5| .03 | 197.72 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTO5D|219.0| 1.14| 7, +587.85|432.0|2.34 |5,578.73 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5E| | | + | 1.0| .01 | 41.20 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5G| 1.0| | + 35.95| 1.0| | 35.95 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5T| | | + | 65.5| .37 |2,507.55 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5U| | | + | 3.0| .02 | 106.00 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5V| 34.5| .19| 1, +203.74| 84.5| .49 |3,103.17 082006|F1150|CDE|DEBORAH M. |AAF103|FJO1A|FTW5W| 2.0| .01| + 66.30| 6.0| .04 | 219.51 082006|F1150|HIF|CRAIG |AAF040|FJB1A|FTB5B|145.0| .82| 5, +390.09|536.0|3.05 |9,574.79 082006|F1150|CMV|MARGARET S |AAF070|FJB1A|FTB5B| | | + |138.0| .86 |4,259.44 082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTG5E| | | + | 7.0| .04 | 191.76 082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTW5G| | | + | 1.0| | 27.38 082006|F1150|CMV|MARGARET S |AAF070|FJG1N|FTW5V| | | + | 1.0| | 27.38 082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTG5E| | | + | 2.0| .01 | 54.78 082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTG5F| | | + | 4.0| .02 | 109.56 082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5B| | | + | 1.0| .01 | 31.48 082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5G| | | + | 9.0| .05 | 279.29 082006|F1150|CMV|MARGARET S |AAF070|FJG1Q|FTW5V| | | + | 6.0| .03 | 180.76 082006|F1150|PWC|CARL H. |AAF049|FJG1B|FTW5F|120.0| .71| 4, +226.34|324.0|1.86 |0,868.58 082006|F1150|LWR|KIM |AAF104|FJO1A|FTO5C| | | + | 11.0| .06 | 422.18 082006|F1150|LWR|KIM |AAF104|FJO1A|FTO5D| 33.0| .19| 1, +363.92|127.5| .73 |4,887.53 082006|F1150|LWR|KIM |AAF104|FJO1A|FTW5E| 5.0| .03| +254.81| 9.0| .05 | 403.18 082006|F1150|LWR|KIM |AAF104|FJO1A|FTW5G| | | + | 1.0| .01 | 37.08

    which isn't formatted exctly as you asked, but you can adjust that to suit your preference/requirements.

    (Also, what happened to the KIM lines in your "desired output"?)

    The main trick here is to separate the pages into the header and body, so that you can split out the fixed format lines. That allows you to process them using the right tool for the job, unpack.

    The other simplification is to treat the fields as an array rather than named entities which makes the substitution process a simple loop.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Well that is a definately a sexier approach. Seems like it is much more flexible and more robust than mine.
      As for the Kim records, that was a cut and paste mistake.

      I appreciate the code, I will take it and tweak it a bit and see what happens.

      What's going on here?
      @temp = unpack 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', $line;

        Take a look at the docs for unpack (also pack as it carries more information).

        Briefly, the format template 'xa5x2a3x2a15x2a6x2a5x2a5x4a5x9a8x4a9x5a5x15a5x4a9', consists of 2 types of format specifier.

        1. 'x' & 'xN', which skips forward over 1 or more characters.

          Used here to skip over the inter-column whitespace.

        2. 'aN', which 'captures' N characters.

          This is used to extract the fixed format fields.

        The results are assigned into the array @temp. Note that the length specifiers I've used are quickly approximated from the example posted, you will want to review the values carefully in the light of your full data.

        Unlike a regex capture, what is in the bytes captured is irrelevant, it is based entirely upon the character positions (like substr). I've attempted to show how the parsing works below, but the stupid wrap 'feature' of PM means it doesn't really work.

        < 0> <1> < 2 > < 3 > < 4 > < 5 > < 6 > xaaaaaxxaaaxxaaaaaaaaaaaaaaaxxaaaaaaxxaaaaaxxaaaaaxxxxaaaaaxxxxxxxxx 0 CDE, DEBORAH M. AAF103 FJB1A FTB5A 3.0 < 7 > < 8 > < 9 > <10 > < 11 > aaaaaaaaxxxxaaaaaaaaaxxxxxaaaaaxxxxxxxxxxxxxxxaaaaaxxxxaaaaaaaaa .02 107.83 3.0 .02 107.83

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: file parsing help
by liverpole (Monsignor) on Nov 13, 2006 at 16:59 UTC
    Hi ctaustin,

    Your program doesn't quite work as you've presented it:

    1. You haven't specified the name of the file "$pfile".
    2. There's a "Bareword" error in $x=x;
    3. You haven't supplied the subroutine trim

    Additionally, I'd highly recommend you use strict and warnings at the top of the program.  This will make it necessary to declare your variables before you use them.  For example:

    use strict; use warnings; # Global variables my $cnt = 0; my $header; my $sub_line; my $prev_index; my $x; my $c; my $prev_user; my $prev_id; my $period;

    Finally, it would be a help if you could provide an example of what output you'd like to see as well.


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: file parsing help
by ikegami (Patriarch) on Nov 13, 2006 at 17:01 UTC

    if ($index eq '' )
    should be
    if (trim($index) eq '' )

    Also, something's fishy about

    if ($id eq "") { $id=sprintf("%-*s",8,$prev_id); } else { $id=sprintf("%-*s",8,$prev_id); }

    By the way, please use <c>...</c> instead of <pre>...</pre> on PerlMonks.

    Update:

    $a=index($sub_line,'|',1); $b=index($sub_line,'|',8); $c=index($sub_line,'|',20);

    should be

    $a=index($sub_line,'|'); $b=index($sub_line,'|', $a+1); $c=index($sub_line,'|', $b+1);

    Although you shouldn't use $a and $b. They are special variables, and using them can create problems at a distance.

      Good catch on the Also, something's fishy about
      if ($id eq "") { $id=sprintf("%-*s",8,$prev_id); } else { $id=sprintf("%-*s",8,$prev_id); }
      This fixed the problem I had with the holes in the fourth field.
      I have also changed the variables names away from $a and $b.

      I am still getting missing values for the SIC code in the second field, after adding in the trim($index) eq '' ...

      I appreciate the feedback thus far.

        Also, something's fishy about

        This fixed the problem I had with the holes in the fourth field.

        The "then" and "else" part are identical.

        if ($id eq "") { $id=sprintf("%-*s",8,$prev_id); } else { $id=sprintf("%-*s",8,$prev_id); }

        is the same thing as

        $id=sprintf("%-*s",8,$prev_id);

        I am still getting missing values for the SIC code in the second field

        The problem occurs for the first record of every employee except the first employee. That should be easy to debug.