in reply to Parsing large text file with perl

A loop with lots of flags.
This builds a hash and loads _all_ the data, so you would want to filter out the fixed info you don't need. Depending on the data you expect, you would also want to make sure the regexes are tight enough (these are very loose). As you can see, it is written in a 'verbose' idiom.
I tried it with 2 records and with more than 1 history field.
#!/bin/perl5
use strict;
use warnings;

my %hash;
my $true  = 1;
my $false = 0;
my ($header, $history, $footer, @fields);
my ($vendor, $i);

while (<DATA>) {
    chomp;
    next if /^$/;
    if (/^VENDOR/ and /PAGE/) {
        # first line of a record: VENDOR <id> ... PAGE <n>
        $header = $true;
        $footer = $false;
        @fields = split;
        $vendor = $fields[1];
        push @{$hash{$vendor}{'header'}}, @fields;
        $i = 0;
        next;
    }
    elsif (/AWARD\sHISTORY/) {
        $header  = $false;
        $history = $true;
        next;
    }
    elsif (/PID/) {
        $history = $false;
        $footer  = $true;
        next;
    }
    if ($history) {
        # one hash key per history row: 'history 0', 'history 1', ...
        @fields = split;
        my $history_row = join '', 'history ', $i;
        push @{$hash{$vendor}{$history_row}}, @fields;
        $i++;
    }
    elsif ($footer) {    # bug fix, was $header
        @fields = split;
        push @{$hash{$vendor}{'footer'}}, @fields;
    }
}

open my $out, '>', 'parse.txt' or die "Cannot open parse.txt: $!";
for my $v (keys %hash) {
    print $out "vendor: $v\n";
    for my $rec (keys %{$hash{$v}}) {
        print $out "\trecord:\t$rec\n";
        print $out "\t\t";
        for my $fld (@{$hash{$v}{$rec}}) {
            print $out "$fld\t";
        }
        print $out "\n";
    }
}
close $out;

__DATA__
VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01
VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT
CONTRACT NUMBER PRICE PRICE QTY U/I DATE PR NUMBER BIN/PART NUMBER VALUE VARIANCE VARIANCE
YT67DY7898DUFT5126 88.20000 70.00000 50 EA 00000000 POI90809819856 1560007117067 4,410.00 910.00 0
AWARD HISTORY PIIN BSCM N/A U/I UNIT PRICE AWD DT QTY OPT DT FOB REP TYPE
765WTY34TF56A 7J777 N EA 39.55000 93012 147 00000 2 Y B
PID DATA LINE NR LINE NR
01 001PART, DESCRIPTION, DATA 02 002TECHNICAL DATA AVAILABILITY:
03 003
VENDOR 61126 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01
VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT
CONTRACT NUMBER PRICE PRICE QTY U/I DATE PR NUMBER BIN/PART NUMBER VALUE VARIANCE VARIANCE
YT67DY7898DUFT5126 88.20000 70.00000 50 EA 00000000 POI90809819856 1560007117067 4,410.00 910.00 0
AWARD HISTORY PIIN BSCM N/A U/I UNIT PRICE AWD DT QTY OPT DT FOB REP TYPE
765WTY34TF56A 7J777 N EA 39.55000 93012 147 00000 2 Y B
765WTY34TF56B 7J777 N EA 39.55000 93012 147 00000 2 Y B
765WTY34TF56C 7J777 N EA 39.55000 93012 147 00000 2 Y B
PID DATA LINE NR LINE NR
01 001PART, DESCRIPTION, DATA 02 002TECHNICAL DATA AVAILABILITY:
03 003
produces..
vendor: 61125
    record: history 0
        765WTY34TF56A 7J777 N EA 39.55000 93012 147...
    record: footer
        01 001PART, DESCRIPTION, DATA 02 002TECHNICAL...
    record: header
        VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE...
vendor: 61126
    record: history 0
        765WTY34TF56A 7J777 N EA 39.55000 93012 147...
    record: history 2
        765WTY34TF56C 7J777 N EA 39.55000 93012 147...
    record: history 1
        765WTY34TF56B 7J777 N EA 39.55000 93012 147...
    record: footer
        01 001PART, DESCRIPTION, DATA 02 002TECHNICAL...
    record: header
        VENDOR 61126 TOTAL DOLLAR VAR 77,097.60 PAGE...
Update: added output
Update2: Fixed bug! Footer wasn't stored.
Update3: Truncated and formatted the output (tabs were a bad idea)
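
On the point about tightening the regexes: a minimal sketch of what stricter section markers might look like. The sub name classify_line and the exact patterns are assumptions based only on the sample data shown here, so check them against the real file before relying on them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify one report line. The patterns are anchored at the start of the
# line (leading whitespace allowed). The layouts they expect are assumptions
# taken from the sample data, not a specification of the report format.
sub classify_line {
    my ($line) = @_;
    return ( 'vendor', $1 )
        if $line =~ /^\s*VENDOR\s+(\S+)\s+TOTAL\s+DOLLAR\s+VAR\b.*\bPAGE\s+\d+/;
    return ('history_start') if $line =~ /^\s*AWARD\s+HISTORY\b/;
    return ('footer_start')  if $line =~ /^\s*PID\s+DATA\b/;
    return ('other');
}

for my $line (
    'VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01',
    'VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT',
    'AWARD HISTORY',
    'PID DATA LINE NR LINE NR',
    '01 001PART, DESCRIPTION, DATA',
) {
    my ( $type, $vendor ) = classify_line($line);
    print $type, ( defined($vendor) ? " ($vendor)" : '' ), "\n";
}
```

Anchoring matters most for the footer marker: a bare /PID/ would also fire on any data line that happens to contain those three letters.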

Re^2: Parsing large text file with perl
by maida (Initiate) on Sep 02, 2004 at 03:27 UTC
    Thanks for all the help.... I am almost there with this one. When I get down to the individual histories and need to parse each line how do I handle missing data? For example:
    AP040003EZ9891783 61125 N BX 108.00000 03196 00000 D Y B
    BP041303DD554 009J0 N BX 8.75000 03168 62 00000 Y W
    I was trying to split on space and then populate an array, but then the columns are messed up.... If a value is missing I get errors like x6 being saved into the spot for x5. Thanks again, -Shawn
      This looks like a fixed-length record, so you could use unpack. While the following 'demonstrates' the idea, it's probably not the best use of unpack (e.g. there's a floating point number). Your best bet would be to ask another question to find an elegant use of unpack.
      #!/bin/perl5
      use strict;
      use warnings;

      chomp(my @history = <DATA>);

      for my $record (@history) {
          print "$record\n";
          $record =~ s/^\s*//;    # drop leading whitespace so column 1 is the PIIN
          # one 'aN' per column; N is the column width in characters
          my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $record;
          for my $field (@fields) {
              print "*$field*\n";
          }
      }

      __DATA__
      AP040003EZ9891783 61125 N BX 108.00000 03196 00000 D Y B
      BP041303DD554 009J0 N BX 8.75000 03168 62 00000 Y W
      When you've cracked it you could probably apply it to the other records as well.
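
To make the missing-value case concrete, here is a small self-contained sketch. The field widths are the ones from the template above and are only an assumption about the real layout; the sample lines are built with pack so that they really are fixed width, with an empty string standing in for a missing value. The names @names, @widths and @records are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column names from the AWARD HISTORY heading. The widths are assumed, not
# confirmed against the real file.
my @names  = ( 'PIIN', 'FSCM', 'N/A', 'U/I', 'UNIT PRICE', 'AWD DT',
               'QTY', 'OPT DT', 'FOB', 'REP', 'TYPE' );
my @widths = ( 21, 9, 9, 2, 13, 8, 9, 9, 4, 5, 6 );

# 'A' pads with spaces when packing and strips trailing spaces when unpacking.
my $template = join ' ', map { "A$_" } @widths;

# Two fixed-width sample lines; '' marks a value that is absent in the record.
my @lines = (
    pack( $template, 'AP040003EZ9891783', '61125', 'N', 'BX', '108.00000',
          '03196', '', '00000', 'D', 'Y', 'B' ),
    pack( $template, 'BP041303DD554', '009J0', 'N', 'BX', '8.75000',
          '03168', '62', '00000', '', 'Y', 'W' ),
);

my @records;
for my $line (@lines) {
    my @fields = unpack $template, $line;
    s/^\s+// for @fields;    # also trim left padding, e.g. right-aligned numbers
    my %award;
    @award{@names} = @fields;    # hash slice: column name => value
    push @records, \%award;
}

for my $award (@records) {
    for my $name (@names) {
        my $value = $award->{$name};
        printf "%-10s = %s\n", $name, ( length($value) ? $value : '(missing)' );
    }
    print "\n";
}
```

Because each value is cut out by position rather than by counting whitespace-separated tokens, a blank QTY or FOB column comes back as an empty string and the later columns stay where they belong.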

        Thank you all for the help... Hopefully my last question. Here is the code that at the very least separates the data out.
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $true  = 1;
        my $false = 0;
        my ($header, $history, $footer, @fields);
        my ($vendor, $i);
        my $file = "AUG.txt";
        my @FILE;
        my $vendor_id = 0;
        my @VENDORS;
        my @CONTRACTS;
        my $contract_id = 0;
        my @AWARDS;

        open (INFILE, '<', $file) or die "Cannot open $file: $!";
        @FILE = <INFILE>;
        close (INFILE);

        foreach (@FILE){
            chomp;
            next if /^$/;
            if (/VENDOR.+PAGE/){
                # vendor header line
                @fields = split;
                $vendor_id++;
                #push @VENDORS,"$vendor_id $fields[1]\n";
                print "\n\nVENDOR \= $fields[1]\n";
                next;
            }
            elsif (/\s+?\S{17}\s+?\S+?\./){
                # contract line: 17-character contract number, then a price
                #push @CONTRACTS,"$vendor_id $_\n";
                @fields = split;
                print " CONTRACT NUMBER \= $fields[0]\n";
                print " VENDOR PRICE \= $fields[1]\n";
                print " BASE PRICE \= $fields[2]\n";
                print " QTY \= $fields[3]\n";
                print " SHIP DATE \= $fields[4]\n";
                print " PR NUMBER \= $fields[5]\n";
                print " ARR NUMBER \= $fields[6]\n";
                print " DOLLAR VALUE \= $fields[7]\n";
                print " DOLLAR VARIENCE \= $fields[8]\n";
                print " PERCENT VARIANCE \= $fields[9]\n";
                print "\n";
                next;
            }
            elsif (/^\s+?\S{13}\s+?\S+?\s+?\S/){
                # award history line: 13-character PIIN, fixed-width columns
                #print "$_\n";
                $_ =~ s/^\s*//;
                my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $_;
                print " PIIN \= $fields[0]\n";
                print " FSCM \= $fields[1]\n";
                print " N/A \= $fields[2]\n";
                print " U/I \= $fields[3]\n";
                print " UNIT PRICE \= $fields[4]\n";
                print " AWD DT \= $fields[5]\n";
                print " QTY \= $fields[6]\n";
                print " OPT DT \= $fields[7]\n";
                print " FOB \= $fields[8]\n";
                print " REP \= $fields[9]\n";
                print " TYPE \= $fields[10]\n";
                print "\n";
            }
            else{
                # description lines such as "01 001..."
                $_ =~ s/^\s*//;
                if (/^\d{2}\s\d{3}/){
                    print "$_\n";
                }
            }
        }
        Part of the output:
        VENDOR = 1NWV5
         CONTRACT NUMBER = AAB40003VG880MODF
         VENDOR PRICE = 3.25000
         BASE PRICE = 0.76000
         QTY = 34
         SHIP DATE = EA
         PR NUMBER = 00000000
         ARR NUMBER = YPG03188000386
         DOLLAR VALUE = 3110009197232
         DOLLAR VARIENCE = 110.50
         PERCENT VARIANCE = 84.66

         PIIN = CFS50080P7291
         FSCM = 5N366
         N/A = N
         U/I = EA
         UNIT PRICE = 0.30000
         AWD DT = 80004
         QTY = 6,600
         OPT DT = 00000
         FOB = D
         REP = Y
         TYPE = B

        01 001ROLLER,NEEDLE
        02 002DIV GENERAL MOTORS CORP
        03 003PAGE 73342
        04 004P/N 2275468
        05 005IDENTIFY TO:
        06 006
        07 007
        Of course, keep in mind that both the contract information and the history information can repeat any number of times per vendor. Now I need to somehow create a data structure that will allow me to easily read the data back out and make database inserts. Here is how the data is related:
        VENDOR = 1NWV5
        FOREACH VENDOR
            LIST OF CONTRACTS
            FOREACH CONTRACT
                LIST OF CONTRACT INFORMATION
                LIST OF AWARDS
                FOREACH AWARD
                    LIST OF AWARD INFORMATION
                CONTRACT DESCRIPTION [The three or four lines after the history - This is getting dumped in a big text field in the database.]
        From the looks of it I would have an array of vendors, each containing an array of contracts, each of which contains two hashes (contract information and contract description) and an array of hashes (the awards). What I just said doesn't even make sense to me, so hopefully you can put it in perspective or suggest an easier way, since I need to be able to pull the data back out of the structure. Thanks again -Shawn
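
One possible shape for such a structure, as a minimal sketch rather than a definitive design: a hash keyed by vendor id, each value an array of contracts, and each contract a hash holding its own fields, an array of award hashes, and the description text. The names %vendors, add_contract, info, awards and description are invented for the example; the sample values come from the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout:
#   %vendors = ( vendor_id => [ { info        => { ... },
#                                 awards      => [ { ... }, ... ],
#                                 description => '...' }, ... ] )
my %vendors;

# Start a new contract for a vendor and return a reference to it, so the
# parser can keep appending awards and description lines as it reads them.
sub add_contract {
    my ( $vendor, %info ) = @_;
    my $contract = { info => {%info}, awards => [], description => '' };
    push @{ $vendors{$vendor} }, $contract;
    return $contract;
}

my $contract = add_contract(
    '1NWV5',
    'CONTRACT NUMBER' => 'AAB40003VG880MODF',
    'VENDOR PRICE'    => '3.25000',
    'BASE PRICE'      => '0.76000',
);

# Each award is one hash pushed onto the current contract's award list.
push @{ $contract->{awards} },
    { 'PIIN' => 'CFS50080P7291', 'FSCM' => '5N366', 'UNIT PRICE' => '0.30000' };

# Description lines are simply concatenated into one text field.
$contract->{description} .= "$_\n"
    for ( '01 001ROLLER,NEEDLE', '02 002DIV GENERAL MOTORS CORP' );

# Reading it back out, e.g. to drive the database inserts.
for my $vendor ( sort keys %vendors ) {
    print "VENDOR = $vendor\n";
    for my $c ( @{ $vendors{$vendor} } ) {
        print "  CONTRACT = ", $c->{info}{'CONTRACT NUMBER'}, "\n";
        for my $award ( @{ $c->{awards} } ) {
            print "    AWARD = ", $award->{'PIIN'}, " at ", $award->{'UNIT PRICE'}, "\n";
        }
        print "    DESCRIPTION:\n", $c->{description};
    }
}
```

Keeping a reference to the current contract while parsing means the award and description branches of the loop only need to push onto it, and the nested loops at the end mirror the FOREACH outline directly.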