Re: Parsing large text file with perl

A loop with lots of flags.
This builds a hash and loads _all_ the data, so you would want to filter out the fixed information you don't need. Depending on the data you expect, you would also want to tighten the regexes (these are very loose). As you can see, it is written in a 'verbose' idiom.
I tried it with 2 records and with more than 1 history field.

#!/bin/perl5

use strict;
use warnings;

my %hash;
my $true = 1;
my $false = 0;
my ($header, $history, $footer, @fields );
my ($vendor, $i );

while (<DATA>){
  chomp;
  next if /^$/;
  if (/^VENDOR/ and /PAGE/){
    $header = $true;
    $footer = $false;
    @fields = split;
    $vendor = $fields[1];
    push @{$hash{$vendor}{'header'}} , @fields;
    $i = 0;
    next;
  }
  elsif ( /AWARD\sHISTORY/ ){
    $header = $false;
    $history = $true;
    next;
  }
  elsif ( /PID/ ){
    $history = $false;
    $footer = $true;
    next;
  }
  if ( $history ){
    @fields = split;
    my $history_row = join '', 'history ', $i;
    push @{$hash{$vendor}{$history_row}}, @fields;
    $i++;
  }
  elsif ( $footer ){ # bug fix, was $header
    @fields = split;
    push @{$hash{$vendor}{'footer'}}, @fields;
  }
}
open my $out, '>', 'parse.txt' or die "Cannot open parse.txt: $!";
for my $v ( keys %hash ){
  print $out "vendor: $v\n";
  for my $rec ( keys %{$hash{$v}} ){
    print $out "\trecord:\t$rec\n";
    print $out "\t\t";
    for my $fld ( @{$hash{$v}{$rec}} ){
      print $out "$fld\t";
    }
    print $out "\n";
  }
}
close $out;
      

__DATA__
VENDOR 61125 TOTAL DOLLAR VAR 77,097.60  PAGE 1  2003 08 01

VENDOR  SIS  UNIT BASE SHIP TOT DOL DOLLAR  PERCENT
   CONTRACT NUMBER          PRICE      PRICE     QTY  U/I     DATE       PR NUMBER    BIN/PART NUMBER    VALUE  VARIANCE  VARIANCE

   YT67DY7898DUFT5126      88.20000     70.00000      50  EA   00000000  POI90809819856    1560007117067    4,410.00     910.00     0


    AWARD HISTORY   PIIN                BSCM   N/A      U/I   UNIT PRICE  AWD DT      QTY   OPT DT  FOB  REP   TYPE

                    765WTY34TF56A        7J777    N        EA     39.55000   93012      147    00000   2    Y     B

   PID  DATA   LINE NR                                                     LINE NR
                 01 001PART, DESCRIPTION, DATA                            02 002TECHNICAL DATA AVAILABILITY:
                 03 003

VENDOR 61126 TOTAL DOLLAR VAR 77,097.60  PAGE 1  2003 08 01

VENDOR  SIS  UNIT BASE SHIP TOT DOL DOLLAR  PERCENT
   CONTRACT NUMBER          PRICE      PRICE     QTY  U/I     DATE       PR NUMBER    BIN/PART NUMBER    VALUE  VARIANCE  VARIANCE

   YT67DY7898DUFT5126      88.20000     70.00000      50  EA   00000000  POI90809819856    1560007117067    4,410.00     910.00     0


    AWARD HISTORY   PIIN                BSCM   N/A      U/I   UNIT PRICE  AWD DT      QTY   OPT DT  FOB  REP   TYPE

                    765WTY34TF56A        7J777    N        EA     39.55000   93012      147    00000   2    Y     B
                    765WTY34TF56B        7J777    N        EA     39.55000   93012      147    00000   2    Y     B
                    765WTY34TF56C        7J777    N        EA     39.55000   93012      147    00000   2    Y     B

   PID  DATA   LINE NR                                                     LINE NR
                 01 001PART, DESCRIPTION, DATA                            02 002TECHNICAL DATA AVAILABILITY:
                 03 003

produces..

vendor: 61125
  record: history 0
    765WTY34TF56A 7J777 N EA 39.55000 93012 147...
  record: footer
    01 001PART, DESCRIPTION, DATA 02 002TECHNICAL...
  record: header
    VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE...
vendor: 61126
  record: history 0
    765WTY34TF56A 7J777 N EA 39.55000 93012 147...
  record: history 2
    765WTY34TF56C 7J777 N EA 39.55000 93012 147...
  record: history 1
    765WTY34TF56B 7J777 N EA 39.55000 93012 147...
  record: footer
    01 001PART, DESCRIPTION, DATA 02 002TECHNICAL...
  record: header
    VENDOR 61126 TOTAL DOLLAR VAR 77,097.60 PAGE...

Update: added output
Update2: Fixed bug! Footer wasn't stored.
Update3: Truncated and formatted the output (tabs were a bad idea)
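
The record order in the output above (history 0, history 2, history 1) comes from iterating `keys` on a hash, which Perl does not return in any particular order. Here is a minimal sketch of one way to get a stable order, assuming the record names used above. `record_rank` and `sorted_records` are hypothetical helpers, not part of the posted script, and the sample `%hash` is a cut-down stand-in for the real one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample mirroring the %hash layout built by the script above.
my %hash = (
    61126 => {
        'header'    => [ 'VENDOR', '61126' ],
        'history 0' => [ '765WTY34TF56A' ],
        'history 1' => [ '765WTY34TF56B' ],
        'history 2' => [ '765WTY34TF56C' ],
        'footer'    => [ '01', '001PART,' ],
    },
);

# Rank record types: header first, history rows by numeric suffix, rest last.
sub record_rank {
    my ($rec) = @_;
    return ( 0, 0 )  if $rec eq 'header';
    return ( 1, $1 ) if $rec =~ /^history (\d+)$/;
    return ( 2, 0 );
}

# Return the record names of one vendor in a stable, readable order.
sub sorted_records {
    my ($records) = @_;
    return sort {
        my @ra = record_rank($a);
        my @rb = record_rank($b);
        $ra[0] <=> $rb[0] or $ra[1] <=> $rb[1] or $a cmp $b;
    } keys %$records;
}

for my $v ( sort keys %hash ) {
    print "vendor: $v\n";
    for my $rec ( sorted_records( $hash{$v} ) ) {
        print "  record: $rec\n";
        print "    @{ $hash{$v}{$rec} }\n";
    }
}
```

Comparing the history suffix numerically keeps "history 10" after "history 2", which a plain string sort would not.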


Re^2: Parsing large text file with perl by maida (Initiate) on Sep 02, 2004 at 03:27 UTC
Thanks for all the help.... I am almost there with this one. When I get down to the individual histories and need to parse each line, how do I handle missing data? For example:

AP040003EZ9891783 61125 N BX 108.00000 03196 00000 D Y B
BP041303DD554 009J0 N BX 8.75000 03168 62 00000 Y W

I was trying to split on spaces and then populate an array, but then the columns are messed up.... If a value is missing I get errors like x6 being saved into the spot for x5.

Thanks again,
-Shawn
Re^3: Parsing large text file with perl by wfsp (Abbot) on Sep 02, 2004 at 06:15 UTC
This looks like a fixed-length record. You could use `unpack`. While the following 'demonstrates' the idea, it's probably not the best use of unpack (e.g. there's a floating point number). Your best bet would be to ask another question to find an elegant use of `unpack`.

#!/bin/perl5
use strict;
use warnings;

my @history = <DATA>;

for my $record (@history){
  print "$record\n";
  $record =~ s/^\s//;
  my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $record;
  for my $field (@fields){
    print "$field*\n";
  }
}
__DATA__
AP040003EZ9891783 61125 N BX 108.00000 03196 00000 D Y B
BP041303DD554 009J0 N BX 8.75000 03168 62 00000 Y W

When you've cracked it you could probably apply it to the other records as well.
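
To make the difference between splitting on whitespace and unpacking by column concrete, here is a minimal sketch. The field names and widths in `@layout` are assumptions made up for illustration (the real report's column widths are not recoverable from the thread), and the sample lines are built with `sprintf` so that they match those assumed widths. `parse_record` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout: names and widths are illustrative, not the real report's.
my @layout = (
    [ piin   => 19 ],
    [ fscm   => 7 ],
    [ ui     => 4 ],
    [ price  => 11 ],
    [ awd_dt => 7 ],
    [ qty    => 6 ],
    [ opt_dt => 7 ],
);

# "A" fields are fixed width and have trailing spaces stripped by unpack.
my $template = join ' ', map { "A$_->[1]" } @layout;

# Split one fixed-width line into a hash; blank columns become undef.
sub parse_record {
    my ($line) = @_;
    my @values = unpack $template, $line;
    my %rec;
    for my $i ( 0 .. $#layout ) {
        my $v = defined $values[$i] ? $values[$i] : '';
        $v =~ s/^\s+//;    # right-justified columns carry leading spaces
        $rec{ $layout[$i][0] } = length $v ? $v : undef;
    }
    return \%rec;
}

# Sample lines padded to the assumed widths; the first one has no QTY.
my $fmt   = '%-19s%-7s%-4s%11s%7s%6s%7s';
my @lines = (
    sprintf( $fmt, 'AP040003EZ9891783', '61125', 'BX', '108.00000', '03196', '',   '00000' ),
    sprintf( $fmt, 'BP041303DD554',     '009J0', 'BX', '8.75000',   '03168', '62', '00000' ),
);

for my $line (@lines) {
    my $rec = parse_record($line);
    for my $field ( map { $_->[0] } @layout ) {
        my $shown = defined $rec->{$field} ? $rec->{$field} : '(missing)';
        print "$field = $shown\n";
    }
    print "\n";
}
```

With `split ' '` the first sample line yields only six tokens, so the OPT DT value would land in the QTY slot. `unpack` reads each value from its own column, so a blank column stays blank instead of shifting its neighbours.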
Re^4: Parsing large text file with perl by maida (Initiate) on Sep 03, 2004 at 03:19 UTC
Thank you all for the help... Hopefully my last question. Here is the code that at the very least separates the data out.

#!/usr/bin/perl
use strict;
use warnings;

my $true = 1;
my $false = 0;
my ($header, $history, $footer, @fields);
my ($vendor, $i);
my $file = "AUG.txt";
my @FILE;
my $vendor_id = 0;
my @VENDORS;
my @CONTRACTS;
my $contract_id = 0;
my @AWARDS;

open (INFILE, '<', $file) or die "Cannot open $file: $!";
@FILE = <INFILE>;
close (INFILE);

foreach (@FILE){
  chomp;
  next if /^$/;
  if (/VENDOR.+PAGE/){
    @fields = split;
    $vendor_id++;
    #push @VENDORS,"$vendor_id $fields[1]\n";
    print "\n\nVENDOR \= $fields[1]\n";
    next;
  }
  elsif (/\s+?\S{17}\s+?\S+?\./){
    #push @CONTRACTS,"$vendor_id $_\n";
    @fields = split;
    print " CONTRACT NUMBER \= $fields[0]\n";
    print " VENDOR PRICE \= $fields[1]\n";
    print " BASE PRICE \= $fields[2]\n";
    print " QTY \= $fields[3]\n";
    print " SHIP DATE \= $fields[4]\n";
    print " PR NUMBER \= $fields[5]\n";
    print " ARR NUMBER \= $fields[6]\n";
    print " DOLLAR VALUE \= $fields[7]\n";
    print " DOLLAR VARIENCE \= $fields[8]\n";
    print " PERCENT VARIANCE \= $fields[9]\n";
    print "\n";
    next;
  }
  elsif (/^\s+?\S{13}\s+?\S+?\s+?\S/){
    #print "$_\n";
    $_ =~ s/^\s//;
    my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $_;
    print " PIIN \= $fields[0]\n";
    print " FSCM \= $fields[1]\n";
    print " N/A \= $fields[2]\n";
    print " U/I \= $fields[3]\n";
    print " UNIT PRICE \= $fields[4]\n";
    print " AWD DT \= $fields[5]\n";
    print " QTY \= $fields[6]\n";
    print " OPT DT \= $fields[7]\n";
    print " FOB \= $fields[8]\n";
    print " REP \= $fields[9]\n";
    print " TYPE \= $fields[10]\n";
    print "\n";
  }
  else{
    $_ =~ s/^\s//;
    if (/^\d{2}\s\d{3}/){
      print "$_\n";
    }
  }
}

Part of the output:

VENDOR = 1NWV5
 CONTRACT NUMBER = AAB40003VG880MODF
 VENDOR PRICE = 3.25000
 BASE PRICE = 0.76000
 QTY = 34
 SHIP DATE = EA
 PR NUMBER = 00000000
 ARR NUMBER = YPG03188000386
 DOLLAR VALUE = 3110009197232
 DOLLAR VARIENCE = 110.50
 PERCENT VARIANCE = 84.66

 PIIN = CFS50080P7291
 FSCM = 5N366
 N/A = N
 U/I = EA
 UNIT PRICE = 0.30000
 AWD DT = 80004
 QTY = 6,600
 OPT DT = 00000
 FOB = D
 REP = Y
 TYPE = B

01 001ROLLER,NEEDLE 02 002DIV GENERAL MOTORS CORP
03 003PAGE 73342 04 004P/N 2275468
05 005IDENTIFY TO: 06 006
07 007

Of course, keep in mind that both the contract information and the history information can repeat any number of times per vendor. Now I need to somehow create a data structure that will allow me to easily read the data back out and make database inserts. Here is how the data is related:

VENDOR = 1NWV5
FOREACH VENDOR
  LIST OF CONTRACTS
  FOREACH CONTRACT
    LIST OF CONTRACT INFORMATION
    LIST OF AWARDS
    FOREACH AWARD
      LIST OF AWARD INFORMATION
    CONTRACT DESCRIPTION [The three or four lines after the history - This is getting dumped in a big text field in the database.]

From the looks of it I would have an Array of Vendors containing an Array of Contracts containing two Hashes (Contract Information and Contract Description) and an Array of Hashes. What I just said doesn't even make sense to me, so hopefully you can put it in perspective or suggest an easier way, as I need to be able to pull the data back out of the structure.

Thanks again
-Shawn
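
One possible shape for the structure described above, offered as a sketch rather than a definitive answer: a hash keyed by vendor, each vendor holding an array of contracts, and each contract being a hash with its information, an array of award hashes, and the description text. The helper names (`add_contract`, `add_award`, `add_description_line`) and the hash keys are hypothetical. The sample values are a few fields copied from the output in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %vendors;    # vendor id => { contracts => [ contract, ... ] }

# Start a new contract under a vendor and return a reference to it.
sub add_contract {
    my ( $vendor, %info ) = @_;
    my $contract = { info => {%info}, awards => [], description => '' };
    push @{ $vendors{$vendor}{contracts} }, $contract;
    return $contract;
}

# Attach one award (history row) to a contract.
sub add_award {
    my ( $contract, %info ) = @_;
    push @{ $contract->{awards} }, {%info};
    return;
}

# Append one footer line to the contract's free-text description.
sub add_description_line {
    my ( $contract, $line ) = @_;
    $contract->{description} .= "$line\n";
    return;
}

# A few sample values copied from the output in the post.
my $c = add_contract(
    '1NWV5',
    contract_number => 'AAB40003VG880MODF',
    vendor_price    => '3.25000',
    base_price      => '0.76000',
);
add_award( $c, piin => 'CFS50080P7291', fscm => '5N366', unit_price => '0.30000' );
add_description_line( $c, '01 001ROLLER,NEEDLE' );
add_description_line( $c, '02 002DIV GENERAL MOTORS CORP' );

# Reading it back: one loop per level, where the database inserts would go.
for my $vendor ( sort keys %vendors ) {
    print "VENDOR = $vendor\n";
    for my $contract ( @{ $vendors{$vendor}{contracts} } ) {
        print "  CONTRACT = $contract->{info}{contract_number}\n";
        for my $award ( @{ $contract->{awards} } ) {
            print "    AWARD PIIN = $award->{piin}\n";
        }
        print "  DESCRIPTION:\n$contract->{description}";
    }
}
```

In the parsing loop, the reference returned by `add_contract` could be kept in a variable for the current contract, so that each later history row and footer line is attached to the contract it follows.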
Re^5: Parsing large text file with perl by wfsp (Abbot) on Sep 03, 2004 at 05:27 UTC
Re^6: Parsing large text file with perl by Anonymous Monk on Sep 03, 2004 at 10:32 UTC