maida has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have to parse several large text files and enter the results in to a database. Each text file is 65 to 70 thousand pages long. I need a jumpstart getting the text into arrays or hashes and then I think I can take it from there. I used sed to remove garbage from the file, but I am unsure where to go from there. I started to use IO::File to read in and split the file, but that didn't seem to go anywhere. Any help would be greatly appreciated. Here is a sample record from the file:
VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01 VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT CONTRACT NUMBER PRICE PRICE QTY U/I DATE + PR NUMBER BIN/PART NUMBER VALUE VARIANCE VARIANCE YT67DY7898DUFT5126 88.20000 70.00000 50 EA 0000000 +0 POI90809819856 1560007117067 4,410.00 910.00 0 AWARD HISTORY PIIN BSCM N/A U/I UNIT PRI +CE AWD DT QTY OPT DT FOB REP TYPE 765WTY34TF56A 7J777 N EA 39.5 +5000 93012 147 00000 2 Y B PID DATA LINE NR + LINE NR 01 001PART, DESCRIPTION, DATA + 02 002TECHNICAL DATA AVAILABILITY: 03 003
The above record format repeats until EOF. The award history section repeats an undefined number of times for each main record. In case the above record is hard to read here is the basic format of the file. Each record looks like:
Header (Line with the VENDOR 61125) Main Record (a bunch of columns) A number of sub records (more columns) Footer (Yet more columns, everything from PID down)
Thanks -Shawn \

Replies are listed 'Best First'.
Re: Parsing large text file with perl
by dragonchild (Archbishop) on Sep 01, 2004 at 02:59 UTC
    Ugh. If it didn't have two formats in one, I'd say to use whatever import mechanism your database has. (MySQL has LOAD DATA INFILE, Oracle has several different tools, etc.)

    What you're looking for is something like:

    while (<FILE>) { chomp; my @data = split /\t/; # Do something with the line. }

    Alternatively, you could pass the file to Text::xSV, which will do the parsing for you and hand you a line. You'll still have to keep track of whether you're dealing with a main line or an award history line, but you'd have to do that no matter what. (If you didn't, you wouldn't use Perl to solve the problem - you'd use an RDBMS tool.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Parsing large text file with perl
by wfsp (Abbot) on Sep 01, 2004 at 05:49 UTC
    A loop with lots of flags.
    This builds a hash. It loads _all_ the data. You would want to filter out fixed info you don't want. Also, depending on the data you would expect, you would also want to make sure the regexs are tight enough (these are very loose). As you can see, it is in 'verbose' idiom.
    I tried it with 2 records and with more than 1 history field.
    #!/bin/perl5 use strict; use warnings; my %hash; my $true = 1; my $false = 0; my ($header, $history, $footer, @fields ); my ($vendor, $i ); while (<DATA>){ chomp; next if /^$/; if (/^VENDOR/ and /PAGE/){ $header = $true; $footer = $false; @fields = split; $vendor = $fields[1]; push @{$hash{$vendor}{'header'}} , @fields; $i = 0; next; } elsif ( /AWARD\sHISTORY/ ){ $header = $false; $history = $true; next; } elsif ( /PID/ ){ $history = $false; $footer = $true; next; } if ( $history ){ @fields = split; my $history_row = join '', 'history ', $i; push @{$hash{$vendor}{$history_row}}, @fields; $i++; } elsif ( $footer ){ # bug fix, was $header @fields = split; push @{$hash{$vendor}{'footer'}}, @fields; } } open my $out, '>', 'parse.txt'; for my $v ( keys %hash ){ print $out "vendor: $v\n"; for my $rec ( keys %{$hash{$v}} ){ print $out "\trecord:\t$rec\n"; print $out "\t\t"; for my $fld ( @{$hash{$v}{$rec}} ){ print $out "$fld\t"; } print $out "\n"; } } close $out; __DATA__ VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01 VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT CONTRACT NUMBER PRICE PRICE QTY U/I DATE + PR NUMBER BIN/PART NUMBER VALUE VARIANCE VARIANCE YT67DY7898DUFT5126 88.20000 70.00000 50 EA 0000000 +0 POI90809819856 1560007117067 4,410.00 910.00 0 AWARD HISTORY PIIN BSCM N/A U/I UNIT PRI +CE AWD DT QTY OPT DT FOB REP TYPE 765WTY34TF56A 7J777 N EA 39.5 +5000 93012 147 00000 2 Y B PID DATA LINE NR + LINE NR 01 001PART, DESCRIPTION, DATA + 02 002TECHNICAL DATA AVAILABILITY: 03 003 VENDOR 61126 TOTAL DOLLAR VAR 77,097.60 PAGE 1 2003 08 01 VENDOR SIS UNIT BASE SHIP TOT DOL DOLLAR PERCENT CONTRACT NUMBER PRICE PRICE QTY U/I DATE + PR NUMBER BIN/PART NUMBER VALUE VARIANCE VARIANCE YT67DY7898DUFT5126 88.20000 70.00000 50 EA 0000000 +0 POI90809819856 1560007117067 4,410.00 910.00 0 AWARD HISTORY PIIN BSCM N/A U/I UNIT PRI +CE AWD DT QTY OPT DT FOB REP TYPE 765WTY34TF56A 7J777 N EA 39.5 +5000 93012 147 00000 2 Y B 765WTY34TF56B 7J777 N EA 39.5 +5000 93012 147 00000 2 Y B 765WTY34TF56C 7J777 N EA 39.5 +5000 93012 147 00000 2 Y B PID DATA LINE NR + LINE NR 01 001PART, DESCRIPTION, DATA + 02 002TECHNICAL DATA AVAILABILITY: 03 003
    produces..
    vendor: 61125 record: history 0 765WTY34TF56A 7J777 N EA 39.55000 93012 147... record: footer 01 001PART, DESCRIPTION, DATA 02 002TECHNICAL... record: header VENDOR 61125 TOTAL DOLLAR VAR 77,097.60 PAGE... vendor: 61126 record: history 0 765WTY34TF56A 7J777 N EA 39.55000 93012 147... record: history 2 765WTY34TF56C 7J777 N EA 39.55000 93012 147... record: history 1 765WTY34TF56B 7J777 N EA 39.55000 93012 147... record: footer 01 001PART, DESCRIPTION, DATA 02 002TECHNICAL... record: header VENDOR 61126 TOTAL DOLLAR VAR 77,097.60 PAGE...
    Update: added output
    Update2: Fixed bug! Footer wasn't stored.
    Update3: Truncated and formated the output (tabs were a bad idea)
      Thanks for all the help.... I am almost there with this one. When I get down to the individual histories and need to parse each line how do I handle missing data? For example:
      AP040003EZ9891783 61125 N BX 108.0 +0000 03196 00000 D Y B BP041303DD554 009J0 N BX 8.7 +5000 03168 62 00000 Y W
      I was trying to split on space and then populate an array, but then the colums are messed up.... If a value is missing I get errors like x6 being saved in to the spot for x5. Thanks again, -Shawn
        This looks like a fixed length record. You could use unpack. While the following 'demonstrates' the idea it's probably not the best use of unpack (e.g. there's a floating point number). Your best bet would be to ask another question to find an elegant use of unpack.
        #!/bin/perl5 use strict; use warnings; my @history = <DATA>; for my $record (@history){ print "$record\n"; $record =~ s/^\s*//; my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $record; for my $field (@fields){ print "*$field*\n"; } } __DATA__ AP040003EZ9891783 61125 N BX 108.0 +0000 03196 00000 D Y B BP041303DD554 009J0 N BX 8.7 +5000 03168 62 00000 Y W
        When you've cracked it you could probably apply it to the other records as well.

Re: Parsing large text file with perl
by TheEnigma (Pilgrim) on Sep 01, 2004 at 04:43 UTC
    The "# Do something with the line." that dragonchild mentioned will likely involve some regular expressions to help in finding the beginning of records, the (variable number of) award history sections, and the footer; among other possible uses. It depends exactly on what data you want to keep and what you want to do with it, of course.

    TheEnigma