I have a program that takes a file with records from a database, pulls out the information I need from each record, and prints a nice comma-delimited file. The program works well on my test records. However, the file I need to parse is 4.5 GB. When I start perl on this file, it freezes (or at least appears to): there is no growth in the size of the output file, but the CPU appears to be processing a huge amount of data. I thought the program would read just the current record from the file, print it to the output, and move on to the next record, but I do not think this is happening. This is my code, abbreviated (the actual regular-expression parsing is taken out):
#!/usr/bin/perl -w
use strict;

open(OUT, ">/Users/micwood/Desktop/output.csv");
my $awardhashref = ();
my $allDocs = do { local $/ = '<hr>\r'; <>; };
my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
while ($allDocs =~ m{$rxExtractDoc}g) {
    my %award = ();    # award hash
    $award{'record'}      = $1;
    $award{'A_awardno'}   = $2;
    $award{'entireaward'} = $3;
    $award{'entireaward'} =~ s/\t//g;
    $award{'entireaward'} =~ s/\r//g;
    if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)<}gi) {
        $award{'B_dollob'} = $2;
    }
    if ($award{'entireaward'} =~ m{Current Contract Value(.*?)\$([^<]+?)<}gi) {
        $award{'C_currentconvalue'} = $2;
    }
etc., etc. The deleted section is the data extraction, where I pull out the information I need. I then print to the screen and write to the OUT file:
    print qq{Award Number: $award{'A_awardno'}\n},
          qq{Dollars Obligated: $award{'B_dollob'}\n},
          qq{Current Contract Value: $award{'C_currentconvalue'}\n},
          qq{Ultimate Contract Value: $award{'D_ultconvalue'}\n},
          qq{Contracting Agency: $award{'E_conagency'}\n},
          q{-} x 25, qq{\n};
    delete $award{'entireaward'};
    delete $award{'record'};
    foreach my $key (sort keys %award) {
        print OUT '"' . $award{$key} . '",';
    }
    print OUT "\n";
    $awardhashref = \%award;
}
my @thekeys = sort keys %$awardhashref;
$, = ",";
print(@thekeys, "\n");
print OUT (@thekeys, "\n");
close OUT;
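As an aside on the CSV output: the loop above quotes each value by hand and leaves a trailing comma on every row. A small helper, purely hypothetical and not part of the original script, could do the joining in one place, doubling any embedded quotes and dropping the trailing comma:

```perl
use strict;
use warnings;

# Hypothetical helper (my own, not from the original code): join a list of
# fields into one CSV line, doubling embedded double quotes and adding no
# trailing comma.
sub csv_line {
    my @fields = @_;
    return join(",",
        map { my $f = $_ // ''; $f =~ s/"/""/g; qq{"$f"} } @fields
    ) . "\n";
}

# Example with made-up field names in the same style as the script:
my %award = (A_awardno => '0123456', B_dollob => '1,500.00');
print csv_line(map { $award{$_} } sort keys %award);
# "0123456","1,500.00"
```

For anything beyond a sketch like this, the Text::CSV module handles quoting and escaping edge cases more robustly.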
so my questions are: should it not be cycling through the file, reading in a record at a time and printing it to the screen and the OUT file? Is there a better way to deal with reading in blocks, given such a large file? Again, the program works great on smaller files but it seems confused by the 4.5 GB file. Is it possible that it is working but I won't see anything for a while?
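For what it's worth, one way to sanity-check the record-at-a-time expectation is a minimal sketch like the one below (the demo data and variable names are mine, not from the original file). One possible culprit in the posted code: `'<hr>\r'` is single-quoted, so `$/` contains a literal backslash and `r` rather than a carriage return; if that separator never occurs in the file, `<>` slurps everything into one string. With double quotes and a `while` loop, each pass reads exactly one record:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of record-at-a-time reading, assuming records end in
# "<hr>\r". Note the DOUBLE quotes: in single quotes, \r is a literal
# backslash followed by 'r', so the separator may never match and <>
# would read the entire file as one "record".
my $data = "<h4>Award #1</h4>body one<hr>\r<h4>Award #2</h4>body two<hr>\r";
open my $fh, '<', \$data or die "open: $!";   # in-memory file, for the demo

local $/ = "<hr>\r";                          # record separator
my @awards;
while (my $record = <$fh>) {                  # reads ONE record per pass
    chomp $record;                            # strips the trailing "<hr>\r"
    if ($record =~ m{<h4>Award\s\#(\d+)}) {
        push @awards, $1;                     # extract, print, move on
    }
}
close $fh;
print join(",", @awards), "\n";               # prints: 1,2
```

Reading this way keeps memory flat regardless of file size, since only the current record is ever held in memory.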
I am still very green with Perl so any help would be greatly appreciated. Thanks again, Michael
In reply to Large file data extraction by micwood