Greetings again:

I have a program that takes a file of records exported from a database, pulls out the information I need from each record, and prints a nice comma-delimited file. The program works nicely on my test records. However, the file I actually need to parse is 4.5 GB. When I start perl on this file, it freezes (or at least appears to): there is no growth in the size of the output file, but the CPU appears to be processing a huge amount of data. I thought the program would read just the current record from the file, print it to the output, and move on to the next record, but I do not think this is happening. Here is my code, abbreviated (with the actual regular-expression parsing taken out):

#!/usr/bin/perl -w
use strict;

open(OUT, ">/Users/micwood/Desktop/output.csv");
my $awardhashref = ();
my $allDocs = do { local $/ = '<hr>\r'; <>; };
my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
while ($allDocs =~ m{$rxExtractDoc}g) {
    my %award = ();    # award hash
    $award{'record'}      = $1;
    $award{'A_awardno'}   = $2;
    $award{'entireaward'} = $3;
    $award{'entireaward'} =~ s/\t//g;
    $award{'entireaward'} =~ s/\r//g;
    if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)<}gi) {
        $award{'B_dollob'} = $2;
    }
    if ($award{'entireaward'} =~ m{Current Contract Value(.*?)\$([^<]+?)<}gi) {
        $award{'C_currentconvalue'} = $2;
    }

Etc., etc. The section deleted here is the rest of the data extraction, pulling out the information I need. I then print to the screen and then write to the OUT file:

    print qq{Award Number: $award{'A_awardno'}\n},
          qq{Dollars Obligated: $award{'B_dollob'}\n},
          qq{Current Contract Value: $award{'C_currentconvalue'}\n},
          qq{Ultimate Contract Value: $award{'D_ultconvalue'}\n},
          qq{Contracting Agency: $award{'E_conagency'}\n},
          q{-} x 25, qq{\n};
    delete $award{'entireaward'};
    delete $award{'record'};
    foreach my $key (sort keys %award) {
        print OUT '"' . $award{$key} . '",';
    }
    print OUT "\n";
    $awardhashref = \%award;
}
my @thekeys = sort keys %$awardhashref;
$, = ",";
print(@thekeys, "\n");
print OUT (@thekeys, "\n");
close OUT;

So my questions are: should it not be cycling through the file, reading in one record at a time and printing it to the screen and the OUT file? Is there a better way to deal with reading in blocks, given such a large file? Again, the program works great on smaller files, but it seems confused by the 4.5 GB file. Is it possible that it is working but I won't see any output for a while?
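For reference, here is a minimal record-at-a-time sketch of the approach the question describes. The two-record sample and the field names are invented for illustration, and it assumes each record really ends with <hr>. One detail worth noting: in a single-quoted string, '<hr>\r' contains a literal backslash followed by 'r', not a carriage return, so that separator may never match the data, in which case <> reads the entire file as one record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two invented records standing in for the 4.5 GB file; an in-memory
# filehandle lets the sketch run without the real data.
my $sample = "<h4>Award #111 Dollars Obligated \$1,000<\r<hr>\r"
           . "<h4>Award #222 Dollars Obligated \$2,500<\r<hr>\r";
open my $in, '<', \$sample or die "open: $!";

my @rows;
{
    # Double quotes matter here: a double-quoted "\r" is a real carriage
    # return, while the single-quoted '<hr>\r' separator is a literal
    # backslash-r and would make <> slurp everything in one read.
    local $/ = "<hr>";
    while (my $record = <$in>) {
        next unless $record =~ m{<h4>Award\s\#(\d+)}s;
        my %award = (A_awardno => $1);
        if ($record =~ m{Dollars Obligated.*?\$([^<]+)<}si) {
            $award{B_dollob} = $1;
        }
        # Each record is parsed, emitted, and discarded, so memory use
        # stays flat regardless of input size.
        push @rows, join(',', map { qq{"$award{$_}"} } sort keys %award);
    }
}
print "$_\n" for @rows;
close $in;
```

With the real file, the in-memory handle would be replaced by open my $in, '<', $path, and each CSV line printed to the output handle as soon as it is built rather than collected in an array.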

I am still very green with Perl so any help would be greatly appreciated. Thanks again, Michael


In reply to Large file data extraction by micwood
