micwood has asked for the wisdom of the Perl Monks concerning the following question:

Greetings again:

I have a program that takes a file with records from a database, pulls out the information I need from each record, and prints a nice comma-delimited file. The program works nicely for my test records. However, the file I need to parse is 4.5 GB. When I start perl on this file, it freezes (or at least it appears to)--there is no growth in the size of the output file, but the CPU appears to be processing a huge amount of data. I thought that the program would read just the current record from the file, print that to the output, and move on to the next record, but I do not think this is happening. This is my code, abbreviated (taking out the actual regular-expression parsing):

#!/usr/bin/perl -w
use strict;

open(OUT, ">/Users/micwood/Desktop/output.csv");
my $awardhashref = ();
my $allDocs = do { local $/ = '<hr>\r'; <>; };
my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
while ($allDocs =~ m{$rxExtractDoc}g) {
    my %award = ();    # award hash
    $award{'record'}      = $1;
    $award{'A_awardno'}   = $2;
    $award{'entireaward'} = $3;
    $award{'entireaward'} =~ s/\t//g;
    $award{'entireaward'} =~ s/\r//g;
    if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)<}gi) {
        $award{'B_dollob'} = $2;
    }
    if ($award{'entireaward'} =~ m{Current Contract Value(.*?)\$([^<]+?)<}gi) {
        $award{'C_currentconvalue'} = $2;
    }

etc., etc. ... the deleted section is the data extraction, where it pulls out the information I need. I then print to the screen and write to the OUT file:

    print qq{Award Number: $award{'A_awardno'}\n},
          qq{Dollars Obligated: $award{'B_dollob'}\n},
          qq{Current Contract Value: $award{'C_currentconvalue'}\n},
          qq{Ultimate Contract Value: $award{'D_ultconvalue'}\n},
          qq{Contracting Agency: $award{'E_conagency'}\n},
          q{-} x 25, qq{\n};
    delete $award{'entireaward'};
    delete $award{'record'};
    foreach my $key (sort keys %award) {
        print OUT '"' . $award{$key} . '",';
    }
    print OUT "\n";
    $awardhashref = \%award;
}
my @thekeys = sort keys %$awardhashref;
$, = ",";
print(@thekeys, "\n");
print OUT (@thekeys, "\n");
close OUT;

so my questions are: should it not be cycling through the file, reading in a record at a time and printing it to the screen and the OUT file? Is there a better way to deal with reading in blocks, given such a large file? Again, the program works great on smaller files, but it seems confused by the 4.5 GB file. Is it possible that it is working but I won't see anything for a while?

I am still very green with Perl, so any help would be greatly appreciated. Thanks again, Michael

Re: Large file data extraction
by GrandFather (Saint) on Aug 12, 2008 at 00:28 UTC

    Hold on. Do you really slurp a 4.5 GB file into memory? On anything with less than about 20 GB of memory that will cause thrashing like you wouldn't believe (err, ok, maybe you would - you've seen it)!

    It looks like you are parsing HTML so you should take a hard look at modules like HTML::Parser to do a lot of the heavy lifting for you.
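    For example, here is a minimal event-driven sketch with HTML::Parser (the h4 handler and the printed label are placeholders - adjust them to whatever your real markup looks like; the input file name is taken from the command line):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my $in_h4 = 0;

    # Handlers fire as the file streams through, so the whole
    # 4.5 GB never sits in memory at once.
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $in_h4 = 1 if $_[0] eq 'h4' }, 'tagname' ],
        end_h       => [ sub { $in_h4 = 0 if $_[0] eq 'h4' }, 'tagname' ],
        text_h      => [ sub { print "heading: $_[0]\n" if $in_h4 }, 'dtext' ],
    );

    $p->parse_file( $ARGV[0] ) or die "Can't parse $ARGV[0]: $!";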

    If you are not dealing with HTML, then at least nest the while loop in an outer while loop that reads a record at a time rather than slurping the whole file.


    Perl reduces RSI - it saves typing
      To elaborate on GrandFather's comment:

      4.5GB is a huge amount of data (the CPU appears to be processing a huge amount of data).

      Whether or not you take GrandFather's suggestion to use HTML::Parser, you definitely want to take his suggestion of replacing the slurp with a record-at-a-time read.

      A while loop reading a record at a time will allow for useful print statements for debugging or progress reporting.
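      Something like this sketch (assuming "<hr>" really does terminate every record) keeps memory usage flat and lets you watch progress as it runs:

      local $/ = "<hr>";              # read one record per <> call
      my $count = 0;
      while ( my $record = <> ) {
          # ... run the extraction regexes against $record here ...
          print STDERR "processed $count records\n" unless ++$count % 10_000;
      }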

Re: Large file data extraction
by Cristoforo (Curate) on Aug 12, 2008 at 01:34 UTC
    I put together a partial solution - it's difficult to provide an accurate answer without seeing the rest of your parsing code and some sample data.

    I've not used any of the HTML parsers, so I can't say how they might work. Like you, I've rolled my own parser, but as I say, it's difficult without seeing the data.

    I'm assuming you are reading a file created on Windows on a Unix machine. That would explain why you are using the \r at different places in your code. Perhaps this will give you a little start.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $awardhashref;    # Why needing this? Already printing out keys in the loop.

    # use s modifier so '.' matches newlines
    # No need to end regex with <hr> - your record already terminates with it.
    my $rxExtractDoc = qr{(<h4>Award\s#(\d+)(.*?))}s;

    my $out = "/Users/micwood/Desktop/output.csv";
    open OUT, '>', $out or die "Unable to open $out for writing";

    {
        local $/ = "<hr>";

        while (<>) {
            chomp;
            if (/$rxExtractDoc/) {
                my %award;
                $award{record}      = $1;
                $award{A_awardno}   = $2;
                $award{entireaward} = $3;

                # Do you really want to replace each tab
                # with the 'empty string' (nothing)?
                $award{entireaward} =~ s/\t//g;

                # Eliminate Windows's \r
                $award{entireaward} =~ s/\r//g;

                if ($award{entireaward} =~ m{Dollars Obligated.*?\$([^<]+)<}is) {
                    $award{B_dollob} = $1;
                }
                if ($award{entireaward} =~ m{Current Contract Value.*?\$([^<]+)<}is) {
                    $award{C_currentconvalue} = $1;
                }

                #... further parsing

                # print to terminal
                print
                    qq{Award Number: $award{A_awardno}\n},
                    qq{Dollars Obligated: $award{B_dollob}\n},
                    qq{Current Contract Value: $award{C_currentconvalue}\n},
                    qq{Ultimate Contract Value: $award{D_ultconvalue}\n},
                    qq{Contracting Agency: $award{E_conagency}\n},
                    q{-} x 25, qq{\n};

                delete $award{entireaward};
                delete $award{record};

                # print to file
                print OUT join(',', map {"$award{$_}"} sort keys %award), "\n";

                # $awardhashref = \%award; ?
            }
        }
    }
    close OUT or die "Unable to close $out";
    Update: Added chomp and changed inner while loop to an if. Also, set $/ to <hr>. Thanks ikegami.

    Update2: Changed the print to output file. I was printing the keys instead of the values.

      '<hr>\r\n'? or "<hr>\r\n"...

      Might as well just use '<hr>'. Then it'll work no matter which line ending is used.
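      A sketch of that suggestion - chomp removes whatever $/ currently holds, so the trailing '<hr>' disappears with it:

      local $/ = '<hr>';        # matches whether lines end in \n or \r\n
      while ( my $record = <> ) {
          chomp $record;        # strips the trailing '<hr>'
          $record =~ tr/\r//d;  # delete any stray carriage returns
          # ... extract fields from $record ...
      }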

Re: Large file data extraction
by Fletch (Bishop) on Aug 11, 2008 at 23:41 UTC

    Erm, you never read anything other than the first "record" (and your delimiter of <hr>\r looks suspicious; you probably mean "<hr>\n" instead, which might explain why it's slurping the entire file in if the delimiter never actually matches), and then you're constantly iterating, searching through that same record text. So long as there's a single match you're going to be stuck in an infinite loop looking at the same data every time (at least as the amended code reads; all bets are off given the truncated code sample).

    You really want something more along the lines of the normal idiom for searching through a file (presuming this really is your delimiter):

    local $/ = "<hr>\n";
    while ( my $line = <> ) {
        ## process results from $line ...
    }

    (That aside, given this looks to be some sort of HTML, you may be better off, if it's sufficiently XML-y, using one of the stream-capable XML parsers (for instance, XML::Twig will work this way; see the section "Processing an XML document chunk by chunk") than trying to rip things apart with regexen.)
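    A rough sketch of that chunk-by-chunk approach (this only works if the dump is well-formed X(HT)ML - plain tag soup will make the parser die; 'awards.xhtml' is a made-up file name):

    use strict;
    use warnings;
    use XML::Twig;

    # Handle each <h4> as it is parsed, then purge the twig
    # so memory stays flat no matter how big the file is.
    my $twig = XML::Twig->new(
        twig_handlers => {
            h4 => sub {
                my ( $t, $elt ) = @_;
                print "award heading: ", $elt->text, "\n";
                $t->purge;    # release everything parsed so far
            },
        },
    );
    $twig->parsefile('awards.xhtml');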

    Update: Duur, quite right. Completely missed the /g modifier.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      So long as there's a single match you're going to be stuck in an infinite loop

      Actually no. There is a g modifier on the regex. The while loop only iterates as long as there is another match. Consider:

      print "$1\n" while '1234X5678X' =~ m/([^X]+)X/g;

      Prints:

      1234
      5678

      Perl reduces RSI - it saves typing
Re: Large file data extraction
by eosbuddy (Scribe) on Aug 11, 2008 at 21:35 UTC
    Not sure whether this will work: try to turn off buffering:
    $| = 1
    right after your strict pragma. Update: err... forgot the crucial ; like:
    $| = 1;
      Thanks, I tried, but no luck. I think there is a problem in the first do loop (and the fact that it isn't incorporating the while loop). It appears to be reading in the entire file and then dividing at the delimiters, which it can't do given such a huge file. Any ideas on how to make the first loop read in only a record at a time and then move on to the next record, without reading in the entire file? Thanks again!!
        Apologies for not scrutinizing your program better earlier. Still, after looking at it, I am a bit puzzled, as I thought your code "reads" a file, yet I don't see a filehandle for the input. For reading a file line by line, the usual prescription is:
        while (<INFILE>) {
            # ... code goes here
        }
Re: Large file data extraction
by ikegami (Patriarch) on Aug 12, 2008 at 02:00 UTC

    I think there is a problem in first do loop [...]. It appears to be reading in the entire file

    I don't see any do *loop* in the code you presented, and the do is only reading until the input record separator ($/) is found, not the entire file.

    But then again, I doubt your IRS exists in the file. '<hr>\r' should be "<hr>\r" if you want the \r to match a carriage return.
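    A quick pair of one-liners shows the difference:

    print length('<hr>\r'), "\n";   # 6 - the backslash and 'r' are two literal characters
    print length("<hr>\r"), "\n";   # 5 - "\r" is a single carriage return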

      Okay, I am slowly getting there. You are correct, the delimiter should have been "<hr>\r"; with '<hr>\r' it was slurping the entire file in. However, as it stands now, with the following changes the program only reads in the first record and then quits without looping to the next record:
      my $allDocs = do { local $/ = "<hr>\r"; <>; };
      my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
      while ($allDocs =~ m{$rxExtractDoc}g) {
          my %award = ();    # award hash
          $award{'record'}      = $1;
          $award{'A_awardno'}   = $2;
          $award{'entireaward'} = $3;
          # $award{'entireaward'} =~ s/\n//g;
          $award{'entireaward'} =~ s/\t//g;
          $award{'entireaward'} =~ s/\r//g;
          if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)<}gi) {
              $award{'B_dollob'} = $2;
          }
      {the rest of the code continues.} But it doesn't loop to the next record once it finishes extracting the first record, and I am not sure why. Thoughts? Thank you so much again. This is very helpful (and I am at wits' end as it is) :-) Also, someone mentioned that there is no input... I have the file on the command line such that <> refers to it. Is that not correct?

        But it doesn't loop to the next record once it finishes extracting the first record

        Your regexp is wrong, or your data isn't what you think it is. What's in $allDocs?

        someone mentioned that there is no input....I have the file on the command line such that <> refers to it.

        It seems that eosbuddy missed or didn't understand <>.

        $allDocs won't contain all records, only the first. <> in scalar context will return one 'record'. Use something like:

        local $/ = "<hr>\r";
        my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
        while (<>) {    # read one record at a time into $_
            if (m{$rxExtractDoc}) {
                my %award = ();    # award hash
                $award{'record'}      = $1;
                $award{'A_awardno'}   = $2;
                $award{'entireaward'} = $3;
                # $award{'entireaward'} =~ s/\n//g;
                $award{'entireaward'} =~ s/\t//g;
                $award{'entireaward'} =~ s/\r//g;
                # ... rest of code
            }
        }

        And yes, <> will read from STDIN or filenames from command line.
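        For instance, assuming the script is saved as extract.pl (a made-up name), either of these will feed it the file:

        perl extract.pl huge_dump.html
        perl extract.pl < huge_dump.html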

        I hope this helps.

        Peter Stuifzand