geektron has asked for the wisdom of the Perl Monks concerning the following question:

I've searched the monastery and googled for a possible solution, but nothing has come up that I'm not already doing.

I have a HUGE file that I need to parse and insert into a database ( ~334G of text ), and the current code is devouring memory and bringing the system to its knees.

The algorithm I'm currently using:

open( FILE, $filename ) or die " Can't open file ($filename): $!\n";

while ( my $line = <FILE> ) {
    chomp $line;
    my @fields = split( /\t/, $line );
    ### do some added parsing, then insert
}
Other nodes, like Quickest way of reading in large files? and reading (caching?) large files, mention the while loop as the least memory-intensive approach, but short of splitting the file into other, smaller files, is there a solution?

UPDATES:
OK, added clarifications.

  1. The RDBMS in question is MySQL 4.1.10.
  2. There aren't any wrappers (like Class::DBI or Tie::DBI) involved.
  3. Because I need to reorder the fields (well, I need to identify the MLSNumber, since it's real estate data, and see if the data needs an insert/update, and whether the data needs to be inserted into a view) ... I don't see the MySQL loader being the right tool for the job ...
and for the rest of the loop:
## outside the loop
my @firstLines = qw/ AccessInstr PropType UID Acres MLSNum MediaSource /;
my %firsts     = map { $_ => 1 } @firstLines;

## the loop
while ( my $line = <FILE> ) {
    chomp $line;
    my @fields = split( /\t/, $line );
    my $lines  = 0;

    if ( $firsts{$fields[0]} ) {
        %lookup   = map { $lines++ => $_ } @fields;
        @dbFields = values %lookup;
    }

    %reversed        = reverse %lookup;
    $mlsField        = $reversed{'MLSNum'};
    $mlsNumber       = $fields[$mlsField];
    $officeListField = $reversed{'OfficeList'};
    my $officeList   = $fields[$officeListField];
    $officeSellField = $reversed{'OfficeSell'};
    my $officeSell   = $fields[$officeSellField];

    next if ( $firsts{$fields[0]} );

    while ( scalar( @fields ) != scalar( @dbFields ) ) {
        push( @fields, undef );
    }

    $testSth->execute( $mlsNumber );
    my $inDB = $testSth->fetchrow();
    $jbgTestSth->execute( $mlsNumber );
    my $inJBGDB = $jbgTestSth->fetchrow();

    my @updateVals;
    if ( $inDB ) {
        my $i = 0;
        foreach my $loop ( @fields ) {
            push( @updateVals, $loop ) unless $i == $mlsField;
            $i++;
        }
        push( @updateVals, $mlsNumber );
        $updSth->execute( @updateVals );
    }
    else {
        $insSth->execute( @fields ) unless ( $fields[$mlsField] eq '' );
    }

    if (    ( $officeList && grep( /$officeList/, @offices ) )
         or ( $officeSell && grep( /$officeSell/, @offices ) ) ) {
        if ( $inJBGDB ) {
            $updSummary->execute( @updateVals )
                or warn "no update! : " . $dbh->errstr();
        }
        else {
            $insSummary->execute( @fields );
        }
    }
}

Replies are listed 'Best First'.
Re: processing huge files
by Corion (Patriarch) on Aug 02, 2005 at 08:33 UTC

    If you're loading stuff into a database via DBI, bind variables are generally the fastest way. Avoid Class::DBI or any other wrappers, and perhaps look into the execute_array method to do the inserts in batches.
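
    For example, here is a minimal sketch of batched inserts with execute_array; the table name, column layout, and connection details are made up for illustration, so adapt them to the real schema:

    use strict;
    use warnings;
    use DBI;

    my $filename = 'listings.txt';    # placeholder path

    # placeholder connection details -- substitute the real ones
    my $dbh = DBI->connect( 'dbi:mysql:database=realestate', 'dbuser', 'dbpass',
                            { RaiseError => 1, AutoCommit => 0 } );

    my $sth = $dbh->prepare(
        'INSERT INTO listings ( mls_num, office_list, office_sell ) VALUES ( ?, ?, ? )'
    );

    open my $in, '<', $filename or die "Can't open file ($filename): $!\n";

    my ( @mls, @list, @sell );
    while ( my $line = <$in> ) {
        chomp $line;
        my @fields = split /\t/, $line;
        push @mls,  $fields[0];
        push @list, $fields[1];
        push @sell, $fields[2];

        # flush every 1000 rows so nothing accumulates in memory
        if ( @mls == 1000 ) {
            $sth->execute_array( {}, \@mls, \@list, \@sell );
            $dbh->commit;
            @mls = @list = @sell = ();
        }
    }
    $sth->execute_array( {}, \@mls, \@list, \@sell ) if @mls;
    $dbh->commit;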

    If you can use them, look into the external loader utilities for the various databases - it might well be faster to output SQL strings from Perl and pipe them into mysqlldr, or use the parser language of bcp or sqlldr instead of doing the munging in Perl.
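
    Along the same lines, a rough sketch of the "generate SQL and pipe it to a loader" idea, here just using the stock mysql command-line client; the database name, credentials, and table are placeholders, and the quoting is deliberately simplistic:

    use strict;
    use warnings;

    my $filename = 'listings.txt';    # placeholder path

    # stream generated SQL straight into the mysql client, one line at a time
    open my $mysql, '|-', 'mysql', '--user=dbuser', '--password=dbpass', 'realestate'
        or die "Can't start mysql client: $!";
    open my $in, '<', $filename or die "Can't open file ($filename): $!";

    while ( my $line = <$in> ) {
        chomp $line;
        my @quoted = map { ( my $f = $_ ) =~ s/'/''/g; "'$f'" } split /\t/, $line;
        print {$mysql} 'INSERT INTO listings VALUES (' . join( ',', @quoted ) . ");\n";
    }

    close $mysql or warn "mysql client exited with status $?";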

Re: processing huge files
by rev_1318 (Chaplain) on Aug 02, 2005 at 08:25 UTC
    What happens with the data inside the while loop? The memory hog must be somewhere there. Are you storing $line or @fields in another data structure?

    Please show us a small complete program that demonstrates your problem.

    Paul

      Per your request, the loop has been added to the node.
Re: processing huge files
by anonymized user 468275 (Curate) on Aug 02, 2005 at 08:39 UTC
    This looks like it may be a database question you just haven't asked yet and to which the answer is likely to be database-dependent.

    For example, if you were inserting into Sybase, then all other things being horribly equal, you would be:

    1) building a huge uncheckpointed transaction log,

    2) locking database resources on a grand scale, and

    3) creating a single transaction of gigantic proportions.

    Any one of these, even taken alone, could be causing such symptoms.

    If the database isn't Sybase, the same cause may apply conceptually, but it will show different database-dependent symptoms.
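
    If the storage engine is transactional, one way around the giant-transaction problem is to commit in fixed-size batches. A sketch, borrowing $dbh and $insSth from the loop above (the batch size is arbitrary):

    $dbh->{AutoCommit} = 0;    # manage transactions explicitly

    my $rows = 0;
    while ( my $line = <FILE> ) {
        chomp $line;
        my @fields = split( /\t/, $line );
        $insSth->execute( @fields );

        # keep each transaction (and its log/locks) to a bounded size
        $dbh->commit if ++$rows % 5_000 == 0;
    }
    $dbh->commit;    # flush the final partial batch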

    Please may we have the rest of the loop plus the identity and version of the RDBMS.

    One world, one people

      The rest of the loop is there, as well as the "secret identity" of the RDBMS (MySQL 4.1.10).
        In this case, MySQL's LOAD DATA INFILE command appears to be the way to go, as another responder has already suggested. The replies that suggest using Perl classes are missing the point: the database just can't take all those inserts - that's why databases come with bulk-loading facilities.

        One world, one people

Re: processing huge files
by jhourcle (Prior) on Aug 02, 2005 at 12:20 UTC

    I have no idea if this is the actual problem that you are seeing, but a long series of inserts is very, very bad for most databases.

    If you already have indexes on the table, then every single insert has to modify both the table and the indexes. If you have constraints, then they need to be checked for every record.

    If the process can be done completely through the database's built-in loading program (MySQL's LOAD DATA syntax, Oracle's SQL*Loader, etc.), then use it directly. This defers the modifications to the indexes until after everything has completed, and applies other optimizations for bulk changes.

    If the data needs some sort of cleaning before insertion, then I would recommend cleaning the values and writing them back out to a text file ... then using the database's loading utility to slurp in the processed file.
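
    A sketch of that clean-then-load approach. Paths and the table name are placeholders, and the LOAD DATA statement assumes the cleaned file ends up somewhere the MySQL server can read it and that the account has the FILE privilege (mysqlimport or LOAD DATA LOCAL are alternatives if not):

    use strict;
    use warnings;
    use DBI;

    open my $in,  '<', 'listings_raw.txt'        or die "read: $!";
    open my $out, '>', '/tmp/listings_clean.txt' or die "write: $!";

    while ( my $line = <$in> ) {
        chomp $line;
        my @fields = split /\t/, $line, -1;
        # ... whatever per-field cleanup is needed goes here ...
        print {$out} join( "\t", @fields ), "\n";
    }
    close $out;

    # one bulk load at the end, so index updates are deferred
    my $dbh = DBI->connect( 'dbi:mysql:database=realestate', 'dbuser', 'dbpass',
                            { RaiseError => 1 } );
    $dbh->do(q{
        LOAD DATA INFILE '/tmp/listings_clean.txt'
        INTO TABLE listings
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
    });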

Re: processing huge files
by fauria (Deacon) on Aug 02, 2005 at 13:11 UTC
    Well, if the file is that huge and cannot be handled in one piece, maybe "divide and conquer" will work:
    use strict;

    my $file      = "filename";
    my $file_size = 30720;                             # file size in MB
    my $chunks    = 1226;                              # how many pieces
    my $size      = int( $file_size / $chunks ) + 1;   # MB per chunk
    my $counter   = 0;

    for ( 0 .. $chunks ) {
        my $skip = $size * $counter;                   # offset of this chunk, in MB
        `dd if=$file of=$file.$counter bs=1M count=$size skip=$skip`;
        $counter++;
    }

    This will output 1226 files called filename.*. Note that you will need an amount of free disk space equal to the original file size, and when reading the pieces, keep in mind that a record may be split across two chunks, since dd cuts at byte offsets rather than at the record separator (i.e. the newline character). Also, be sure to close processed files! :D

      If you're going to hand the work off to an external tool, you might want to use split instead of dd, since it can split on full lines (so it won't break in the middle of a record, given the logic the OP was using).

      You also don't need to call it repeatedly in a loop, as the equivalent of your dd example would be:

      split -C 30720m -a 3 $INFILE
        I used dd because it can seek directly to a position in the file using skip and then sequentially read its content, without needing to read through the whole file to reach that location.
Re: processing huge files
by graff (Chancellor) on Aug 02, 2005 at 21:58 UTC
    because I need to reorder the fields (well, I need to identify the MLSNumber, since it's real estate data, and see if the data needs an insert/update, and whether the data needs to be inserted into a view) ... I don't see the MySQL loader being the right tool for the job ...

    What if you made an initial "analyze/index" pass over the big file, to identify the updates vs. inserts, identify the view inserts, etc.? In other words, "parse" the big file first, and maybe even split it up into pieces according to what needs to be done with each subset (if you have enough disk to store a second copy of it all -- but if you don't have that much space, writing a set of byte-offset indexes is likely to take a lot less than 334G).

    Once you know how to break down the file contents according to what sort of treatment they need, push the insert records into mysqlimport -- that will go remarkably fast with fairly low load on the system. Maybe you'll still need to handle updates via DBI, but that will be easier to optimize and will go a lot quicker if you avoid doing DBI inserts in the same run (and make sure the "where" clause cites an indexed field, and use prepared statements with "?" placeholders, etc).

    Of course, anything you do with 334G of data is going to take a while, and the idea of doing your job in multiple passes over the data might seem silly. But think about it: each pass will be easier to code, easier to validate, and relatively efficient and fast at runtime with relatively low system load, compared to a monolithic "one pass to do it all".

    In the long run, the time required to do a lot of little tasks in succession could easily end up being less than the time required to do them all at once in a single massive job.
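
    As a sketch of that first pass, splitting the dump into "new rows" for mysqlimport and "existing rows" for prepared DBI updates -- the table, column, and file names here are made up, and it assumes the set of existing MLS numbers fits comfortably in memory:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=realestate', 'dbuser', 'dbpass',
                            { RaiseError => 1 } );

    # pull the existing MLS numbers once, instead of one SELECT per input line
    my %existing;
    my $sth = $dbh->prepare('SELECT mls_num FROM listings');
    $sth->execute;
    while ( my ($mls) = $sth->fetchrow_array ) {
        $existing{$mls} = 1;
    }

    open my $in,  '<', 'listings_raw.txt' or die $!;
    open my $ins, '>', 'inserts.txt'      or die $!;
    open my $upd, '>', 'updates.txt'      or die $!;

    my $header  = <$in>;    # skip the feed's header line
    my $mls_col = 4;        # position of MLSNum in the feed -- adjust to the header

    while ( my $line = <$in> ) {
        my $mls = ( split /\t/, $line )[$mls_col];
        print { defined $mls && $existing{$mls} ? $upd : $ins } $line;
    }
    # inserts.txt can now go through mysqlimport; updates.txt through prepared DBI updates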

Re: processing huge files
by izut (Chaplain) on Aug 02, 2005 at 11:28 UTC
    You can use Tie::File to read the file as an array, and then parse each line and insert it into the database.
    use Tie::File;

    tie @array, 'Tie::File', $filename or die $!;

    for (@array) {
        # parse
        # insert
    }
    Igor.


    Igor S. Lopes - izut
    surrender to perl. your code, your rules.
      After reading the perldoc for Tie::File it sounded like a good idea™ ... but it's not behaving per the docs.

      I'm trying:

      tie my @array, 'Tie::File', $filename, mode => O_RDONLY;

      my $recs = @array;
      warn "we have $recs records to parse ";

      pop @array;

      foreach my $element ( @array ) {
          print "ok, we'd be updating $element \n";
      }
      The tie statement is straight out of the perldoc.

      I need to skip the first line during processing, since it's a "header" line.

      The error that comes back is:

      we have 280579 records to parse at ./upd_res_data2.pl line 72, <FH> line 280579.
      Couldn't write record: Bad file descriptor at /usr/lib/perl5/5.8.0/Tie/File.pm line 665, <FH> line 280580.
      And I don't think it *should* be writing, because it's (supposedly) opened read-only. Any insight?
        I tried:

        file.txt

        # header
        1
        2
        3
        4
        5
        6
        test.pl
        use Tie::File;

        tie @f, 'Tie::File', "file.txt" or die $!;

        foreach (@f[1..$#f]) {
            print $_, "\n";
        }
        Result is:
        1
        2
        3
        4
        5
        6


        Igor S. Lopes - izut
        surrender to perl. your code, your rules.