in reply to Re: Better way to work with large TSV files?
in thread Better way to work with large TSV files?

Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise), is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?

(update: for that matter, might be faster/easier to put the 50 GB on a USB or firewire disk and fedex it to remote host... )

Therein lies the problem. This is not a one-time operation, but something done on a regular basis. Also, I am doing quite a bit of data cleaning -- the raw data files are generated by a crappy program over which I have no control; the 'date' fields, for example, don't always contain valid 'datetime' values, so I must convert them before insertion.

As much as I'd love to just mass-load the thing, it cannot be so.
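To give a flavour of the cleanup step, the date handling looks roughly like this (a sketch only; the input format shown is just one example of the shapes the field can take, and clean_datetime is just a name I'm using here):

use Time::Piece;

# Rough sketch of the date cleanup: returns an ISO-style datetime string,
# or undef if the raw field won't parse. The format here is only an
# example; the real files are messier.
sub clean_datetime {
    my ($raw) = @_;
    my $t = eval { Time::Piece->strptime($raw, '%m/%d/%Y %H:%M:%S') };
    return $t ? $t->strftime('%Y-%m-%d %H:%M:%S') : undef;
}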

You can simply repeat this as many times as necessary, using the appropriate value for N each time (how many lines are in the log at present), until the log file has the same line count as the TSV file.
Some of these files are 500,000 lines long. That's a big log to scan just to find a restart point. I'm not sure that would be a good idea, but it's an interesting way to think about the problem.
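If I did go that route, finding the restart point would presumably just be a line count over the log, something like this (with $logfile standing in for whatever the log actually is):

open my $log, '<', $logfile or die "Can't open $logfile: $!";
my $done = 0;
$done++ while <$log>;    # count lines already logged
close $log;
# then skip the first $done lines of the TSV before resuming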

After reading these replies, I'm considering loading the file into memory in roughly 50 MB chunks, inserting each chunk line by line while recording the current line number in a status file. Something like:

use Tie::Cfg;

tie my %status, 'Tie::Cfg', READ => $statfile, WRITE => $statfile;
$status{current_line} = 0;

my @chunk_lines;
do {
    @chunk_lines = load_chunk_from($file);
    for (@chunk_lines) {
        process_line($_);
        $status{current_line}++;
    }
} while (@chunk_lines);
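(load_chunk_from and process_line above are stand-ins. One way the chunk loader might work, keeping the filehandle open between calls so each call picks up where the last one left off, is something like:)

{
    my $fh;
    sub load_chunk_from {
        my ($file) = @_;
        # open once on the first call, keep the handle for later calls
        open $fh, '<', $file or die "Can't open $file: $!" unless $fh;
        my @lines;
        my $bytes = 0;
        while (my $line = <$fh>) {
            push @lines, $line;
            $bytes += length $line;
            last if $bytes >= 50 * 1024 * 1024;    # roughly 50 MB per chunk
        }
        return @lines;    # empty at EOF, which ends the do/while above
    }
}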
Then, of course, on start I'd read the status file to see whether I'm mid-file and skip the appropriate number of lines before resuming; a rough sketch of that part follows. What do you all think? Better approach?
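Glossing over the chunking for a moment, I'm picturing the resume-on-restart part roughly like this (reusing $statfile, $file and process_line from the sketch above):

tie my %status, 'Tie::Cfg', READ => $statfile, WRITE => $statfile;
my $skip = $status{current_line} || 0;

open my $in, '<', $file or die "Can't open $file: $!";
<$in> for 1 .. $skip;    # skip the lines already inserted

while (my $line = <$in>) {
    process_line($line);
    $status{current_line}++;
}
close $in;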
--
$me = rand($hacker{perl});