in reply to Better way to work with large TSV files?

Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise) is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?

(update: for that matter, it might be faster/easier to put the 50 GB on a USB or FireWire disk and FedEx it to the remote host... )
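
For concreteness, the bulk-load step might look something like this if the target database happened to be MySQL (the host, user, database, and table names here are made up for the example):

    # on the local host: push the file over to the database server
    scp TSV.file dbhost:/tmp/TSV.file

    # on the database host: let the server's native bulk loader finish the job
    mysql -u someuser -p -e "LOAD DATA INFILE '/tmp/TSV.file'
        INTO TABLE dbname.my_table FIELDS TERMINATED BY '\t'"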

If you really do need to do all those row inserts over the WAN link, and if you are confident about knowing the difference between success and failure for each insert, then my first notion would be: have the insertion script read the TSV on stdin and append one line to a log file for every row it successfully inserts.

If there's an interruption in the process, just count the lines in the log file (let's say this is "N" -- those are the rows already inserted), and restart the insertion program at the next line, like this:

    tail -n +$((N + 1)) TSV.file | insertion_script >> log.file
You can repeat this as many times as necessary, using the current log line count for N each time, until the log file has the same line count as the TSV file.
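
Counting the lines is cheap (wc -l), so the whole restart step can be wrapped up in a few lines of shell; a sketch, assuming insertion_script really does print exactly one log line per successfully inserted row:

    #!/bin/sh
    # restart the load from wherever the log says we left off
    N=$(wc -l < log.file)        # rows already inserted
    tail -n +$((N + 1)) TSV.file | insertion_script >> log.file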

(In case you don't know "tail", it's a basic Unix utility which could easily be emulated in Perl as follows:

    #!/usr/bin/perl
    # emulate tail's "+N" behavior: print from line N of the input onward
    my $bgn = 1;
    if ( @ARGV and $ARGV[0] =~ /^\+(\d+)/ ) { $bgn = $1; shift; }
    while (<>) { print if $. >= $bgn; }
this just handles the "+N" usage of tail, which is all you need here; there's probably a one-liner form to do the same thing...)
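
For example, something along these lines ought to do the same job as a one-liner (untested sketch; N is the log line count, as above):

    perl -ne 'BEGIN { $bgn = shift } print if $. >= $bgn' $((N + 1)) TSV.file | insertion_script >> log.file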

Re^2: Better way to work with large TSV files?
by radiantmatrix (Parson) on Aug 23, 2004 at 18:27 UTC
    Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise) is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?

    (update: for that matter, it might be faster/easier to put the 50 GB on a USB or FireWire disk and FedEx it to the remote host... )

    Therein lies the problem. This is not a one-time operation, but something done on a regular basis. Also, I am doing quite a bit of data cleaning -- the raw data files are generated by a crappy program over which I have no control; the 'date' fields, for example, don't always have valid 'datetime' data in them, so I must convert before insertion.

    As much as I'd love to just mass-load the thing, it cannot be so.
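
    The cleaning itself is the easy part; a rough sketch of the date fix-up, where the column index and the MM/DD/YYYY source format are made up purely for illustration:

        # normalize one TSV row before insertion; field 3 and the
        # MM/DD/YYYY input format are assumptions, not the real layout
        sub clean_row {
            chomp( my $line = shift );
            my @f = split /\t/, $line, -1;
            if ( $f[3] =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$} ) {
                $f[3] = sprintf '%04d-%02d-%02d', $3, $1, $2;   # ISO-style date
            }
            else {
                $f[3] = undef;    # bad date: insert NULL rather than junk
            }
            return @f;
        }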

    You can repeat this as many times as necessary, using the current log line count for N each time, until the log file has the same line count as the TSV file.
    Some of these files are 500,000 lines long. That's a big log to scan just to find a restart point. I'm not sure that would be a good idea, but it's an interesting way to think about the problem.

    After reading these replies, I'm considering loading the file into memory in roughly 50 MB chunks, inserting each chunk line by line while recording the current line number in a status file. Something like:

        tie %status, 'Tie::Cfg', READ => $statfile, WRITE => $statfile;
        $status{current_line} = 0;
        do {
            @chunk_lines = load_chunk_from($file);
            for (@chunk_lines) {
                process_line($_);
                $status{current_line}++;    # checkpoint: one count per inserted row
            }
        } while (@chunk_lines);
    Then, of course, on startup I'd read the status file to see whether I'm mid-file, and skip the appropriate number of lines before resuming (a rough sketch of that step is below). What do you all think? Better approach?
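
    A sketch of that resume step, reusing the tied %status and process_line from above (the open/skip details are guesses):

        # on startup: see how far a previous run got, and skip those rows
        my $done = $status{current_line} || 0;     # 0 means a fresh start
        open my $in, '<', $file or die "Can't open $file: $!";
        <$in> for 1 .. $done;                      # discard rows already inserted
        # ...then carry on with the chunked loop, reading from $in and
        # bumping $status{current_line} as each remaining row goes in
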
    --
    $me = rand($hacker{perl});