comment on

Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise) is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?
(update: for that matter, might be faster/easier to put the 50 GB on a USB or firewire disk and fedex it to remote host... )

Therein lies the problem. This is not a one-time operation, but something done on a regular basis. Also, I am doing quite a bit of data cleaning -- the raw data files are generated by a crappy program over which I have no control; the 'date' fields, for example, don't always have valid 'datetime' data in them, so I must convert before insertion.

As much as I'd love to just mass-load the thing, it cannot be so.

You can simply repeat this as many times as necessary, simply using the appropriate value for N each time (how many lines in the log at present), until the log file has the same line count as the TSV file.

Some of these files are 500,000 lines long. That's a big log to scan just to find a restart point. I'm not sure that would be a good idea, but it's an interesting way to think about the problem.

After reading these replies, I'm considering loading the file into memory a 50MB (roughly) chunk at a time, inserting each chunk line-by-line while recording the current line in a file. Something like:

tie %status, 'Tie::Cfg', READ=>$statfile, WRITE=>$statfile;
$status{current_line} = 0;
do {
   @chunk_lines = load_chunk_from($file);
   for (@chunk_lines) {
     process_line($_);
     $status{current_line}++;
   }
} while (@chunk_lines);
[download]

Then, of course, reading the status file on start to see if I am mid-file and skipping the appropriate number of lines before resuming. What do you all think? Better approach?

^{--

$me = rand($hacker{perl});}

In reply to Re^2: Better way to work with large TSV files? by radiantmatrix
in thread Better way to work with large TSV files? by radiantmatrix

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.