radiantmatrix has asked for the wisdom of the Perl Monks concerning the following question:

O Wise Monks:

I am in the process of writing a Perl suite to (among other things) load extremely large amounts of data from Tab-Separated files to a database over a slow WAN link. By "large", I mean over 50GB.

I have tried two different approaches. One is to load the entire file into a list, prepare the INSERT statement, and then $sth->execute(@list). This is reasonably fast, and works. But, if Win32 (I have no other choice here) crashes during a run, I'm left with a partially-inserted table. Not OK. I'm aware of transactions, but given the size I don't want to have a 2GB file (many are this large) fail near the end and lose a couple hours' worth of statements.

The second approach, which is more fault tolerant, is to use cpan://Tie::File to tie the source file to an array, then pop the array after successfully inserting a row. Like this:

$sth->prepare("INSERT INTO Blah (a,b,c) VALUES (?,?,?)"); tie @list, 'Tie::File', $filename; while (@list) { my @insert = processRow($list[$#list]); #this splits the records int +o an array and does some data cleaning $sth->execute(@insert); #And check for errors and all that... pop @list; } untie @list; unless ($errors) { unlink $filename; }
This is very reliable, and allows me to run more than one instance (as long as I work on separate files -- I am using flock). However, it's slow as a dehydrated camel.

Is there a Better Way to accomplish this?

Update: Aristotle has suggested replacing Tie::File with File::ReadBackwards. The latter module covers just the subset of functionality I was using, and is much faster than the former for this job. Thank you Aristotle!

$sth->prepare("INSERT INTO Blah (a,b,c) VALUES (?,?,?)"); tie *TSV, 'File::ReadBackwards', $filename; while (<TSV>) { my @insert = processRow($_); #this splits the records into an array +and does some data cleaning $sth->execute(@insert); #And check for errors and all that... truncate $filename, tell TSV; #pops the last line } untie *TSV; unless ($errors) { unlink $filename; }
Thanks to all the Monks for your wisdom.
--
$me = rand($hacker{perl});

Replies are listed 'Best First'.
Re: Better way to work with large TSV files?
by dragonchild (Archbishop) on Aug 20, 2004 at 15:54 UTC
    Convert your file of data into a file of SQL statements. Then, run the SQL statements in a transaction using your RDBMS's standard client (which will be loads quicker than DBI).
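
    For illustration, a rough sketch of that conversion step in Perl (the table and column names are borrowed from the code above; the quoting here is deliberately naive, and a real script should use $dbh->quote or the target client's own escaping rules):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Turn each TSV row into an INSERT statement, wrapped in one
        # transaction, ready to feed to the RDBMS's command-line client.
        my ($in, $out) = @ARGV;
        open my $tsv, '<', $in  or die "open $in: $!";
        open my $sql, '>', $out or die "open $out: $!";

        print {$sql} "BEGIN;\n";
        while (my $line = <$tsv>) {
            chomp $line;
            my @f = split /\t/, $line;
            s/'/''/g for @f;                  # naive SQL quoting
            printf {$sql} "INSERT INTO Blah (a,b,c) VALUES ('%s','%s','%s');\n",
                @f[0..2];
        }
        print {$sql} "COMMIT;\n";
        close $sql or die "close $out: $!";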

    Of course, you could dispense with DBI altogether and use your RDBMS's loader. Oracle and MySQL both support most xSV formats (such as TSV, CSV, PSV, etc.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Better way to work with large TSV files?
by Aristotle (Chancellor) on Aug 20, 2004 at 15:55 UTC

    Using a mass loader would indeed be the fastest option by far.

    If you have to stick with Perl, though, you can certainly gain speed. Tie::File has to do a lot of bookkeeping to a) find all the line endings and b) update the file per your wishes. Maybe File::ReadBackwards is what you need: it is written to handle precisely your case efficiently.

    Makeshifts last the longest.

      How is File::ReadBackwards a substitute for Tie::File for this application?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        $sth->prepare( "INSERT INTO Blah ( a, b, c ) VALUES ( ?, ?, ? )" ); tie *TSV, 'File::ReadBackwards', $filename; while( <TSV> ) { my @insert = processRow( $_ ); $sth->execute( @insert ); truncate $filename, tell TSV; } close TSV; unlink $filename unless $errors;

        Makeshifts last the longest.

Re: Better way to work with large TSV files?
by BrowserUk (Patriarch) on Aug 20, 2004 at 16:00 UTC

    If you're using MySQL then look at using LOAD DATA INFILE. Most other RDBMSs will have something similar.
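
    A minimal sketch of issuing that statement through DBI, assuming DBD::mysql with LOCAL enabled on both client and server (the DSN, table and column names are just placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Let the server parse and load the TSV itself, instead of sending
        # one INSERT per row over the WAN link.
        my $dbh = DBI->connect(
            'dbi:mysql:database=mydb;mysql_local_infile=1',
            'user', 'pass', { RaiseError => 1 },
        );

        $dbh->do(q{
            LOAD DATA LOCAL INFILE 'data.tsv'
            INTO TABLE Blah
            FIELDS TERMINATED BY '\t'
            LINES TERMINATED BY '\n'
            (a, b, c)
        });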


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Better way to work with large TSV files?
by graff (Chancellor) on Aug 21, 2004 at 00:20 UTC
    Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise) is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?

    (update: for that matter, might be faster/easier to put the 50 GB on a USB or firewire disk and fedex it to remote host... )

    If you really do need to do all those row inserts over the WAN link, and if you are confident about knowing the difference between success and failure for each insert, then my first notion would be:

    • configure the insertion script to read from STDIN
    • write a line (anything) to STDOUT after each successful insert
    • start it the first time on a given TSV.file like this:
      insertion_script < TSV.file > log.file
    If there's an interruption in the process, just count the lines in the log file and add one (let's say this is "N", the first line not yet inserted), then restart the insertion program like this:
    tail +N TSV.file | insertion_script >> log.file
    You can repeat this as many times as necessary, using the appropriate value for N each time (one more than the number of lines currently in the log), until the log file has the same line count as the TSV file.

    (In case you don't know "tail", it's a basic unix util, which could easily be emulated in perl as follows:

    #!/usr/bin/perl
    if ( @ARGV and $ARGV[0] =~ /^\+(\d+)/ ) {
        $bgn = $1;
        shift;
    }
    while (<>) {
        print if $. >= $bgn;
    }
    this just handles the "+N" usage of tail, which is all you need here; there's probably a one-liner form to do the same thing...)
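
    Putting the pieces above together, a rough sketch of what such an insertion script might look like (the DSN and credentials are placeholders, and processRow() is only a stand-in for the data-cleaning step from the original post):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Read rows from STDIN, insert each one, and print a line to STDOUT
        # only after the insert has succeeded, so the log line count always
        # equals the number of rows safely in the database.
        sub processRow { (split /\t/, shift)[0..2] }   # placeholder for the real cleaning

        my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 1 });
        my $sth = $dbh->prepare('INSERT INTO Blah (a,b,c) VALUES (?,?,?)');

        $| = 1;                          # flush STDOUT after every print
        while (my $line = <STDIN>) {
            chomp $line;
            my @insert = processRow($line);
            $sth->execute(@insert);
            print "ok\n";                # one log line per successful insert
        }
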
      Trying to do database operations on that scale, over any sort of WAN link (slow or otherwise) is a tad risky and sub-optimal, I think. Can't you find a way to (s)ftp the TSV file(s) to the database host, then use the database server's native bulk-loader tool on that host to finish the job?

      (update: for that matter, might be faster/easier to put the 50 GB on a USB or firewire disk and fedex it to remote host... )

      Therein lies the problem. This is not a one-time operation, but something done on a regular basis. Also, I am doing quite a bit of data cleaning -- the raw data files are generated by a crappy program over which I have no control; the 'date' fields, for example, don't always have valid 'datetime' data in them, so I must convert before insertion.

      As much as I'd love to just mass-load the thing, it cannot be so.

      You can repeat this as many times as necessary, using the appropriate value for N each time (one more than the number of lines currently in the log), until the log file has the same line count as the TSV file.
      Some of these files are 500,000 lines long. That's a big log to scan just to find a restart point. I'm not sure that would be a good idea, but it's an interesting way to think about the problem.

      After reading these replies, I'm considering loading the file into memory a 50MB (roughly) chunk at a time, inserting each chunk line-by-line while recording the current line in a file. Something like:

      tie %status, 'Tie::Cfg', READ => $statfile, WRITE => $statfile;
      $status{current_line} = 0;
      do {
          @chunk_lines = load_chunk_from($file);
          for (@chunk_lines) {
              process_line($_);
              $status{current_line}++;
          }
      } while (@chunk_lines);
      Then, of course, I'd read the status file on start to see if I'm mid-file and skip the appropriate number of lines before resuming. What do you all think? Is this a better approach?
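
      A sketch of what that resume step might look like, following the Tie::Cfg usage in the snippet above ($statfile and process_line() are the same placeholders as before):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Tie::Cfg;

          # Resume logic for the chunked loader sketched above: read the saved
          # line count from the status file and skip that many lines before
          # processing continues.
          sub process_line { my ($line) = @_; }   # placeholder for the real insert/clean step

          my ($file, $statfile) = @ARGV;
          tie my %status, 'Tie::Cfg', READ => $statfile, WRITE => $statfile;

          my $done = $status{current_line} || 0;  # lines finished on a previous run
          open my $tsv, '<', $file or die "open $file: $!";

          while (my $line = <$tsv>) {
              next if $. <= $done;                # already inserted on an earlier run
              process_line($line);
              $status{current_line} = $.;
          }
          untie %status;
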
      --
      $me = rand($hacker{perl});
Re: Better way to work with large TSV files?
by iburrell (Chaplain) on Aug 20, 2004 at 20:40 UTC
    Modifying a data file seems like a bad solution. A better option is to store the line number reached for each file. If there is a failure, skip to that line number, and start processing again.

    I would write the filename and line number to a log file after a block of lines. 1000 lines is probably a good size. The log file becomes a nice way to measure the progress, and to record what has been done. When there is a failure, find the last line for each file in the log.

    Making transactions larger is a big performance boost for bulk inserts. I would make the transaction group the same size as the group for writing line numbers. For each group, I would write the log entry and commit.
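
    A sketch of that combination, committing and logging once per 1000-row block (the DSN, credentials and processRow() are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;
        use IO::Handle;

        # Commit and record progress once per block of rows, so a failure
        # costs at most one block's worth of work.
        sub processRow { (split /\t/, shift)[0..2] }   # placeholder for the real cleaning

        my ($file) = @ARGV;
        my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 0 });
        my $sth = $dbh->prepare('INSERT INTO Blah (a,b,c) VALUES (?,?,?)');

        open my $tsv, '<', $file        or die "open $file: $!";
        open my $log, '>>', "$file.log" or die "open $file.log: $!";
        $log->autoflush(1);

        my $block = 1000;
        while (my $line = <$tsv>) {
            chomp $line;
            $sth->execute(processRow($line));
            unless ($. % $block) {
                $dbh->commit;
                print {$log} "$file $.\n";   # filename and last committed line
            }
        }
        $dbh->commit;
        print {$log} "$file $. done\n";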

Re: Better way to work with large TSV files?
by Anonymous Monk on Aug 23, 2013 at 01:18 UTC
    I'm finding that most databases are not happy with these sizes. I use Java on the server so I don't know the Perl equivalent, but what works for fast access (a web application) is to leave the TSV alone, then build binary index files for the queried fields. Each column gets a subdirectory, and each value gets a file which is a list of 64-bit offsets into the original TSV for the records with that value. If your filesystem can handle it (ext3?), this works for lots of unique values, and even range searches, and of course you can sort the filenames to get the results back in a certain order.

    Multi-column queries are handled by taking the intersection (AND) or union (OR) of the pointer lists. Putting a little effort into figuring out which column/value is the smallest for a starting point helps with AND. Once you have your final list of pointers, you can use RandomAccessFile to fetch the corresponding records quickly and add them to the response.

    I know this sounds like building from scratch, but search engines use a similar technique, and I have spent far less time doing it the right way than softening my head on various DBMSs and related nuances. It is also very memory-friendly and fast. This works well for query applications; I rebuild indexes for 7 columns on a 20 GB TSV file and it is good to go, so putting up an updated TSV is fairly trivial too.
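
    For anyone who does want to try this from Perl, here is a rough sketch of the index-building step for a single column (the paths and column number are made up, and the 64-bit pack format assumes a 64-bit Perl; value-to-filename mapping would need sanitizing in real use):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Path qw(make_path);

        # For one chosen column, append each record's byte offset (as a packed
        # 64-bit integer) to a per-value index file, one file per distinct value.
        my ($tsv_file, $index_dir, $col) = ('data.tsv', 'index/colA', 0);

        make_path($index_dir);
        open my $tsv, '<', $tsv_file or die "open $tsv_file: $!";

        my %fh;                          # one output handle per distinct value
        my $offset = tell $tsv;          # byte offset of the line about to be read
        while (my $line = <$tsv>) {
            chomp $line;
            my $value = (split /\t/, $line)[$col];
            unless ($fh{$value}) {
                open $fh{$value}, '>>', "$index_dir/$value.idx"
                    or die "open index for '$value': $!";
                binmode $fh{$value};
            }
            print {$fh{$value}} pack('Q>', $offset);   # 64-bit big-endian offset
            $offset = tell $tsv;
        }
        close $_ for values %fh;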