I know that DBD::SQLite along with DBI is a fast combination, but if the OP is looking for a fast way to join columns from two flat files of 1,000,000 lines each, I have to wonder if SQLite is the best advice. If he's talking about doing an individual insert here and there, the DB is probably great advice. But if it's a one-shot affair, the DB isn't all that fast a solution.
Don't get me wrong; I happen to really like the DBD::SQLite module; I use it myself. But let me give a recent example:
Today I wrote a brief throwaway script to parse a single 3,500-record file of two columns per record and convert it to a three-column database table (the first column is a unique key). The processing time on my machine was about four minutes for the insertion of 3,500 records. This was OK for me, because I was looking for the advantages that a DB can bring down the road, and didn't care about initial creation time.
At 3,500 records in four minutes (about 875 records per minute), 1,000,000 records would need roughly 1,143 minutes. Dividing by 60 gives an approximate estimate of 19 hours to INSERT 1,000,000 lines.
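Written out as a quick calculation, and assuming the insert rate stays constant as the table grows (which was not measured here), the estimate works out like this:

```latex
\[
\frac{3500\ \text{records}}{4\ \text{min}} = 875\ \text{records/min},
\qquad
\frac{1{,}000{,}000\ \text{records}}{875\ \text{records/min}}
\approx 1143\ \text{min}
\approx 19\ \text{hours}
\]
```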
Here's a pseudo-code description of how I managed to take 4 minutes to INSERT 3500 records into a new table.
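use strict;
use warnings;
use DBI;

# Assumed for illustration: the database file name and the table name "records".
my $dbh = DBI->connect( "dbi:SQLite:dbname=records.db", "", "", { RaiseError => 1 } );
local $/ = "****\n";    # records are separated by a line of four asterisks
my $sth = $dbh->prepare("INSERT INTO records VALUES ( ?, ?, ? )");
open my $infile, "<", 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;         # strips the record separator
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    # undef (NULL) lets SQLite assign the key, assuming column one is INTEGER PRIMARY KEY
    $sth->execute( undef, $freq, $desc );
}
$sth->finish();
close $infile;
$dbh->disconnect();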
Again, that's just some pseudo-code from memory, but I was surprised to see how much longer it took to INSERT three columns into a new table as opposed to simply creating a new flat-file with three virtual columns per record. Manipulating the same input file and spitting out a flat-file took just a few seconds by comparison.
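For comparison, the flat-file version of the same job might look something like the sketch below. This is not the original script: the helper name records_to_flatfile, the tab-separated output layout, and the sample records are all made up for illustration. It only assumes the same "****\n" record separator and the same split pattern as the DB version, and it uses in-memory filehandles so it runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: read "freq = desc" records separated by "****\n"
# and write one tab-separated line per record, with a running key in front.
sub records_to_flatfile {
    my ( $in, $out ) = @_;
    local $/ = "****\n";              # same record separator as the DB script
    my $key = 0;
    while ( my $rec = <$in> ) {
        chomp $rec;                   # strips the "****\n" separator
        $rec =~ s/\s+\z//;            # drop the newline left at the end of the record
        next unless length $rec;      # skip empty records
        my ( $freq, $desc ) = split /\s+=\s+/, $rec, 2;
        next unless defined $desc;    # skip records without " = "
        print {$out} join( "\t", ++$key, $freq, $desc ), "\n";
    }
    return $key;                      # number of records written
}

# Demo with made-up sample data and in-memory filehandles.
my $input  = "100 = alpha\n****\n200 = beta\n****\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";
my $count = records_to_flatfile( $in, $out );
close $out;
close $in;
print $output;
```

With real data, the two in-memory handles would simply be replaced by handles opened on the input and output files. Each record is written once and never revisited, which is consistent with the flat-file run finishing in seconds.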
On the other hand, queries are lightning fast. And once the DB has been created, additional inserts are much faster than trying to "insert" something into the middle of a flat-file. But if initial creation time is the design factor, the DB solution isn't all that snappy.