I know that DBI with DBD::SQLite is a fast combination, but if the OP wants a very fast way to join columns from two flat files on the order of 1,000,000 lines, I have to wonder whether SQLite is the best advice. If he's talking about doing an individual insert here and there, a DB is probably great advice. But if it's a one-shot affair, the DB isn't all that fast a solution.

Don't get me wrong; I happen to really like the DBD::SQLite module; I use it myself. But let me give a recent example:

Today I wrote a brief throwaway script to parse a single 3500-record file with two columns per record and convert it to a three-column database table (the first column is a unique key). Inserting the 3500 records took about four minutes on my machine. That was OK for me, because I was after the advantages a DB can bring down the road and didn't care about initial creation time.

3500 records in four minutes works out to 875 records per minute. At that rate, 1,000,000 records would take about 1,143 minutes, or roughly 19 hours, to INSERT.
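For anyone who wants to check that arithmetic, here is a minimal sketch. The helper name estimate_hours is made up for illustration, and it assumes the insert rate stays constant as the table grows, which is only a guess on my part.

```perl
use strict;
use warnings;

# Extrapolate total load time from a measured sample.
# Assumes a constant insert rate (an assumption, not a measurement).
sub estimate_hours {
    my ( $sample_records, $sample_minutes, $total_records ) = @_;
    my $records_per_minute = $sample_records / $sample_minutes;
    return $total_records / $records_per_minute / 60;
}

# 3500 records in 4 minutes, scaled up to 1,000,000 records.
printf "%.1f hours\n", estimate_hours( 3500, 4, 1_000_000 );    # prints 19.0 hours
```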

Here's a pseudo-code description of how I managed to take 4 minutes to INSERT 3500 records into a new table.

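use strict;
use warnings;
use DBI;

# The database file name and table name are placeholders.
# AutoCommit is left at the DBI default (on).
my $dbh = DBI->connect( "dbi:SQLite:dbname=freqs.db", "", "", { RaiseError => 1 } );
$dbh->do("CREATE TABLE IF NOT EXISTS freqs ( id INTEGER PRIMARY KEY, freq TEXT, descr TEXT )");

local $/ = "****\n";    # records are separated by a line of asterisks
my $sth = $dbh->prepare("INSERT INTO freqs VALUES ( ?, ?, ? )");
open my $infile, "<", 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    $sth->execute( undef, $freq, $desc );    # undef lets SQLite assign the key
}
$sth->finish();
close $infile;
$dbh->disconnect();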

Again, that's just some pseudo-code from memory, but I was surprised to see how much longer it took to INSERT three columns into a new table as opposed to simply creating a new flat-file with three virtual columns per record. Manipulating the same input file and spitting out a flat-file took just a few seconds by comparison.
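For comparison, the flat-file version of the same job might look something like the sketch below. This is not the script I actually ran. The tab-delimited output, the running counter used as the key column, the sub name convert_to_flat_file, and the filenames are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Convert "****"-separated "freq = desc" records into a tab-delimited
# flat file, using a running counter as the key column.
# Returns the number of records written.
sub convert_to_flat_file {
    my ( $in_path, $out_path ) = @_;
    local $/ = "****\n";    # same record separator as the DB version
    open my $in,  '<', $in_path  or die "Can't read $in_path: $!";
    open my $out, '>', $out_path or die "Can't write $out_path: $!";
    my $key = 0;
    while ( my $rec = <$in> ) {
        chomp $rec;
        $rec =~ s/\s+\z//;    # drop the newline left before the separator
        next unless length $rec;
        my ( $freq, $desc ) = split /\s+=\s+/, $rec, 2;
        next unless defined $desc;    # skip records with no " = "
        print {$out} join( "\t", ++$key, $freq, $desc ), "\n";
    }
    close $in;
    close $out or die "Can't close $out_path: $!";
    return $key;
}

# Run as: perl convert.pl inputfile.txt outputfile.txt
convert_to_flat_file(@ARGV) if @ARGV == 2;
```

There is nothing here but sequential reads and buffered writes, which fits with the flat-file run finishing in seconds.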

On the other hand, queries are lightning fast. And once the DB has been created, additional inserts are much faster than trying to "insert" something into the middle of a flat-file. But if initial creation time is the design factor, the DB solution isn't all that snappy.

Dave

In reply to Re: Re: many to many join on text files by davido
in thread many to many join on text files by aquarium
