in reply to many to many join on text files

merlyn was abrupt, but he wasn't kidding. If you can't install a database, try the DBD::SQLite module. It's self-contained (no configuration!), fast, and reliable. Doing this with text files is begging for trouble.
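
A minimal sketch of the DBD::SQLite approach, with the database, file, table, and column names invented purely for illustration: load each text file into its own table and let SQL do the join.

    use strict;
    use warnings;
    use DBI;

    # Everything named here (join.db, file_a.txt, the columns) is illustrative.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=join.db", "", "",
                            { RaiseError => 1 } );

    $dbh->do("CREATE TABLE a ( k TEXT, a_val TEXT )");
    $dbh->do("CREATE TABLE b ( k TEXT, b_val TEXT )");

    # Load each flat file into its own table (assumes tab-delimited: key, value).
    for my $spec ( [ 'file_a.txt', 'a' ], [ 'file_b.txt', 'b' ] ) {
        my ( $file, $table ) = @$spec;
        my $sth = $dbh->prepare("INSERT INTO $table VALUES ( ?, ? )");
        open my $fh, '<', $file or die "Can't open $file: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            $sth->execute( split /\t/, $line, 2 );
        }
        close $fh;
    }

    # The many-to-many join itself is then a single query.
    my $rows = $dbh->selectall_arrayref(
        "SELECT a.k, a.a_val, b.b_val FROM a JOIN b ON a.k = b.k"
    );
    $dbh->disconnect();

If the files really are a million lines each, an index on the key columns (CREATE INDEX) before the final SELECT would be worth the extra step.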

Another option would be to use the latest version of DBI and use the new DBD::DBM module, a DBI interface to DBM & MLDBM files. Very handy.

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re: Re: many to many join on text files
by davido (Cardinal) on Apr 15, 2004 at 01:28 UTC
    I know that DBD::SQLite along with DBI is a fast combination, but if the OP is looking for a fast way to join columns from a pair of 1,000,000-line flat files, I have to wonder if SQLite is the best advice. The OP seemed to be looking for a very fast way to join up columns from two flat files. If he's talking about doing an individual insert here and there, the DB is probably great advice. But if it's a one-shot affair, the DB isn't all that fast a solution.

    Don't get me wrong; I happen to really like the DBD::SQLite module; I use it myself. But let me give a recent example:

    Today I wrote a brief throwaway script to parse a single, 3500 record file of two columns per record, and convert it to a three-column database table (first column is a unique key). The processing time on my machine was about four minutes for the insertion of 3500 records. This was ok for me, because I was looking for the advantages that a DB can bring down the road, and didn't care about initial creation time.

    At 3500 records in four minutes (roughly 875 records per minute), 1,000,000 records works out to about 1,140 minutes, which gives me the rough estimate that it could take approximately 19 hours to INSERT 1,000,000 lines.

    Here's a pseudo-code description of how I managed to take 4 minutes to INSERT 3500 records into a new table.

    use DBI;

    # Records in the input file are separated by a line of asterisks.
    local $/ = "****\n";

    # Database and table names are stand-ins; this is pseudo-code from memory.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=freqs.db", "", "",
                            { RaiseError => 1 } );
    my $sth = $dbh->prepare("INSERT INTO freqs VALUES ( ?, ?, ? )");

    open my $infile, "<", 'inputfile.txt' or die "Bleah.\n$!";
    while ( my $rec = <$infile> ) {
        chomp $rec;
        my ( $freq, $desc ) = split /\s+=\s+/, $rec;
        # undef becomes NULL, so the INTEGER PRIMARY KEY autoincrements.
        $sth->execute( undef, $freq, $desc );
    }
    $sth->finish();
    close $infile;
    $dbh->disconnect();

    Again, that's just some pseudo-code from memory, but I was surprised to see how much longer it took to INSERT three columns into a new table as opposed to simply creating a new flat-file with three virtual columns per record. Manipulating the same input file and spitting out a flat-file took just a few seconds by comparison.

    On the other hand, queries are lightning fast. And once the DB has been created, additional inserts are much faster than trying to "insert" something into the middle of a flat-file. But if initial creation time is the design factor, the DB solution isn't all that snappy.


    Dave

      It's not clear whether your slow insert was with DBD::SQLite or with a 'real' DB. 'Real' databases generally have a mechanism for importing data from a flat file of CSV or tab-delimited values, which runs much faster than any script.

      If you're going to have a field which you do not populate, why not configure the DB to provide a default of 'undef', or specify the SQL statement so that field is not a variable but is hard-coded to 'undef'?

      --
      TTTATCGGTCGTTATATAGATGTTTGCA

        Sorry if I was unclear.

        First, yes, I used DBD::SQLite and got the results described. Individual inserts are obviously faster than rewriting a flat-file every time you want to add something. And queries are very fast. But the act of doing 3000 inserts is a lot slower than writing a plain old flat file. Nothing surprising there; obviously writing a flat file should be quicker on a one-shot deal. But the point that I took from the OP's question was that he was creating a joined file from two separate files, and wanted to do so quickly. While maintaining a database efficiently is probably faster than maintaining a flat file, he seemed to be concerned with the creation time, not the ongoing maintenance time.

        To answer your question about the hard-coded undef, check the documentation for SQLite. My first column in the table is an ID column of type INTEGER PRIMARY KEY. I didn't mention this before because it was irrelevant to the discussion. The SQLite documentation states that this field can be populated with an autoincrementing unique index number, and to do so, all you have to do is insert a NULL value into that column; SQLite will handle the rest. And a good way to insert NULL using DBI is to insert undef.

        To your point, it might have been better to hard-code NULL in that field of the statement I pass to $dbh->prepare() instead of using one additional placeholder, but the speed gain is truly minimal. Adding one more placeholder isn't my code's bottleneck (I know because I initially created a similar database a couple of days ago without the ID field and witnessed similar creation time).
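
        For illustration only (the database, table, and column names are made up, and the values are dummies), the two variants look like this in DBI; the difference is just whether the NULL travels as a bind value or sits in the SQL text:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=freqs.db", "", "",
                                { RaiseError => 1 } );
        $dbh->do("CREATE TABLE freqs ( id INTEGER PRIMARY KEY, freq TEXT, descr TEXT )");
        my ( $freq, $desc ) = ( 'some frequency', 'some description' );  # dummy values

        # Variant A: three placeholders; undef is sent as NULL, and SQLite
        # assigns the next value for the INTEGER PRIMARY KEY column.
        my $sth_a = $dbh->prepare("INSERT INTO freqs VALUES ( ?, ?, ? )");
        $sth_a->execute( undef, $freq, $desc );

        # Variant B: hard-code the NULL so only two bind values are needed.
        my $sth_b = $dbh->prepare("INSERT INTO freqs VALUES ( NULL, ?, ? )");
        $sth_b->execute( $freq, $desc );

        $dbh->disconnect();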


        Dave

        I'm not sure what you're implying, since I think SQLite is more of a 'real' DB than some that claim to be. Anyway, SQLite can COPY with the best of 'em.

        C.

      Assuming SQLite supports it, your inserts could be greatly sped up either by using extended INSERT syntax like "...VALUES (a,b,c),(d,e,f),(g,h,i)" or by using some kind of raw data/CSV import utility. The more data you can fit into each query, the faster it will run.
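
      A sketch of what the extended-insert idea might look like with DBD::SQLite, assuming the installed SQLite accepts multi-row VALUES lists (older versions may not); database, table, and file names are placeholders:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( "dbi:SQLite:dbname=freqs.db", "", "",
                              { RaiseError => 1 } );
      $dbh->do("CREATE TABLE freqs ( id INTEGER PRIMARY KEY, freq TEXT, descr TEXT )");

      my $batch = 400;   # keeps the number of bind variables per query modest
      my @rows;          # each element: [ $freq, $desc ]

      open my $in, '<', 'inputfile.txt' or die "Can't open input: $!";
      while ( my $rec = <$in> ) {
          chomp $rec;
          push @rows, [ split /\s+=\s+/, $rec ];
          flush_rows( $dbh, \@rows ) if @rows >= $batch;
      }
      flush_rows( $dbh, \@rows ) if @rows;
      close $in;
      $dbh->disconnect();

      # One INSERT with many ( NULL, ?, ? ) groups instead of one per record.
      sub flush_rows {
          my ( $dbh, $rows ) = @_;
          my $values = join ',', ("( NULL, ?, ? )") x @$rows;
          $dbh->do( "INSERT INTO freqs VALUES $values",
                    undef, map { @$_ } @$rows );
          @$rows = ();
      }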

      perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

        As Castaway pointed out, there is a COPY function incorporated within the SQL that SQLite understands. That should also be considerably faster, and if I were planning on converting a text file again, especially one of greater size than what I dealt with last time, I'd use that.

        The biggest problem with COPY (that I see) is that in order to get the autoincrementing first column, you have to preprocess the flat file by prepending \N to the beginning of each record. Of course that can be accomplished by a one-liner, but it would be nice if the COPY feature of SQLite's SQL provided for automatic padding and such.
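
        For the record, the preprocessing one-liner might be as simple as the following; the tab delimiter, the table name, and the exact COPY syntax are assumptions from memory, so double-check them against the SQLite docs:

        # Prepend \N (NULL for the id column) plus a tab to every record:
        perl -pe 's/^/\\N\t/' inputfile.txt > padded.txt

        # Then load the padded, delimited file:
        #   COPY freqs FROM 'padded.txt';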


        Dave