Re: many to many join on text files
by merlyn (Sage) on Apr 14, 2004 at 23:03 UTC
Re: many to many join on text files
by Ovid (Cardinal) on Apr 14, 2004 at 23:10 UTC
merlyn was abrupt, but he wasn't kidding. If you can't install a database, try the DBD::SQLite module. It's self-contained (no configuration!), fast, and reliable. Doing this with text files is begging for trouble.
Another option would be to use the latest version of DBI with the new DBD::DBM module, a DBI interface to DBM and MLDBM files. Very handy.
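For the DBD::SQLite route, a minimal sketch of what that looks like through DBI (the database file, table, and column names here are only illustrative stand-ins for the real data):

use strict;
use warnings;
use DBI;

# Minimal DBD::SQLite example; file, table, and column names are
# illustrative placeholders, not the OP's actual schema.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=join.db', '', '',
                        { RaiseError => 1 } );
$dbh->do('CREATE TABLE holds  (id TEXT, rest TEXT)');
$dbh->do('CREATE TABLE copies (id TEXT, rest TEXT)');

# ... load the two flat files into the tables, then let SQL do the join:
my $rows = $dbh->selectall_arrayref(
    'SELECT h.id, h.rest, c.rest FROM holds h JOIN copies c ON h.id = c.id'
);
print scalar(@$rows), " matching rows\n";
$dbh->disconnect;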
I know that DBD::SQLite along with DBI is a fast combination, but if the OP is looking for a fast way to join columns across a pair of 1,000,000-line files, I have to wonder whether SQLite is the best advice. The OP seemed to be looking for a very fast way to join up columns from two flat files. If he's talking about doing an individual insert here and there, the DB is probably great advice. But if it's a one-shot affair, the DB isn't all that fast a solution.
Don't get me wrong; I happen to really like the DBD::SQLite module; I use it myself. But let me give a recent example:
Today I wrote a brief throwaway script to parse a single 3500-record file with two columns per record and convert it into a three-column database table (the first column is a unique key). The processing time on my machine was about four minutes for the insertion of those 3500 records. That was OK for me, because I was after the advantages a DB brings down the road, and didn't care about initial creation time.
At 3500 records per four minutes (roughly 875 per minute), 1,000,000 records works out to about 1,140 minutes, so it could take roughly 19 hours to INSERT 1,000,000 lines.
Here's a pseudo-code description of how I managed to take 4 minutes to INSERT 3500 records into a new table.
use DBI;

local $/ = "****\n";    # records are delimited by lines of asterisks

# database file and table name below are placeholders from memory
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1 } );
my $sth = $dbh->prepare('INSERT INTO freq_table VALUES ( ?, ?, ? )');

open my $infile, "<", 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    $sth->execute( undef, $freq, $desc );    # undef leaves the key column NULL
}
$sth->finish();
close $infile;
$dbh->disconnect();
Again, that's just some pseudo-code from memory, but I was surprised to see how much longer it took to INSERT three columns into a new table as opposed to simply creating a new flat-file with three virtual columns per record. Manipulating the same input file and spitting out a flat-file took just a few seconds by comparison.
On the other hand, queries are lightning fast. And once the DB has been created, additional inserts are much faster than trying to "insert" something into the middle of a flat-file. But if initial creation time is the design factor, the DB solution isn't all that snappy.
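A likely contributor to that four-minute figure is DBI's default AutoCommit, which makes every execute its own transaction; SQLite pays a heavy per-transaction cost. A rough sketch of the same loop with a single commit at the end (same placeholder database and table names as above), which often cuts the time dramatically:

use strict;
use warnings;
use DBI;

# Same insert loop, but with AutoCommit off so all rows share one
# transaction; database file and table names are placeholders.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );
my $sth = $dbh->prepare('INSERT INTO freq_table VALUES ( ?, ?, ? )');

local $/ = "****\n";
open my $infile, '<', 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    $sth->execute( undef, $freq, $desc );
}
$dbh->commit;                # one commit for the whole batch
close $infile;
$dbh->disconnect;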
It's not clear whether your slow insert was with DBD::SQLite or with a 'real' DB. 'Real' databases generally have a mechanism for importing data from a flat file (CSV or tab-delimited values) which runs much faster than any script.
If you're going to have a field which you do not populate, why not configure the DB to provide a default of NULL, or write the SQL statement so that the field is not a variable but is hard-coded to NULL?
--
TTTATCGGTCGTTATATAGATGTTTGCA
Assuming SQLite supports it, your inserts could be greatly sped up either by using extended-insert syntax like "...VALUES (a,b,c),(d,e,f),(g,h,i)" or by using some kind of raw-data/CSV import utility. The more data you can fit per query, the faster it will run.
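Assuming the SQLite build underneath DBD::SQLite does accept a multi-row VALUES list (older builds may not), the batching could look something like this in Perl, with the same placeholder file and table names as above:

use strict;
use warnings;
use DBI;

# Sketch of the extended-insert idea: accumulate rows, then send them
# in one multi-row INSERT.  Assumes the underlying SQLite accepts
# "VALUES (...),(...),..."; names are placeholders.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1 } );

my $batch_size = 500;
my @rows;                                  # each element: [ key, freq, desc ]

sub flush_batch {
    return unless @rows;
    my $values = join ',', ('(?,?,?)') x @rows;
    $dbh->do( "INSERT INTO freq_table VALUES $values",
              undef, map { @$_ } @rows );
    @rows = ();
}

local $/ = "****\n";
open my $infile, '<', 'inputfile.txt' or die $!;
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    push @rows, [ undef, $freq, $desc ];
    flush_batch() if @rows >= $batch_size;
}
flush_batch();                             # send any leftover partial batch
close $infile;
$dbh->disconnect;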
perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'
Re: many to many join on text files
by tachyon (Chancellor) on Apr 14, 2004 at 23:13 UTC
As merlyn says, an RDBMS is probably the best solution; however, if the task is simple and the data sorted, you can possibly do it in Perl much faster than with an RDBMS. What precisely do you want to do? More specifically, is the data sorted (or could it be), and are the keys unique or non-unique? I.e., given this input, is there a neat way in Perl to generate this output?
Yeah... I can sort it with the "sort" shell command. I'll look into SQLite in a moment... hope it supports a full outer join, as that's what I need.
...full outer join is not implemented in DBD::SQLite... damn.
If I go for an RDBMS it'll take me ages just to load up the tables, as they have varying numbers of columns, and a lot of them (max 47 columns in a table). The data is UTF-8 with Chinese characters in some columns. I'm just matching on the first column, which is a plain old long number.
Trying to do the full join now in Perl. It's already very slow... and this is just on the 1000-line subsets of the files!
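For what it's worth, a full outer join can usually be emulated in SQLite with two LEFT JOINs glued together by UNION ALL; a rough sketch, again against illustrative holds/copies tables keyed on an id column:

use strict;
use warnings;
use DBI;

# Emulating FULL OUTER JOIN (which SQLite lacks) with two LEFT JOINs:
# the second SELECT adds the copies rows that matched nothing in holds.
# Table and column names are illustrative.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=join.db', '', '',
                        { RaiseError => 1 } );
my $rows = $dbh->selectall_arrayref(<<'SQL');
    SELECT h.*, c.*
      FROM holds h LEFT JOIN copies c ON h.id = c.id
    UNION ALL
    SELECT h.*, c.*
      FROM copies c LEFT JOIN holds h ON h.id = c.id
     WHERE h.id IS NULL
SQL
print scalar(@$rows), " joined rows\n";
$dbh->disconnect;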
Finished writing the full join in Perl (after which I have to do a sort and uniq to strip duplicates)... it's been running for 10 minutes now and still hasn't finished processing the first file... here's the code:
# Pass 1: for every hold, rescan the entire copies file looking for a match.
open(HOLDS, "<holds") or die;
while ($hold = <HOLDS>) {
    chomp $hold;
    @holds  = split(/\|/, $hold, -1);
    $lookup = $holds[0];
    open(COPIES, "<copies") or die;
    undef $matched;
    while ($copy = <COPIES>) {
        chomp $copy;
        @copies     = split(/\|/, $copy, -1);
        $matchfield = $copies[0];
        if ($lookup eq $matchfield) {
            $matched = 1;
            print "hold and copy\n";
        }
    }
    if (!$matched) {
        print "hold\n";
    }
    close COPIES;
}
close HOLDS;

# Pass 2: the mirror image -- for every copy, rescan the entire holds file.
open(COPIES, "<copies") or die;
while ($copy = <COPIES>) {
    chomp $copy;
    @copies = split(/\|/, $copy, -1);
    $lookup = $copies[0];
    open(HOLDS, "<holds") or die;
    undef $matched;
    while ($hold = <HOLDS>) {
        chomp $hold;
        @holds      = split(/\|/, $hold, -1);
        $matchfield = $holds[0];
        if ($lookup eq $matchfield) {
            $matched = 1;
            print "copy and hold\n";
        }
    }
    if (!$matched) {
        print "copy\n";
    }
    close HOLDS;
}
close COPIES;
Re: many to many join on text files
by castaway (Parson) on Apr 15, 2004 at 05:33 UTC
From the example, it looks like you want to merge-sort the data, i.e. get both files into the same order and then combine matching rows into something. I suspect Sort::Merge might be helpful. As others have said, having both files open in parallel is probably a lot faster than re-opening and re-searching each one every time.
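For instance, with both files already sorted on the first field (and, to keep the sketch short, assuming each key appears at most once per file), the parallel walk might look like this; genuine many-to-many keys would need the run of equal keys buffered on each side:

use strict;
use warnings;

# Parallel walk over two files that are both sorted on the first field.
# For brevity this assumes each key occurs at most once per file; the
# string comparison matches the shell sort's default (lexical) order.
open my $h, '<', 'holds'  or die "holds: $!";
open my $c, '<', 'copies' or die "copies: $!";

my $hold = <$h>;
my $copy = <$c>;
while ( defined $hold or defined $copy ) {
    my $hk = defined $hold ? ( split /\|/, $hold, 2 )[0] : undef;
    my $ck = defined $copy ? ( split /\|/, $copy, 2 )[0] : undef;
    if ( defined $hk and defined $ck and $hk eq $ck ) {
        print "hold and copy\n";                 # key present in both files
        $hold = <$h>;
        $copy = <$c>;
    }
    elsif ( defined $hk and ( !defined $ck or $hk lt $ck ) ) {
        print "hold\n";                          # key only in holds
        $hold = <$h>;
    }
    else {
        print "copy\n";                          # key only in copies
        $copy = <$c>;
    }
}
close $h;
close $c;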
C.
Re: many to many join on text files
by jdporter (Paladin) on Apr 15, 2004 at 17:48 UTC
The following was designed to optimize for memory usage. It stores only a hash, whose keys are the values of the key field seen in one of the files, and whose values are arrays of integers. I guess that means that if one file has a million lines, there will be a million numbers stored in memory, plus the overhead for the hash and however many arrays.
Despite this memory optimization, I believe this is pretty efficient. The cost is a second, random, read of one of the files.
# first, scan the first file, noting the file pos's on which each key occurs.
my %key_pos_in_first_file; # key=key, val=array of file positions.
open F1, "+< $first_file" or die "open $first_file for random read - $!\n";
my $p1 = 0;
while (<F1>) {
    chomp;
    my @l = split /\|/;
    push @{ $key_pos_in_first_file{ $l[0] } }, $p1;
    $p1 = tell F1;
}
# second, go through the second file, joining.
open F2, "< $second_file" or die "open $second_file for read - $!\n";
while (<F2>) {
    chomp;
    my @l2 = split /\|/;
    # go to each pos in the first file and use that line
    for my $p1 ( @{ $key_pos_in_first_file{ $l2[0] } } ) {
        seek F1, $p1, 0;
        my $l1 = <F1>;
        chomp $l1;
        my @l1 = split /\|/, $l1;
        # join
        print "@l2 - @l1\n";
    }
}
close F2;
close F1;
jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.