Re: many to many join on text files
by merlyn (Sage) on Apr 14, 2004 at 23:03 UTC
Re: many to many join on text files
by Ovid (Cardinal) on Apr 14, 2004 at 23:10 UTC
merlyn was abrupt, but he wasn't kidding. If you can't install a database, try the DBD::SQLite module. It's self-contained (no configuration!), fast, and reliable. Doing this with text files is begging for trouble.
Another option would be to use the latest version of DBI with the new DBD::DBM module, a DBI interface to DBM and MLDBM files. Very handy.
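For the DBD::SQLite route, a minimal sketch of what that looks like through DBI (the database file, table, and column names here are only illustrative stand-ins for the real data):

use strict;
use warnings;
use DBI;

# Minimal DBD::SQLite example; file, table, and column names are
# illustrative placeholders, not the OP's actual schema.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=join.db', '', '',
                        { RaiseError => 1 } );
$dbh->do('CREATE TABLE holds  (id TEXT, rest TEXT)');
$dbh->do('CREATE TABLE copies (id TEXT, rest TEXT)');

# ... load the two flat files into the tables, then let SQL do the join:
my $rows = $dbh->selectall_arrayref(
    'SELECT h.id, h.rest, c.rest FROM holds h JOIN copies c ON h.id = c.id'
);
print scalar(@$rows), " matching rows\n";
$dbh->disconnect;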
I know that DBD::SQLite along with DBI is a fast combination, but if the OP is looking for a fast way to join columns across a pair of 1,000,000-line files, I have to wonder whether SQLite is the best advice. The OP seemed to be looking for a very fast way to join up columns from two flat files. If he's talking about doing an individual insert here and there, the DB is probably great advice. But if it's a one-shot affair, the DB isn't all that fast a solution.
Don't get me wrong; I happen to really like the DBD::SQLite module; I use it myself. But let me give a recent example:
Today I wrote a brief throwaway script to parse a single 3500-record file with two columns per record and convert it into a three-column database table (the first column is a unique key). The processing time on my machine was about four minutes for the insertion of those 3500 records. That was OK for me, because I was after the advantages a DB brings down the road, and didn't care about initial creation time.
At 3500 records per four minutes (roughly 875 per minute), 1,000,000 records works out to about 1,140 minutes, so it could take roughly 19 hours to INSERT 1,000,000 lines.
Here's a pseudo-code description of how I managed to take 4 minutes to INSERT 3500 records into a new table.
use DBI;

local $/ = "****\n";    # records are delimited by lines of asterisks

# database file and table name below are placeholders from memory
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1 } );
my $sth = $dbh->prepare('INSERT INTO freq_table VALUES ( ?, ?, ? )');

open my $infile, "<", 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    $sth->execute( undef, $freq, $desc );    # undef leaves the key column NULL
}
$sth->finish();
close $infile;
$dbh->disconnect();
Again, that's just some pseudo-code from memory, but I was surprised to see how much longer it took to INSERT three columns into a new table as opposed to simply creating a new flat-file with three virtual columns per record. Manipulating the same input file and spitting out a flat-file took just a few seconds by comparison.
On the other hand, queries are lightning fast. And once the DB has been created, additional inserts are much faster than trying to "insert" something into the middle of a flat-file. But if initial creation time is the design factor, the DB solution isn't all that snappy.
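A likely contributor to that four-minute figure is DBI's default AutoCommit, which makes every execute its own transaction; SQLite pays a heavy per-transaction cost. A rough sketch of the same loop with a single commit at the end (same placeholder database and table names as above), which often cuts the time dramatically:

use strict;
use warnings;
use DBI;

# Same insert loop, but with AutoCommit off so all rows share one
# transaction; database file and table names are placeholders.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );
my $sth = $dbh->prepare('INSERT INTO freq_table VALUES ( ?, ?, ? )');

local $/ = "****\n";
open my $infile, '<', 'inputfile.txt' or die "Bleah.\n$!";
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    $sth->execute( undef, $freq, $desc );
}
$dbh->commit;                # one commit for the whole batch
close $infile;
$dbh->disconnect;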
It's not clear whether your slow insert was with DBD::SQLite or with a 'real' DB. 'Real' databases generally have a mechanism for importing data from a flat file (CSV or tab-delimited values) which runs much faster than any script.
If you're going to have a field which you do not populate, why not configure the DB to provide a default of NULL, or write the SQL statement so that the field is not a variable but is hard-coded to NULL?
--
TTTATCGGTCGTTATATAGATGTTTGCA
Assuming SQLite supports it, your inserts could be greatly sped up either by using extended-insert syntax like "...VALUES (a,b,c),(d,e,f),(g,h,i)" or by using some kind of raw-data/CSV import utility. The more data you can fit per query, the faster it will run.
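Assuming the SQLite build underneath DBD::SQLite does accept a multi-row VALUES list (older builds may not), the batching could look something like this in Perl, with the same placeholder file and table names as above:

use strict;
use warnings;
use DBI;

# Sketch of the extended-insert idea: accumulate rows, then send them
# in one multi-row INSERT.  Assumes the underlying SQLite accepts
# "VALUES (...),(...),..."; names are placeholders.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
                        { RaiseError => 1 } );

my $batch_size = 500;
my @rows;                                  # each element: [ key, freq, desc ]

sub flush_batch {
    return unless @rows;
    my $values = join ',', ('(?,?,?)') x @rows;
    $dbh->do( "INSERT INTO freq_table VALUES $values",
              undef, map { @$_ } @rows );
    @rows = ();
}

local $/ = "****\n";
open my $infile, '<', 'inputfile.txt' or die $!;
while ( my $rec = <$infile> ) {
    chomp $rec;
    my ( $freq, $desc ) = split /\s+=\s+/, $rec;
    push @rows, [ undef, $freq, $desc ];
    flush_batch() if @rows >= $batch_size;
}
flush_batch();                             # send any leftover partial batch
close $infile;
$dbh->disconnect;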
perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'
Re: many to many join on text files
by tachyon (Chancellor) on Apr 14, 2004 at 23:13 UTC
As merlyn says, an RDBMS is probably the best solution; however, if the task is simple and the data sorted, you can possibly do it in Perl much faster than with an RDBMS. What precisely do you want to do? More specifically, is the data sorted (or could it be), and are the keys unique or non-unique? I.e., given this input, is there a neat way in Perl to generate this output?
Yeah... I can sort it with the "sort" shell command. I'll look into SQLite in a moment... hope it supports a full outer join, as that's what I need.
...full outer join is not implemented in DBD::SQLite... damn.
If I go for an RDBMS it'll take me ages just to load up the tables, as they have varying numbers of columns, and a lot of them (max 47 columns in a table). The data is UTF-8 with Chinese characters in some columns. I'm just matching on the first column, which is a plain old long number.
Trying to do the full join now in Perl. It's already very slow... and this is just on the 1000-line subsets of the files!
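For what it's worth, a full outer join can usually be emulated in SQLite with two LEFT JOINs glued together by UNION ALL; a rough sketch, again against illustrative holds/copies tables keyed on an id column:

use strict;
use warnings;
use DBI;

# Emulating FULL OUTER JOIN (which SQLite lacks) with two LEFT JOINs:
# the second SELECT adds the copies rows that matched nothing in holds.
# Table and column names are illustrative.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=join.db', '', '',
                        { RaiseError => 1 } );
my $rows = $dbh->selectall_arrayref(<<'SQL');
    SELECT h.*, c.*
      FROM holds h LEFT JOIN copies c ON h.id = c.id
    UNION ALL
    SELECT h.*, c.*
      FROM copies c LEFT JOIN holds h ON h.id = c.id
     WHERE h.id IS NULL
SQL
print scalar(@$rows), " joined rows\n";
$dbh->disconnect;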
Finished writing the full join in Perl (after which I have to do a sort and uniq to strip duplicates)... it's been running for 10 minutes now and still hasn't finished processing the first file... here's the code:
# Pass 1: for every hold, rescan the entire copies file looking for a match.
open(HOLDS, "<holds") or die;
while ($hold = <HOLDS>) {
    chomp $hold;
    @holds  = split(/\|/, $hold, -1);
    $lookup = $holds[0];
    open(COPIES, "<copies") or die;
    undef $matched;
    while ($copy = <COPIES>) {
        chomp $copy;
        @copies     = split(/\|/, $copy, -1);
        $matchfield = $copies[0];
        if ($lookup eq $matchfield) {
            $matched = 1;
            print "hold and copy\n";
        }
    }
    if (!$matched) {
        print "hold\n";
    }
    close COPIES;
}
close HOLDS;

# Pass 2: the mirror image -- for every copy, rescan the entire holds file.
open(COPIES, "<copies") or die;
while ($copy = <COPIES>) {
    chomp $copy;
    @copies = split(/\|/, $copy, -1);
    $lookup = $copies[0];
    open(HOLDS, "<holds") or die;
    undef $matched;
    while ($hold = <HOLDS>) {
        chomp $hold;
        @holds      = split(/\|/, $hold, -1);
        $matchfield = $holds[0];
        if ($lookup eq $matchfield) {
            $matched = 1;
            print "copy and hold\n";
        }
    }
    if (!$matched) {
        print "copy\n";
    }
    close HOLDS;
}
close COPIES;
Re: many to many join on text files
by castaway (Parson) on Apr 15, 2004 at 05:33 UTC
From the example, it looks like you want to merge-sort the data, i.e. get both files into the same order and then combine matching rows into something. I suspect Sort::Merge might be helpful. As others have said, having both files open in parallel is probably a lot faster than re-opening and re-searching each one every time.
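For instance, with both files already sorted on the first field (and, to keep the sketch short, assuming each key appears at most once per file), the parallel walk might look like this; genuine many-to-many keys would need the run of equal keys buffered on each side:

use strict;
use warnings;

# Parallel walk over two files that are both sorted on the first field.
# For brevity this assumes each key occurs at most once per file; the
# string comparison matches the shell sort's default (lexical) order.
open my $h, '<', 'holds'  or die "holds: $!";
open my $c, '<', 'copies' or die "copies: $!";

my $hold = <$h>;
my $copy = <$c>;
while ( defined $hold or defined $copy ) {
    my $hk = defined $hold ? ( split /\|/, $hold, 2 )[0] : undef;
    my $ck = defined $copy ? ( split /\|/, $copy, 2 )[0] : undef;
    if ( defined $hk and defined $ck and $hk eq $ck ) {
        print "hold and copy\n";                 # key present in both files
        $hold = <$h>;
        $copy = <$c>;
    }
    elsif ( defined $hk and ( !defined $ck or $hk lt $ck ) ) {
        print "hold\n";                          # key only in holds
        $hold = <$h>;
    }
    else {
        print "copy\n";                          # key only in copies
        $copy = <$c>;
    }
}
close $h;
close $c;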
C.
Re: many to many join on text files
by jdporter (Paladin) on Apr 15, 2004 at 17:48 UTC
The following was designed to optimize for memory usage. It stores only a hash, whose keys are the values of the key field seen in one of the files, and whose values are arrays of integers. I guess that means that if one file has a million lines, there will be a million numbers stored in memory, plus the overhead for the hash and however many arrays.
Despite this memory optimization, I believe this is pretty efficient. The cost is a second, random, read of one of the files.
# first, scan the first file, noting the file pos's on which each key occurs.
my %key_pos_in_first_file; # key=key, val=array of file positions.
open F1, "+< $first_file" or die "open $first_file for random read - $!\n";
my $p1 = 0;
while (<F1>) {
    chomp;
    my @l = split /\|/;
    push @{ $key_pos_in_first_file{ $l[0] } }, $p1;
    $p1 = tell F1;
}
# second, go through the second file, joining.
open F2, "< $second_file" or die "open $second_file for read - $!\n";
while (<F2>) {
    chomp;
    my @l2 = split /\|/;
    # go to each pos in the first file and use that line
    for my $p1 ( @{ $key_pos_in_first_file{ $l2[0] } } ) {
        seek F1, $p1, 0;
        my $l1 = <F1>;
        chomp $l1;
        my @l1 = split /\|/, $l1;
        # join
        print "@l2 - @l1\n";
    }
}
close F2;
close F1;
jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.