Re: join on 6 huge files
by BrowserUk (Patriarch) on Jun 10, 2004 at 13:06 UTC
Where's the problem?
Open all six files, read one line from each, output one munged line, and repeat.
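For what it's worth, a minimal (untested) sketch of that loop might look like the following. It assumes the six files really are in the same key order, that each line looks like SHA-1<tab>7_hex_chars, and that the desired output is the SHA followed by the six hex columns; the file names are placeholders.
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder names; assumes all six inputs share the same key order.
my @names = map { "file$_.txt" } 1 .. 6;

my @fh;
for my $name (@names) {
    open my $h, '<', $name or die "Cannot open $name: $!";
    push @fh, $h;
}

open my $out, '>', 'merged.txt' or die "Cannot open merged.txt: $!";

while ( defined( my $first = readline $fh[0] ) ) {
    chomp $first;
    my ( $sha, @hex ) = split /\t/, $first;

    for my $h ( @fh[ 1 .. $#fh ] ) {
        my $line = readline $h;
        die "Input files are different lengths\n" unless defined $line;
        chomp $line;
        push @hex, ( split /\t/, $line )[1];    # keep only the hex column
    }
    print {$out} join( "\t", $sha, @hex ), "\n";
}
close $out or die "Cannot close merged.txt: $!";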
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: join on 6 huge files
by Art_XIV (Hermit) on Jun 10, 2004 at 13:31 UTC
If you are certain that the SHA entries are going to be in the same order in all six of the files, then reading from each file, one entry at a time, would probably be best.
If the SHA entries in the files aren't in the same order, then I would try to cat the files together, sort them with the sort tool (which can probably sort 56 million lines much faster than Perl), and then let a script read the sorted result.
Slurping shouldn't be necessary... just keep an array of hex strings for the current SHA number and dump it to the out file when you hit a new SHA.
This is just a SWAG, though, and I can't guarantee that sort won't get pathological on you with 56 million lines.
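If it helps, here is a rough sketch of that second pass over the sorted concatenation, collecting the hex strings for the current SHA and dumping them when the key changes. The file names 'sorted.txt' (the output of cat + sort) and 'grouped.txt' are placeholders.
#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'sorted.txt'  or die "Cannot open sorted.txt: $!";
open my $out, '>', 'grouped.txt' or die "Cannot open grouped.txt: $!";

my ( $current, @hex );
while ( my $line = <$in> ) {
    chomp $line;
    my ( $sha, $hex ) = split /\t/, $line;

    if ( defined $current && $sha ne $current ) {
        # Hit a new SHA: dump the strings collected for the previous one.
        print {$out} join( "\t", $current, @hex ), "\n";
        @hex = ();
    }
    $current = $sha;
    push @hex, $hex;
}
print {$out} join( "\t", $current, @hex ), "\n" if defined $current;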
Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
You don't even need to resort the files if they are already sorted,
as sort can merge sorted files:
From (textutils.info)sort:
`sort' has three modes of operation: sort (the default), merge, and
check for sortedness. The following options change the operation mode:
`-m'
     Merge the given files by sorting them as a group. Each input file
     must always be individually sorted. It always works to sort
     instead of merge; merging is provided because it is faster, in the
     case where it works.
This means you won't have all six files open at once:
you open two for reading, merge them into a temporary file,
merge the third with the result etc.
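A rough Perl driver for that pairwise merge might look like this; the file names are placeholders, and it assumes a sort that supports -m and -o.
#!/usr/bin/perl
use strict;
use warnings;

# Merge the already-sorted files two at a time via sort -m, feeding each
# result into the next step. File names are placeholders.
my @files  = map { "file$_.txt" } 1 .. 6;
my $merged = shift @files;

my $step = 0;
for my $next (@files) {
    my $tmp = 'merge_tmp_' . ++$step . '.txt';
    system( 'sort', '-m', '-o', $tmp, $merged, $next ) == 0
        or die "sort -m failed on $merged + $next: $?";
    $merged = $tmp;
}
print "Merged result is in $merged\n";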
...I can't guarantee that sort won't get pathological on you with 56 million lines.
But it may just squeak by with 54 million. Sorry, couldn't resist.
Re: join on 6 huge files
by pbeckingham (Parson) on Jun 10, 2004 at 12:57 UTC
If the six files are sorted by that first column, then this is just (just!?) a six-way merge, which does not require you to slurp.
Is this a one-off? Or are you going to be running this beast often?
The value of "how often is this going to run" cannot be overstated. Along the same lines, how quick does it need to be? If there is a need for speed, I'd start whipping out an RDBMS.
The value of "how often is this going to run" cannot be overstated. Along the same lines, how quick does it need to be? If there is a need for speed, I'd start whipping out an RDBMS.
Sorry, but just how would "whipping out an RDBMS" speed up the merging of six flat data files?
Once they are merged, they're merged. There is no point in repeating the process.
If, however, new flatfiles are received on a regular basis and require merging, how would an RDBMS help?
Even if the process(es) producing the flatfiles could be persuaded to write the data directly to an RDBMS, just querying the 9,000,000 records from the RDBMS would take longer than merging the flatfiles. Considerably longer.
And that's without taking into consideration the time taken to insert the data in the first place. Never mind the cost of amending the (potentially 6) applications to write directly to the RDBMS--if that is even possible.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: join on 6 huge files
by Fletch (Bishop) on Jun 10, 2004 at 13:43 UTC
Man, flashbacks to my CS prof talking about the good ol' days of magtapes (of course I've got a couple of 9-track reels somewhere in a box in the garage so I can't really talk).
At any rate, as has been said there's nothing wrong with having several files open. You're just doing the last phase of a merge sort with six inputs (since you've guaranteed the inputs are already sorted). Just maintain a list of [ $filehandle, $next_token ] pairs and pull off the smallest (and replace it from the handle or remove the pair from the list when the handle hits EOF).
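Something along these lines, perhaps (an untested sketch: the file names are placeholders, and it leans on the SHA-1 field being fixed-width and first on each line, so comparing whole lines orders them by key).
#!/usr/bin/perl
use strict;
use warnings;

# Last phase of the merge sort: keep a list of [ $filehandle, $next_line ]
# pairs, repeatedly emit the pair with the smallest key, and refill it from
# its handle, dropping the pair when the handle hits EOF.
my @pairs;
for my $name ( map { "file$_.txt" } 1 .. 6 ) {
    open my $fh, '<', $name or die "Cannot open $name: $!";
    my $line = <$fh>;
    push @pairs, [ $fh, $line ] if defined $line;
}

while (@pairs) {
    # Find the pair whose pending line has the smallest key.
    my $min = 0;
    for my $i ( 1 .. $#pairs ) {
        $min = $i if $pairs[$i][1] lt $pairs[$min][1];
    }
    print $pairs[$min][1];

    my $next = readline $pairs[$min][0];
    if ( defined $next ) {
        $pairs[$min][1] = $next;
    }
    else {
        splice @pairs, $min, 1;    # this handle hit EOF
    }
}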
Re: join on 6 huge files
by dba (Monk) on Jun 10, 2004 at 14:06 UTC
The Unix utilities sort and join should do the work for you.
Are the rows in each file unique and sorted? Depending on your requirements, use sort with the -u option. You may want to check out the -y option too.
Absolutely right. Doing this in Perl would be a waste of time if this is a one-off, and if there is no prior or subsequent processing. The only value is in implementing a merge sort, which is good practice, I suppose.
For Perl to sort, it first must slurp. The question of whether the input is already sorted therefore matters greatly.
But our OP is absent, and hasn't responded to any questions. So given our lack of information, I agree with dba.
Re: join on 6 huge files
by dragonchild (Archbishop) on Jun 10, 2004 at 13:59 UTC
Have 6 handles open and simulate peeking at the next line for each filehandle. You have to make sure that you don't attempt to read from a handle that's finished, but that's easy. I wrote one of these in less than 4 hours, with testing.
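For instance, one hand-rolled way to fake the peek is a small buffered reader like the one below. The class name and file names are made up for illustration, and the code is (in keeping with the sig below) untested.
#!/usr/bin/perl
use strict;
use warnings;

# A tiny "peekable" reader: the next unread line sits in a buffer, so you
# can look at it without consuming it and can tell when a handle is done.
package PeekReader;

sub new {
    my ( $class, $name ) = @_;
    open my $fh, '<', $name or die "Cannot open $name: $!";
    return bless { fh => $fh, buffer => scalar <$fh> }, $class;
}

sub peek     { $_[0]{buffer} }              # look at the next line, don't consume it
sub finished { !defined $_[0]{buffer} }     # true once the handle is exhausted

sub next_line {
    my $self = shift;
    my $line = $self->{buffer};
    $self->{buffer} = readline $self->{fh};
    return $line;
}

package main;

my @readers = grep { !$_->finished }
              map  { PeekReader->new("file$_.txt") } 1 .. 6;
print $readers[0]->peek if @readers;    # inspect without reading past it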
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re: join on 6 huge files
by graff (Chancellor) on Jun 10, 2004 at 22:54 UTC
I have 6 text files of the format SHA-1<tab>7_hex_chars<newline> and each file has 9,000,000 lines. The 6 files are in the same order, with the SHA-1 being the key column of all the files.
If that's accurate (don't need to sort or worry about unmatched keys), then something like this would be the first thing I would try -- the unix paste and cut commands:
paste file1 file2 file3 file4 file5 file6 | cut -f2,4,6,8,10,12 > hex.cols
That should be pretty close to optimal in terms of performance.
Re: join on 6 huge files
by SageMusings (Beadle) on Jun 12, 2004 at 03:49 UTC
Dwite,
If I understand your problem correctly (and you were a bit hazy), you simply do not want to work with these huge files in memory. Right?
I would open all the files in a stream fashion, much in the spirit of that old Unix standby "sed". Go through each file line by line, as if you were executing a batch process. One line from each input stream is munged at a time, not all at once. Then take the resulting concatenation and write it to the destination file. It's quick, simple, and elegant.
I have written several tools that take a sed-like tack. Some of my files are as large as 12MB and it never chokes.