
I hadn't contemplated what you suggest. While it seems unlikely to be optimum for a one time effort, it appears that it would be quite easy to do and "days" might be ample time to get something of the sort done. It would almost certainly be faster than the current approach and avoids some non-trivial programming that might otherwise be required - and time consuming.

Well, his current algorithm leaves a lot to be desired: O(N^2) for very large N.
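To make the difference concrete, here is a minimal sketch that counts the work done by a nested scan versus a hash lookup. It assumes the OP's approach amounts to rescanning one set of records for every record in the other, which is a guess on my part since the original code isn't shown here. The data (`@big`, `@small`) and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two files.
# Each big record is [ key, payload ]; @small holds the keys we want (1% of @big).
my @big   = map { [ "k$_", "payload$_" ] } 1 .. 2000;
my @small = map { 'k' . ( $_ * 100 ) } 1 .. 20;

# Nested scan: each big record is compared against the wanted keys in turn.
sub nested_scan {
    my( $big, $small ) = @_;
    my( $compares, @hits ) = ( 0 );
    for my $rec ( @$big ) {
        for my $key ( @$small ) {
            ++$compares;
            if( $rec->[0] eq $key ) { push @hits, $rec->[1]; last }
        }
    }
    return ( $compares, \@hits );
}

# Hash lookup: build the hash once, then one probe per big record.
sub hash_lookup {
    my( $big, $small ) = @_;
    my %wanted;
    @wanted{ @$small } = ();          # keys only; the values are never used
    my( $lookups, @hits ) = ( 0 );
    for my $rec ( @$big ) {
        ++$lookups;
        push @hits, $rec->[1] if exists $wanted{ $rec->[0] };
    }
    return ( $lookups, \@hits );
}

my( $n_scan, $hits_scan ) = nested_scan( \@big, \@small );
my( $n_hash, $hits_hash ) = hash_lookup( \@big, \@small );

printf "nested scan: %d comparisons, %d hits\n", $n_scan, scalar @$hits_scan;
printf "hash lookup: %d lookups, %d hits\n",     $n_hash, scalar @$hits_hash;
```

Both subs return the same hits; only the amount of work differs, and the gap widens as either input grows.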

The following trivial program processes a 3GB/40e6 line file against a 1% subset in 13:47 minutes. A 2% subset takes 13:51. Hash lookups being what they are, the size of the smaller file doesn't grossly affect the processing time. So, as long as there is enough memory to build the hash, the size of the second file barely affects the processing time.

It would be interesting to see how long it would take to perform a similar exercise using an RDBMS. On the basis of my previous attempts, it would take longer than that just to load one file. Especially as the format of the data is not conducive to bulk loading, so you'd have to pre-process it to extract the relevant fields anyway. And by the time you've done that (in perl, say), the job is (or could be) done.

#! perl -slw
use strict;

$|++;

my( @f, %lookup );

open SMALLER, '<', 'syssort.2%' or die $!;
@f = split(), undef $lookup{ "$f[0]$;$f[2]" } while <SMALLER>;
close SMALLER;

open BIGGER, '<:perlio', 'syssort' or die $!;
open OUT, '>', 'out' or die $!;
while( <BIGGER> ) {
    printf "\r$." unless $. % 1000;
    my @f = split;
    print OUT "$f[0] $f[2] $f[5]" if exists $lookup{ "$f[0]$;$f[2]" };
}
close OUT;
close BIGGER;

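
A note on the key construction in that program: the two fields are joined with `$;` (the subscript separator, "\034" by default) rather than simply concatenated. The small sketch below shows why a separator matters; the field values are invented for illustration and are not taken from the real data.

```perl
use strict;
use warnings;

my %lookup;
my @f = qw( alpha x beta );    # hypothetical result of split() on one line

# Same key construction as the program: fields 0 and 2 joined by $;
undef $lookup{ "$f[0]$;$f[2]" };

# An explicit join with $; produces the identical key.
my $found = exists $lookup{ join $;, 'alpha', 'beta' };

# Plain concatenation cannot tell ('ab','c') from ('a','bc'); the separator can.
my $plain_clash = ( 'ab' . 'c' ) eq ( 'a' . 'bc' );
my $sep_clash   = join( $;, 'ab', 'c' ) eq join( $;, 'a', 'bc' );

print 'key found via join: ',             ( $found       ? 'yes' : 'no' ), "\n";
print 'plain concatenation collides: ',   ( $plain_clash ? 'yes' : 'no' ), "\n";
print 'key with $; separator collides: ', ( $sep_clash   ? 'yes' : 'no' ), "\n";
```

This only holds as long as the fields themselves never contain the separator character, which is a safe bet for whitespace-split text.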
"I am starting to loop over very large files" suggests this is a repeating and ongoing exercise.

Given that the file sizes are changing, I took that to mean that the files are different each time. And looking at the regex he uses, it looks likely to be some kind of log or trace file. Hence, it is unlikely to be done many times to any given pair of files.

That said, you're right that given the distinct lack of information in the OP, it does no harm to offer alternatives.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy