in reply to Re: quicker way to merge files?
in thread quicker way to merge files?

So, to get a single file with merged contents, you're advising:

  1. Create a database.
  2. Create two tables.
  3. Load the input files into the tables.
  4. Join the tables to a third.
  5. Dump the third table back to a new file.
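
For concreteness, a minimal sketch of those five steps, assuming SQLite via DBI; the file names, two-column schema and join key are pure guesses, since the actual record format hasn't been shown:

#!/usr/bin/perl
# A sketch only: assumes SQLite via DBI and whitespace-separated records
# joined on their first field. None of these details come from the OP.
use strict;
use warnings;
use DBI;

# 1 & 2. Create a database and two tables.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=merge.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );
$dbh->do('CREATE TABLE a (k TEXT, v TEXT)');
$dbh->do('CREATE TABLE b (k TEXT, v TEXT)');

# 3. Load the input files into the tables.
for ( [ 'file_a' => 'a' ], [ 'file_b' => 'b' ] ) {
    my( $file, $table ) = @$_;
    my $ins = $dbh->prepare("INSERT INTO $table (k, v) VALUES (?, ?)");
    open my $fh, '<', $file or die $!;
    while( <$fh> ) {
        chomp;
        my( $k, $v ) = split ' ', $_, 2;
        $ins->execute( $k, $v );
    }
    close $fh;
}
$dbh->commit;

# 4. Join the tables into a third.
$dbh->do('CREATE TABLE merged AS
          SELECT a.k, a.v AS va, b.v AS vb FROM a JOIN b ON a.k = b.k');
$dbh->commit;

# 5. Dump the third table back out to a new file.
open my $out, '>', 'merged.txt' or die $!;
my $sth = $dbh->prepare('SELECT k, va, vb FROM merged');
$sth->execute;
while( my @row = $sth->fetchrow_array ) {
    print $out "@row\n";
}
close $out;
$dbh->disconnect;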

And you anticipate this will be quicker than just merging the files?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^3: quicker way to merge files?
by DStaal (Chaplain) on May 19, 2010 at 13:45 UTC

    Depending on the situation, yes, it might well be... (Especially if there is no 'smaller file' and loading them into memory will not work.)

    Although you can probably drop the third table, and just write the data out (from a query) instead.
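
    For instance, something along these lines; a sketch only, assuming SQLite via DBI and the hypothetical two-table schema sketched above:

    #!/usr/bin/perl
    # Sketch: skip the third table and stream the join straight to the output
    # file. Table and column names follow the hypothetical schema sketched above.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=merge.db', '', '',
                            { RaiseError => 1 } );

    my $sth = $dbh->prepare('SELECT a.k, a.v, b.v FROM a JOIN b ON a.k = b.k');
    $sth->execute;

    open my $out, '>', 'merged.txt' or die $!;
    while( my @row = $sth->fetchrow_array ) {
        print $out "@row\n";
    }
    close $out;
    $dbh->disconnect;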

      I think that we are all guessing because we don't know the size of the files. I ran a quick test of generating 10 million hash keys on my machine and on another machine; results are below. I suspect that some combination of sorting the files with command-line utilities and/or a hash-table approach will work out fine and that a DB isn't needed. Heck, just keeping one file in memory may be enough! The timing of the initial algorithm would simply skyrocket with files on the order of 10 million lines: if both files are 10 million lines, reading DATA2 10 million times and parsing it each time is gonna take a while! I don't think that the OP's files are that big, given that he can actually get a result in a few days.

      I figure that something far less complex than a DB will work out just fine once the huge order of magnitude problems with the algorithm are addressed.

      #!/usr/bin/perl -w
      use strict;
      use Benchmark;

      timethese( 1, {
          bighashcreate => q{
              my %hash;
              my $max_key = 10000000;    # 10,000,000: 10 million hash keys
              for (my $i = 1; $i <= $max_key; $i++) {
                  $hash{$i} = 1;
              }
          },
      } );

      __END__
      On my wimpy Prescott class machine on Windows XP:

      Benchmark: timing 1 iterations of bighashcreate...
      bighashcreate: 81 wallclock secs (78.03 usr + 1.84 sys = 79.87 CPU) @ 0.01/s (n=1)

      On a server class machine under Linux (running as an average user):

      Benchmark: timing 1 iterations of bighashcreate...
      bighashcreate: 23 wallclock secs (22.21 usr + 0.91 sys = 23.12 CPU) @ 0.04/s (n=1)
Re^3: quicker way to merge files?
by ig (Vicar) on May 19, 2010 at 19:47 UTC

    I hadn't contemplated what you suggest. While it seems unlikely to be optimal for a one-time effort, it appears that it would be quite easy to do, and "days" might be ample time to get something of the sort done. It would almost certainly be faster than the current approach, and it avoids some non-trivial and time-consuming programming that might otherwise be required.

    More compelling is that this appears not to be a one time requirement. "I am starting to loop over very large files" suggests this is a repeating and ongoing exercise. It might be better to change the processes that produce the input data to write it directly to a database as it is produced, avoiding the intermediate files. And there is no mention of what is done with the merged file. It might be better to revise the processes that use the output file to access such a database directly.
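
    Purely as an illustration, and assuming SQLite via DBI (nothing is known about the producing code, its record format or any schema), writing each record to a database as it is produced might look something like this:

    #!/usr/bin/perl
    # Sketch: insert each record into a table as it is produced, instead of
    # appending it to an intermediate file. The producer, the field names and
    # the schema below are all hypothetical.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=results.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS results (k TEXT, v TEXT)');

    my $ins = $dbh->prepare('INSERT INTO results (k, v) VALUES (?, ?)');

    while ( my( $key, $value ) = next_record() ) {    # hypothetical producer
        $ins->execute( $key, $value );
    }
    $dbh->commit;
    $dbh->disconnect;

    sub next_record { return }    # stands in for whatever currently writes the file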

    Along the lines of not re-inventing the wheel, I suggest consideration be given to taking advantage of a well known tool (RDBMS) that appears to be quite relevant to the problem at hand.

    There is not nearly enough information in the post to know what might be best, which is why I only suggested considering it. Your points are also worthy of careful consideration in the broader context of the requirements, though yours is not the only way to use a database in this situation, whatever it is.

      I hadn't contemplated what you suggest. While it seems unlikely to be optimal for a one-time effort, it appears that it would be quite easy to do, and "days" might be ample time to get something of the sort done. It would almost certainly be faster than the current approach, and it avoids some non-trivial and time-consuming programming that might otherwise be required.

      Well, his current algorithm leaves a lot to be desired: O(N²) for very large N.

      The following trivial program processes a 3GB/40e6-line file against a 1% subset in 13:47 minutes; a 2% subset takes 13:51. Hash lookups being what they are, the size of the smaller file doesn't grossly affect the processing time. So, within the bounds of the memory needed to construct the hash, the run time is governed by the size of the bigger file rather than by the size of the lookup set.

      It would be interesting to see how long it would take to perform a similar exercise using an RDBMS. On the basis of my previous attempts, it would take longer than that just to load up one file. Especially as the format of the data is not conducive to bulk loading, you'd have to pre-process it to extract the relevant fields anyway. And by the time you've done that (in Perl, say), the job is, or could be, done.

      #! perl -slw
      use strict;
      $|++;

      my( @f, %lookup );

      # Build a hash of composite keys (fields 0 and 2) from the smaller file.
      open SMALLER, '<', 'syssort.2%' or die $!;
      @f = split(), undef $lookup{ "$f[0]$;$f[2]" } while <SMALLER>;
      close SMALLER;

      # Scan the big file once, keeping only lines whose key is in the hash.
      open BIGGER, '<:perlio', 'syssort' or die $!;
      open OUT, '>', 'out' or die $!;

      while( <BIGGER> ) {
          printf "\r$." unless $. % 1000;    # progress indicator
          my @f = split;
          print OUT "$f[0] $f[2] $f[5]" if exists $lookup{ "$f[0]$;$f[2]" };
      }

      close OUT;
      close BIGGER;
      "I am starting to loop over very large files" suggests this is a repeating and ongoing exercise.

      Given that the file sizes are changing, I took that to mean that the files are different each time. And looking at the regex he uses, it looks likely to be some kind of log or trace file. Hence, it is unlikely to be run many times against any given pair of files.

      That said, you're right that given the distinct lack of information in the OP, it does no harm to offer alternatives.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.