in reply to Re^2: quicker way to merge files?
in thread quicker way to merge files?

Depending on the situation, yes, it might well be... (Especially if there is no 'smaller file' and loading them into memory will not work.)

Although you can probably drop the third table, and just write the data out (from a query) instead.
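For example, the merge could look something like the following minimal DBI/SQLite sketch, assuming the two files have already been loaded into tables. The table and column names (file1, file2, id, val1, val2), the database file name, and the tab-separated output are made up for illustration:

    #!/usr/bin/perl
    # Sketch: join two already-loaded tables on a shared key and write
    # the merged rows straight to a file instead of into a third table.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=merge.db", "", "",
        { RaiseError => 1 } );

    open my $out, '>', 'merged.txt' or die "Cannot open merged.txt: $!";

    my $sth = $dbh->prepare(
        "SELECT a.id, a.val1, b.val2
           FROM file1 a JOIN file2 b ON a.id = b.id"
    );
    $sth->execute;

    while ( my @row = $sth->fetchrow_array ) {
        print $out join( "\t", @row ), "\n";   # one merged line per match
    }

    close $out;
    $dbh->disconnect;

Whether that beats a plain in-memory approach depends on the data, but it at least avoids building and then re-reading a third table.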

Re^4: quicker way to merge files?
by Marshall (Canon) on May 19, 2010 at 15:32 UTC
    I think we are guessing because we don't know the size of the files. I ran a quick test of generating 10 million hash keys on my machine and on another machine; results are below. I suspect that sorting the files with command line utilities and/or a hash table approach will work out fine and that a DB isn't needed. Heck, just keeping one file in memory may be enough! The original algorithm's run time would just skyrocket with files on the order of 10 million lines. I mean, if both files are 10 million lines, reading DATA2 10 million times and parsing it each time is going to take a while! I don't think the OP's files are that big, given that he can actually get a result in a few days.

    I figure that something far less complex than a DB will work out just fine once the huge order-of-magnitude problems in the algorithm are addressed.
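
    A minimal sketch of that hash-table approach, assuming the smaller file fits in memory and that the first whitespace-separated column of each line is the join key (the file names and record layout here are illustrative, not the OP's):

    #!/usr/bin/perl
    # Sketch: read the smaller file once into a hash keyed on the join
    # field, then stream the big file a line at a time and print the
    # merged record on a key match.
    use strict;
    use warnings;

    my %small;
    open my $fh1, '<', 'small.txt' or die "Cannot open small.txt: $!";
    while (<$fh1>) {
        chomp;
        my ( $key, $rest ) = split /\s+/, $_, 2;   # first column is the join key
        $small{$key} = $rest;
    }
    close $fh1;

    open my $fh2, '<', 'big.txt' or die "Cannot open big.txt: $!";
    while (<$fh2>) {
        chomp;
        my ( $key, $rest ) = split /\s+/, $_, 2;
        next unless exists $small{$key};           # skip keys not in the small file
        print join( "\t", $key, $small{$key}, $rest ), "\n";
    }
    close $fh2;

    That reads each file exactly once, instead of re-reading the second file for every line of the first.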

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;

    timethese (1,
        {
            bighashcreate => q{
                my %hash;
                my $max_key = 10000000;   # 10,000,000: 10 million hash keys
                for (my $i = 1; $i <= $max_key; $i++)
                {
                    $hash{$i} = 1;
                }
            },
        },
    );
    __END__
    On my wimpy Prescott class machine on Windows XP:
    Benchmark: timing 1 iterations of bighashcreate...
    bighashcreate: 81 wallclock secs (78.03 usr + 1.84 sys = 79.87 CPU) @ 0.01/s (n=1)

    On a server class machine under Linux (running as an average user):
    Benchmark: timing 1 iterations of bighashcreate...
    bighashcreate: 23 wallclock secs (22.21 usr + 0.91 sys = 23.12 CPU) @ 0.04/s (n=1)