in reply to Merge two huge datasets by ID
The first approach would be to load the file that maps IDs to Names into a hash, then read each ID from the second file and extract the associated name.
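A minimal sketch of that hash-based approach, assuming the first file holds whitespace-separated "ID Name" pairs and the second holds one ID per line (the same layout the pack example below expects):

use strict;
use warnings;

my ($file1, $file2) = @ARGV;

# Load the ID => Name mapping into a hash
my %name_of;
open my $fh1, "<", $file1 or die $!;
while (<$fh1>) {
    chomp;
    my ($id, $name) = split /\s+/;
    $name_of{$id} = $name;
}
close $fh1;

# Look up each ID from the second file
open my $fh2, "<", $file2 or die $!;
while (<$fh2>) {
    chomp;
    my $name = defined $name_of{$_} ? $name_of{$_} : "NOT FOUND";
    print "$_ => $name\n";
}
close $fh2;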
If that fails (because the "index" file is really huge and the hash doesn't fit in memory), you can build a database from your index file (DBD::SQLite), or, depending on your data, you can try pack and unpack.
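If you go the database route, here is a rough sketch with DBI and DBD::SQLite, under the same assumptions about the file layouts (the database file "index.db" and table name "id_map" are just placeholders):

use strict;
use warnings;
use DBI;

my ($file1, $file2) = @ARGV;

my $dbh = DBI->connect("dbi:SQLite:dbname=index.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS id_map (id TEXT PRIMARY KEY, name TEXT)");

# Bulk-load the index file inside one transaction
my $ins = $dbh->prepare("INSERT OR REPLACE INTO id_map (id, name) VALUES (?, ?)");
open my $fh1, "<", $file1 or die $!;
while (<$fh1>) {
    chomp;
    my ($id, $name) = split /\s+/;
    $ins->execute($id, $name);
}
close $fh1;
$dbh->commit;

# Resolve each ID from the second file
my $sel = $dbh->prepare("SELECT name FROM id_map WHERE id = ?");
open my $fh2, "<", $file2 or die $!;
while (<$fh2>) {
    chomp;
    $sel->execute($_);
    my ($name) = $sel->fetchrow_array;
    print "$_ => ", (defined $name ? $name : "NOT FOUND"), "\n";
}
close $fh2;
$dbh->disconnect;

Doing all the inserts in a single transaction (AutoCommit => 0 plus one commit at the end) makes a large difference for bulk loading into SQLite, and the PRIMARY KEY gives you an index on id for free.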
If your IDs are numbers and not too sparse and your names are of similar length, you can try something like:
use strict;
use warnings;

my ($file1, $file2) = @ARGV;

my $maxL = 40;        # max name length
my $ids  = 10000000;  # last id

my $bin = "\0" x ($maxL * $ids);

# Create the index
open my $fh1, "<", $file1 or die $!;
while (<$fh1>) {
    chomp;
    my ($id, $name) = split /\s+/;
    substr($bin, $id * $maxL, $maxL, pack("A$maxL", $name));
}
close $fh1;

# Search $ids from file 2
open my $fh2, "<", $file2 or die $!;
while (<$fh2>) {
    chomp;
    my $binval  = substr($bin, $_ * $maxL, $maxL);
    my $valback = unpack("A$maxL", $binval);
    print "$_ => $valback\n";
}
close $fh2;
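Note that with the values above the preallocated string occupies $maxL * $ids = 40 * 10,000,000 bytes, roughly 400 MB, regardless of how many IDs actually occur. In exchange, the fixed-width records let the ID act directly as an offset, so each lookup is a single substr with none of the per-key overhead of a hash.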
If this doesn't help, give us a little more information about your input data (type of "IDs" and "Names", number of rows, etc.).
citromatik