A first approach would be to load the file that maps IDs to names into a hash, then read each ID from the second file and look up the associated name.
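A minimal sketch of the hash approach. It assumes the index file has one "ID name" pair per line, separated by whitespace, and the second file has one ID per line; the sub names load_index and lookup_ids are made up for illustration.

```perl
use strict;
use warnings;

# Build an in-memory ID => name lookup from the index file.
# Assumption: each line is "ID<whitespace>name"; the name may contain spaces.
sub load_index {
    my ($path) = @_;
    my %name_for;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $name) = split ' ', $line, 2;
        next unless defined $name;      # skip blank or malformed lines
        $name_for{$id} = $name;
    }
    close $fh;
    return \%name_for;
}

# Return one "ID => name" string per ID found in the second file.
sub lookup_ids {
    my ($name_for, $path) = @_;
    my @out;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $id = <$fh>) {
        chomp $id;
        next unless length $id;
        my $name = exists $name_for->{$id} ? $name_for->{$id} : '(unknown)';
        push @out, "$id => $name";
    }
    close $fh;
    return @out;
}

# Run as: perl script.pl index_file ids_file
if (@ARGV == 2) {
    my $name_for = load_index($ARGV[0]);
    print "$_\n" for lookup_ids($name_for, $ARGV[1]);
}
```

Only the index file is held in memory; the second file is streamed line by line, so its size does not matter here.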

If that fails (because the "index" file is really huge and the hash doesn't fit in memory), you can build a database from your index file (DBD::SQLite) or, depending on your data, try pack and unpack.
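For the database route, here is a rough sketch using DBI with the DBD::SQLite driver (both from CPAN, so they need to be installed). The table name, column names and the subs build_index_db and name_for_id are assumptions for illustration, as is the "ID name" line format.

```perl
use strict;
use warnings;
use DBI;    # needs the DBD::SQLite driver installed

# Load the index file into an SQLite table keyed by ID.
# Assumption: each line is "ID<whitespace>name".
sub build_index_db {
    my ($index_file, $db_file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$db_file", "", "",
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do("CREATE TABLE IF NOT EXISTS names (id TEXT PRIMARY KEY, name TEXT)");
    my $ins = $dbh->prepare(
        "INSERT OR REPLACE INTO names (id, name) VALUES (?, ?)");

    open my $fh, '<', $index_file or die "Cannot open $index_file: $!";
    $dbh->begin_work;    # a single transaction keeps bulk inserts fast
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $name) = split ' ', $line, 2;
        next unless defined $name;    # skip blank or malformed lines
        $ins->execute($id, $name);
    }
    $dbh->commit;
    close $fh;
    return $dbh;
}

# Look up one ID; returns undef when the ID is not in the table.
sub name_for_id {
    my ($dbh, $id) = @_;
    my ($name) = $dbh->selectrow_array(
        "SELECT name FROM names WHERE id = ?", undef, $id);
    return $name;
}
```

You would call build_index_db once with the index file and a database file name, then call name_for_id for each ID read from the second file. The index lives on disk, so memory use stays small at the cost of slower lookups than a hash.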

If your IDs are numbers and not too sparse and your names are of similar length, you can try something like:

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;

    my $maxL = 40;          # max name length
    my $ids  = 10000000;    # last id
    my $bin  = "\0" x ($maxL * $ids);

    # Create the index
    open my $fh1, "<", $file1 or die $!;
    while (<$fh1>) {
        chomp;
        my ($id, $name) = split /\s+/;
        substr($bin, $id * $maxL, $maxL, pack("A$maxL", $name));
    }
    close $fh1;

    # Search ids from file 2
    open my $fh2, "<", $file2 or die $!;
    while (<$fh2>) {
        chomp;
        my $binval  = substr($bin, $_ * $maxL, $maxL);
        my $valback = unpack("A$maxL", $binval);
        print "$_ => $valback\n";
    }
    close $fh2;

If this doesn't help, give us a little more information about your input data (type of "IDs" and "Names", number of rows, etc.).

citromatik


In reply to Re: Merge two huge datasets by ID by citromatik
in thread Merge two huge datasets by ID by dpangpang
