dpangpang has asked for the wisdom of the Perl Monks concerning the following question:

File #1 has two columns: ID and Name. File #2 has one column: ID. The output should have two columns, ID and Name, for every ID that exists in both File 1 and File 2. Please advise! Thank you

Replies are listed 'Best First'.
Re: Merge two huge datasets by ID
by GrandFather (Saint) on Feb 27, 2009 at 05:27 UTC
Re: Merge two huge datasets by ID
by CountZero (Bishop) on Feb 27, 2009 at 06:48 UTC
    Everything depends on how huge "huge" is.

    As long as there is enough memory in your computer to hold both files in a hash (use the contents of the ID column as the key), it is not "huge".
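The hash approach could be sketched like this (a minimal sketch; the filenames, tab separators, and the sub name `merge_by_id` are made up for illustration):

```perl
use strict;
use warnings;

# Read file1 ("ID<tab>Name") into a hash keyed on ID, then filter
# file2's IDs against it. Returns the matching "ID<tab>Name" lines.
sub merge_by_id {
    my ($file1, $file2) = @_;

    my %name_for;
    open my $f1, '<', $file1 or die "$file1: $!";
    while (<$f1>) {
        chomp;
        my ($id, $name) = split /\t/, $_, 2;
        $name_for{$id} = $name;
    }
    close $f1;

    my @out;
    open my $f2, '<', $file2 or die "$file2: $!";
    while (<$f2>) {
        chomp;
        push @out, "$_\t$name_for{$_}" if exists $name_for{$_};
    }
    close $f2;

    return @out;
}
```

Only File 1 has to fit in memory; File 2 is streamed line by line.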

    For any value above that, you are best advised to use a database, such as SQLite for which exists DBD::SQLite.

    CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Merge two huge datasets by ID
by ikegami (Patriarch) on Feb 27, 2009 at 05:35 UTC
    If the files really are huge, I'd probably use the Unix sort command-line tool to sort both files by ID before attempting to merge them.
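    Once both files are sorted numerically by ID (e.g. with `sort -n`), they can be merged in a single pass with constant memory. A minimal sketch (the sub name `merge_sorted` and the tab-separated layout are assumptions for illustration):

```perl
use strict;
use warnings;

# One-pass merge of two files already sorted numerically by ID.
# file1 holds "ID<tab>Name" lines, file2 holds bare IDs.
# Memory use is constant regardless of file size.
sub merge_sorted {
    my ($file1, $file2) = @_;
    open my $f1, '<', $file1 or die "$file1: $!";
    open my $f2, '<', $file2 or die "$file2: $!";

    my @out;
    my $line = <$f1>;
    my $want = <$f2>;
    while (defined $line and defined $want) {
        chomp(my $w = $want);
        my ($id, $name) = split /\t/, $line, 2;
        chomp $name;
        if    ($id < $w) { $line = <$f1> }    # ID only in file1: skip
        elsif ($id > $w) { $want = <$f2> }    # ID only in file2: skip
        else {                                # match: emit, advance both
            push @out, "$id\t$name";
            $line = <$f1>;
            $want = <$f2>;
        }
    }
    close $_ for $f1, $f2;
    return @out;
}
```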
Re: Merge two huge datasets by ID
by Tanktalus (Canon) on Feb 27, 2009 at 23:21 UTC

    "Merge" ... "datasets" ... "by (something)". Gee, that sounds like SQL.

    select id, name from fileone where id in (select id from filetwo)
    Unfortunately, I'm not having a lot of luck getting SQL::Statement (which underpins DBD::CSV) to accept this. It could be something wrong with my syntax, but this doesn't seem that convoluted yet:

    SQL ERROR: Bad table or column name 'select id from f2' has chars not alphanumeric or underscore!

    The full code I'm trying is:
    #! /usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;csv_sep_char=\t");
    unlink 'out';
    $dbh->do("create table out (id integer, name char(64))");
    $dbh->{csv_tables}{f1}{col_names} = [qw(id name)];
    $dbh->{csv_tables}{f2}{col_names} = [qw(id)];
    my $d = $dbh->selectall_arrayref(
        'select id, name from f1 where id in ( select id from f2 )', {});
    and something like this should work (assuming tab separators), but, as I said, I'm having some minor problems with it. Assuming we get the kinks worked out (it could be my SQL, or it could be SQL::*), I would think this the simplest approach. And if the problem is in SQL::*, then moving your "huge" data to a real database may be wise anyway; that would solve the problem outright, since a real SQL parser/engine would probably handle it ;-)
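    For comparison, the same join against a real database engine could be sketched as follows (assuming DBD::SQLite is installed; the sub name `merge_with_sqlite`, the in-memory database, and the table layout are all choices made for illustration):

    ```perl
    use strict;
    use warnings;
    use DBI;

    # Load both files into an in-memory SQLite database, then let its
    # SQL engine perform the join. file1 holds "ID<tab>Name", file2 bare IDs.
    sub merge_with_sqlite {
        my ($file1, $file2) = @_;
        my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                               { RaiseError => 1, AutoCommit => 1 });
        $dbh->do('CREATE TABLE f1 (id INTEGER PRIMARY KEY, name TEXT)');
        $dbh->do('CREATE TABLE f2 (id INTEGER PRIMARY KEY)');

        my $ins1 = $dbh->prepare('INSERT INTO f1 (id, name) VALUES (?, ?)');
        open my $fh1, '<', $file1 or die "$file1: $!";
        while (<$fh1>) { chomp; $ins1->execute(split /\t/, $_, 2) }
        close $fh1;

        my $ins2 = $dbh->prepare('INSERT INTO f2 (id) VALUES (?)');
        open my $fh2, '<', $file2 or die "$file2: $!";
        while (<$fh2>) { chomp; $ins2->execute($_) }
        close $fh2;

        return $dbh->selectall_arrayref(
            'SELECT f1.id, f1.name FROM f1 JOIN f2 ON f1.id = f2.id
             ORDER BY f1.id');
    }
    ```

    Using a file-backed database instead of `:memory:` lets this scale past available RAM.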

Re: Merge two huge datasets by ID
by leslie (Pilgrim) on Feb 27, 2009 at 09:16 UTC
    use strict;
    use warnings;

    open my $f1, '<', 'File_1' or die "File_1: $!";
    open my $f2, '<', 'File_2' or die "File_2: $!";

    while (my $File_1_line = <$f1>) {
        my @data = split ' ', $File_1_line;
        seek $f2, 0, 0;    # rescan File_2 for every line of File_1
        while (my $File_2_line = <$f2>) {
            chomp $File_2_line;
            print "@data\n" if $data[0] eq $File_2_line;
        }
    }

    Try this; it will help you. (Note that it rescans File_2 for every line of File_1, so it will be slow if the files really are huge.)

Re: Merge two huge datasets by ID
by citromatik (Curate) on Feb 27, 2009 at 09:24 UTC

    A first approach could be to load the file that maps IDs to Names into a hash, then read each ID from the second file and look up the associated name.

    If that fails (because the "index" file is really huge and the hash doesn't fit in memory), you can build a database from your index file (DBD::SQLite), or, depending on your data, you can try pack and unpack.

    If your IDs are numbers and not too sparse and your names are of similar length, you can try something like:

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;

    my $maxL = 40;        # max name length
    my $ids  = 10000000;  # last id
    my $bin  = "\0" x ($maxL * $ids);

    # Create the index
    open my $fh1, "<", $file1 or die $!;
    while (<$fh1>) {
        chomp;
        my ($id, $name) = split /\s+/;
        substr($bin, $id * $maxL, $maxL, pack("A$maxL", $name));
    }
    close $fh1;

    # Search $ids from file 2
    open my $fh2, "<", $file2 or die $!;
    while (<$fh2>) {
        chomp;
        my $binval  = substr($bin, $_ * $maxL, $maxL);
        my $valback = unpack("A$maxL", $binval);
        print "$_ => $valback\n";
    }
    close $fh2;

    If this doesn't help, give us a little more information about your input data (the type of the "IDs" and "Names", the number of rows, etc.)

    citromatik