dpangpang has asked for the wisdom of the Perl Monks concerning the following question:

File #1 has two columns: ID and Name. File #2 has one column: ID. The output should have two columns, ID and Name, for every ID that exists in both File 1 and File 2. Please advise! Thank you

Replies are listed 'Best First'.
Re: Merge two huge datasets by ID
by GrandFather (Saint) on Feb 27, 2009 at 05:27 UTC
Re: Merge two huge datasets by ID
by CountZero (Bishop) on Feb 27, 2009 at 06:48 UTC
    Everything depends on how huge "huge" is.

    As long as there is enough memory in your computer to hold both files in a hash (use the contents of the ID column as the key), it is not "huge".
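The hash approach could be sketched like this (a minimal sketch; the filenames, tab separators, and the sub name `merge_by_id` are made up for illustration):

```perl
use strict;
use warnings;

# Read file1 ("ID<tab>Name") into a hash keyed on ID, then filter
# file2's IDs against it. Returns the matching "ID<tab>Name" lines.
sub merge_by_id {
    my ($file1, $file2) = @_;

    my %name_for;
    open my $f1, '<', $file1 or die "$file1: $!";
    while (<$f1>) {
        chomp;
        my ($id, $name) = split /\t/, $_, 2;
        $name_for{$id} = $name;
    }
    close $f1;

    my @out;
    open my $f2, '<', $file2 or die "$file2: $!";
    while (<$f2>) {
        chomp;
        push @out, "$_\t$name_for{$_}" if exists $name_for{$_};
    }
    close $f2;

    return @out;
}
```

Only File 1 has to fit in memory; File 2 is streamed line by line.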

    For any value above that, you are best advised to use a database, such as SQLite for which exists DBD::SQLite.

    CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Merge two huge datasets by ID
by ikegami (Patriarch) on Feb 27, 2009 at 05:35 UTC
    If the files really are huge, I'd probably use the Unix sort command-line tool to sort both files by ID before attempting to merge them.
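    Once both files are sorted numerically by ID (e.g. with `sort -n`), they can be merged in a single pass with constant memory. A minimal sketch (the sub name `merge_sorted` and the tab-separated layout are assumptions for illustration):

```perl
use strict;
use warnings;

# One-pass merge of two files already sorted numerically by ID.
# file1 holds "ID<tab>Name" lines, file2 holds bare IDs.
# Memory use is constant regardless of file size.
sub merge_sorted {
    my ($file1, $file2) = @_;
    open my $f1, '<', $file1 or die "$file1: $!";
    open my $f2, '<', $file2 or die "$file2: $!";

    my @out;
    my $line = <$f1>;
    my $want = <$f2>;
    while (defined $line and defined $want) {
        chomp(my $w = $want);
        my ($id, $name) = split /\t/, $line, 2;
        chomp $name;
        if    ($id < $w) { $line = <$f1> }    # ID only in file1: skip
        elsif ($id > $w) { $want = <$f2> }    # ID only in file2: skip
        else {                                # match: emit, advance both
            push @out, "$id\t$name";
            $line = <$f1>;
            $want = <$f2>;
        }
    }
    close $_ for $f1, $f2;
    return @out;
}
```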
Re: Merge two huge datasets by ID
by Tanktalus (Canon) on Feb 27, 2009 at 23:21 UTC

    "Merge" ... "datasets" ... "by (something)". Gee, that sounds like SQL.

    select id, name from fileone where id in (select id from filetwo)
    Unfortunately, I'm not having a lot of luck getting SQL::Statement (which underpins DBD::CSV) to accept this. It could be something wrong with my syntax, but this doesn't seem that convoluted yet:

    SQL ERROR: Bad table or column name 'select id from f2' has chars not alphanumeric or underscore!

    The full code I'm trying is:
    #! /usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;csv_sep_char=\t");
    unlink 'out';
    $dbh->do("create table out (id integer, name char(64))");
    $dbh->{csv_tables}{f1}{col_names} = [qw(id name)];
    $dbh->{csv_tables}{f2}{col_names} = [qw(id)];
    my $d = $dbh->selectall_arrayref(
        'select id, name from f1 where id in ( select id from f2 )', {});
    and something like this should work (assuming tab separators), but, as I said, I'm having some minor problems with it. Assuming we get the kinks worked out (it could be my SQL, or it could be SQL::*), I would think this the simplest approach. And if the problem is in SQL::*, then moving your "huge" data to a real database may be wise anyway; that would solve the problem outright, since a real SQL parser/engine would probably handle it ;-)
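    For comparison, the same join against a real database engine could be sketched as follows (assuming DBD::SQLite is installed; the sub name `merge_with_sqlite`, the in-memory database, and the table layout are all choices made for illustration):

    ```perl
    use strict;
    use warnings;
    use DBI;

    # Load both files into an in-memory SQLite database, then let its
    # SQL engine perform the join. file1 holds "ID<tab>Name", file2 bare IDs.
    sub merge_with_sqlite {
        my ($file1, $file2) = @_;
        my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                               { RaiseError => 1, AutoCommit => 1 });
        $dbh->do('CREATE TABLE f1 (id INTEGER PRIMARY KEY, name TEXT)');
        $dbh->do('CREATE TABLE f2 (id INTEGER PRIMARY KEY)');

        my $ins1 = $dbh->prepare('INSERT INTO f1 (id, name) VALUES (?, ?)');
        open my $fh1, '<', $file1 or die "$file1: $!";
        while (<$fh1>) { chomp; $ins1->execute(split /\t/, $_, 2) }
        close $fh1;

        my $ins2 = $dbh->prepare('INSERT INTO f2 (id) VALUES (?)');
        open my $fh2, '<', $file2 or die "$file2: $!";
        while (<$fh2>) { chomp; $ins2->execute($_) }
        close $fh2;

        return $dbh->selectall_arrayref(
            'SELECT f1.id, f1.name FROM f1 JOIN f2 ON f1.id = f2.id
             ORDER BY f1.id');
    }
    ```

    Using a file-backed database instead of `:memory:` lets this scale past available RAM.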

Re: Merge two huge datasets by ID
by leslie (Pilgrim) on Feb 27, 2009 at 09:16 UTC
    use strict;
    use warnings;

    open my $f1, '<', 'File_1' or die "File_1: $!";
    open my $f2, '<', 'File_2' or die "File_2: $!";

    while (my $File_1_line = <$f1>) {
        my @data = split ' ', $File_1_line;
        seek $f2, 0, 0;    # rescan File_2 for every line of File_1
        while (my $File_2_line = <$f2>) {
            chomp $File_2_line;
            print "@data\n" if $data[0] eq $File_2_line;
        }
    }

    Try this; it will help you. (Note that it rescans File_2 for every line of File_1, so it will be slow if the files really are huge.)

Re: Merge two huge datasets by ID
by citromatik (Curate) on Feb 27, 2009 at 09:24 UTC

    A first approach could be to load the file that maps IDs to Names into a hash, then read each ID from the second file and look up the associated name.

    If that fails (because the "index" file is really huge and the hash doesn't fit in memory), you can build a database from your index file (DBD::SQLite), or, depending on your data, you can try pack and unpack.

    If your IDs are numbers and not too sparse and your names are of similar length, you can try something like:

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;

    my $maxL = 40;        # max name length
    my $ids  = 10000000;  # last id
    my $bin  = "\0" x ($maxL * $ids);

    # Create the index
    open my $fh1, "<", $file1 or die $!;
    while (<$fh1>) {
        chomp;
        my ($id, $name) = split /\s+/;
        substr($bin, $id * $maxL, $maxL, pack("A$maxL", $name));
    }
    close $fh1;

    # Search $ids from file 2
    open my $fh2, "<", $file2 or die $!;
    while (<$fh2>) {
        chomp;
        my $binval  = substr($bin, $_ * $maxL, $maxL);
        my $valback = unpack("A$maxL", $binval);
        print "$_ => $valback\n";
    }
    close $fh2;

    If this doesn't help, give us a little more information about your input data (the type of the "IDs" and "Names", the number of rows, etc.)

    citromatik