Data Matching Challenge

expo has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to compare two data sets and pull out the matches based on an id. I could use a little wisdom for this :)

Dataset A
A   "Monday"
B   "Tuesday"
C   "Wednesday"
D   "Thursday"


Dataset B
M    252   212  "Bill"
A    325   908  "Jim"
C    426   907  "Mike"
A    423   383  "Sally"
A    993   421  "Jim"
C    737   432  "Mary"
[download]

The goal would be to merge these together such that redundant ids (first fields) are included and those not present in Dataset A are excluded. So the merging and filtering of the data would look something like this:

A   "Monday"     325   908  "Jim"
A   "Monday"     423   383  "Sally"
A   "Monday"     993   421  "Jim"
C   "Wednesday"     737   432  "Mary"
C   "Wednesday"     737   432  "Mary"
[download]

Now, I could easily iterate through two arrays side by side and do a pattern match BUT the problem is speed. I have enormous amounts of data that I need to mine through so it needs to be pretty fast.

I started making a hash table but you need a unique id which is problematic the keys need to be unique and I am interested redundant matches. I started building a matrix using anonymous arrays but it started getting clumsy and I know there is a more elegant way to do this.

Any ideas or suggestions would be greatly appreciated!! Expo

Comment on Data Matching Challenge Select or Download Code

Replies are listed 'Best First'.
Re: Data Matching Challenge by GrandFather (Saint) on Feb 01, 2007 at 20:59 UTC
Use a hash to check for hits from the smaller data set (A in the sample) and generate a composite list: use strict; use warnings; my @DataA = split /\n/, <<DATAA; A "Monday" B "Tuesday" C "Wednesday" D "Thursday" DATAA my @DataB = split /\n/, <<DATAB; M 252 212 "Bill" A 325 908 "Jim" C 426 907 "Mike" A 423 383 "Sally" A 993 421 "Jim" C 737 432 "Mary" DATAB my %AHits; my @merged; for (@DataA) { my ($key, $day) = split /\s+/; $AHits{$key} = $day; } for (@DataB) { my ($key, @params) = split /\s+/; next unless exists $AHits{$key}; push @merged, [$key, $AHits{$key}, @params]; } print join ("\t", @$_), "\n" for sort {$a->[0] cmp $b->[0]} @merged; [download] Prints: `A "Monday" 325 908 "Jim" A "Monday" 423 383 "Sally" A "Monday" 993 421 "Jim" C "Wednesday" 426 907 "Mike" C "Wednesday" 737 432 "Mary"` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Data Matching Challenge by expo (Initiate) on Feb 02, 2007 at 02:23 UTC
This is very nice, fairly straightforward and clean. I will see if I can implement this approach. Thanks a bunch!! Expo	[reply]
Re: Data Matching Challenge by Tanktalus (Canon) on Feb 01, 2007 at 20:26 UTC
This looks like something I'd pull out a bit of DBI and DBD::CSV (don't let the name fool ya - you should be able to use tabs or spaces or whatever as separators, not just commas), and a small bit of SQL for. I'm sure you could do faster by hand, but I'm not sure it would be as bug-free, nor would it be nearly so quick to develop.	[reply]
Re: Data Matching Challenge by jwkrahn (Abbot) on Feb 02, 2007 at 02:10 UTC
Perhaps this will be fast enough for you: #!/usr/bin/perl use warnings; use strict; use Fcntl ':seek'; open DATASET_A, '<', 'File_A' or die "Cannot open 'File_A' $!"; open DATASET_B, '<', 'File_B' or die "Cannot open 'File_B' $!"; my %dataB; my $tell = tell DATASET_B; while ( <DATASET_B> ) { if ( /^(\S+)/ ) { push @{ $dataB{ $1 } }, $tell + length $1; } $tell = tell DATASET_B; } while ( <DATASET_A> ) { chomp; my ( $key ) = /^(\S+)/; if ( exists $dataB{ $key } ) { for my $seek ( @{ $dataB{ $key } } ) { seek DATASET_B, $seek, SEEK_SET; print $_, scalar <DATASET_B>; } } } __END__ [download]	[reply] [d/l]
Re: Data Matching Challenge by roboticus (Chancellor) on Feb 02, 2007 at 03:09 UTC
expo: Yet another way to do it: You might want to sort both files on the key of interest. Then you can scan through the files side-by-side and print the appropriate items--just like a merge sort. I recently responded to a similar question which you might want to look at if you're interested in this method--it has an example program in it that could easily be modified to handle your problem. --roboticus	[reply]
Re^2: Data Matching Challenge by expo (Initiate) on Feb 03, 2007 at 14:44 UTC
Thanks Roboticus, I will take a look!	[reply]
Re: Data Matching Challenge by runrig (Abbot) on Feb 01, 2007 at 20:20 UTC
join Update: Hmm. Not reading carefully enough...join (Unix command line join, not perl function join) could be made to work, but you said you had arrays, though you did say you had enormous amounts of data...	[reply]