Re: Printing the values of unique database records from comparing arrays of records

It seems to me that this is an inefficient way to do it. Wouldn't it be better to build a hash keyed by an MD5 digest of each record in the second set, then, for each record in the first set, compute its MD5 digest and check whether that key exists in the hash? Given a large number of records to match (not to mention a large number of fields in each record), this should speed things up significantly: each record is digested once, so the work grows with the sum of the two set sizes rather than their product. And since you only have to keep the digests in memory, it will also be much easier on memory usage, provided you read each record one at a time as you hash it.

Code to follow...

use strict;
use warnings;
use Digest::MD5 qw(md5);

my (@data, %data, $key, @record);

# Create nested array of data for testing purposes.

while (<DATA>) {
    chomp;
    push @data, [split / /];
}

# Join each record and create hash key from contents.
# Note: You have to include field separators (in this
# case tabs), or you could end up with a situation
# where non-identical records match.

for (@data) {
    $key = md5 join "\t", @$_;
    $data{$key} = 1;
}

# Now you can check any record you want by creating
# a key and seeing if it exists in the hash.

@record = qw/aa aa aa aa aa aa aa aa aa/;
$key = md5 join "\t", @record;
print join(" ", @record), "\n" if !$data{$key};

@record = qw/tt ii mm ee tt hh ee rr ee/;
$key = md5 join "\t", @record;
print join(" ", @record), "\n" if !$data{$key};

# You'll still need to match up the field names,
# and you will of course be looping through the
# other set of records instead of checking one at
# a time, but this should serve as an example of
# how to use hashes to drastically cut down on
# the number of comparisons.

__DATA__
oo nn cc ee uu pp oo nn aa
tt ii mm ee tt hh ee rr ee
ww aa ss aa gg oo bb ll ii
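
For what it's worth, here is a rough sketch of how the full two-set comparison might look with the same idea. The helper names record_key and unique_records and the sample records are made up for illustration, and it assumes both sets are already arrays of array refs with their fields in the same order; treat it as a starting point, not a drop-in solution.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: build a fixed-size key from a record's fields.
# The tab separator keeps ("ab", "c") distinct from ("a", "bc").
sub record_key {
    my @fields = @_;
    return md5(join "\t", @fields);
}

# Hypothetical helper: return the records in @$first that have
# no identical record in @$second.
sub unique_records {
    my ($first, $second) = @_;

    # One pass over the second set; only the 16-byte digests are kept.
    my %seen;
    $seen{ record_key(@$_) } = 1 for @$second;

    # One pass over the first set; each check is a single hash lookup.
    return grep { !exists $seen{ record_key(@$_) } } @$first;
}

# Sample data, invented for illustration.
my @first  = ([qw/aa aa aa/], [qw/tt ii mm/], [qw/ab c x/]);
my @second = ([qw/tt ii mm/], [qw/a bc x/]);

print join(" ", @$_), "\n" for unique_records(\@first, \@second);
```

With the sample data above, this prints the "aa aa aa" and "ab c x" records and skips "tt ii mm". If the second set comes from a file or a database handle, the %seen loop could read one record at a time instead of taking an array, which is where the memory savings would come from.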