Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to compare the records from two sets of data, for differences. I have several fields in the table, one of which is called data_set. In this field, the name of the set of data is noted, which identifies the source of the data. The same data_set value can be present in a number of records in the table.

Firstly, I want to be able to retrieve every record associated with set1 and put them into an array. I want to do the same for set2. (if this is the best way to do it)
I then want to be able to compare each record (whole, not individual fields) in set1_array with every record in set2_array and print the record if the record does not appear in set2. The same process should be done to compare set2 to set1. During these processes the data_set field should be ignored as it will obviously differ.

like this...

TABLE data_set | field1 | field2 | field3 ------------------------------------ set1 | yes | 0 | HELLO set1 | yes | 1 | HELLO set1 | no | 1 | HI set2 | yes | 0 | HELLO set2 | yes | 0 | HELLO set2 | no | 0 | HEY set2 | no | 1 | HELLO set3 | no | 1 | HALO SET1 field1 | field2 | field3 ------------------------- yes | 0 | HELLO yes | 1 | HELLO no | 1 | HI SET2 field1 | field2 | field3 ------------------------- yes | 0 | HELLO yes | 0 | HELLO no | 0 | HEY no | 1 | HELLO

So here, compare (set1, row1) to (set2, row1), (set2, row2), (set2, row3) and (set2, row4), then compare the next row in set1 to every row in set2 and so on...
then go through the exact same process but comparing set2 to set1.
Eventually (set1, row2), (set1, row3), (set2, row3) and (set2, row4) will be printed.

Heres my code:

my ($data_set_01, $data_set_02, $format, @data_cat) = @_; my ($dbh, $sth, $sql_01, $sql_02); my ($row1, $row2); my @ds1_rows = (); my @ds2_rows = (); # Process each table foreach (@data_cat) { #------------------ # Data Set 1 #------------------ # Create the query statment $sql_01 = "SELECT * FROM " . $_; # Insert the where clause to the query. # This narrows the data chosen to the data_set the user is + currently looking at. insert_where_clause(\$sql_01, $where_clause_ds1); # Connect to the database. if (! open_database_connection ( \$dbh, $program_config{database}, $program_config{admin_login}, $program_config{admin_pass} ) ) { # Indicate that a problem occured. return 0; } # Execute the query if (! execute_query(\$dbh, \$sth, $sql_01)) { return 0; } # get all records in table for the data set while ($row1 = $sth->fetchrow_array) { push @ds1_rows, $row1; } # Close the database connection. if (! close_database_connection (\$dbh)) { # Indicate that a problem occured. return 0; } #------------------ # Data Set 2 #------------------ # Create the query statment $sql_02 = "SELECT * FROM " . $_; # Insert the where clause to the query. # This narrows the data chosen to the data_set the user is + currently looking at. insert_where_clause(\$sql_02, $where_clause_ds2); # Connect to the database. if (! open_database_connection ( \$dbh, $program_config{database}, $program_config{admin_login}, $program_config{admin_pass} ) ) { # Indicate that a problem occured. return 0; } # Execute the query if (! execute_query(\$dbh, \$sth, $sql_02)) { return 0; } # get all records in table for the data set while ($row2 = $sth->fetchrow_array) { push @ds2_rows, $row2; } print $ds2_rows[0]."\n"; # Close the database connection. if (! close_database_connection (\$dbh)) { # Indicate that a problem occured. return 0; } #------------------------------------ # Compare rows & output differences #------------------------------------ # compare each row in the first set of data to every row # in the second set of data. If it does not occur, then # they are different, so print. foreach (@ds1_rows) { my $record1 = $_; foreach (@ds2_rows) { my $record2 = $_; # if the record cannot be found if ($record1 ne $record2) { print $record1."\n".$record2."\n\n"; } } } # empty arrays for next table @ds1_rows = (); @ds2_rows = (); }
Originally I thought $sth->fetchrow_array would simply retrieve the whole record (row) and i could work from there and do a couple of nested foreach loops and compare the rows. Then repeat the process for the other way round. It seems to only retrieve the last field value?

Sorry for the very long question but I thought i'd better explain it properly to make sure you guys understood what i meant! Also please note, the db functions used here definately work. Hope there is someone who can point me in the right direction!

thanks, Jason

Replies are listed 'Best First'.
Re: Compare records from different data sets
by jhourcle (Prior) on Mar 10, 2005 at 03:03 UTC

    Perl is ideal for comparing strings. That being said, I would either do the comparisons on strings, or I'd do all of the work in the database, if I wanted to deal with it as records.

    We assume we have a function get_records( $sql ) that returns a single scalar for each record, given the sql string, you'd use something like the following general logic: (you may have to adjust to deal with your database's concatenation operator, and select a deliminator that doesn't occur in your fields)

    my $fields = join "||'|'||", qw( field1, field2, field3); my %set1 = map { $_ => 1 } get_records ( "SELECT $fields FROM table WHERE data_set = 'set1'"); my %set2 = map { $_ => 1 } get_records ( "SELECT $fields FROM table WHERE data_set = 'set2'"); my @set2not1 = grep { ! exists($set1{$_}) } keys %set2; my @set1not2 = grep { ! exists($set2{$_}) } keys %set1;

    You can then split the records back out into their individual fields. But there's no reason to go and load all of the records into memory, when you already have them in the database, which is made specifically to handle dealing with records. If you're using a database that can take subqueries, you can do the relationship in one statement. If you're not, you need some temporary tables or views --

    CREATE VIEW set1 AS SELECT field1,field2,field3 FROM table WHERE data_set = 'set1'; CREATE VIEW set2 AS SELECT field1,field2,field3 FROM table WHERE data_set = 'set2';

    You can then use a non-equijoin in Oracle, or a right join in mySQL, or whatever you have in your particular database to get a list of records that don't have a correlation.

    Oh -- and you might want to look over the documentation for DBI, because fetchrow_array() returns a list for each element, which you're then evaluating in scalar context, and would get the count of the fields for each item. You probably wanted fetchrow_arrayref()

      Thats an awsome response, thanks for your help!