Compare records from different data sets

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to compare the records from two sets of data, for differences. I have several fields in the table, one of which is called data_set. In this field, the name of the set of data is noted, which identifies the source of the data. The same data_set value can be present in a number of records in the table.

Firstly, I want to be able to retrieve every record associated with set1 and put them into an array. I want to do the same for set2. (if this is the best way to do it)
I then want to be able to compare each record (whole, not individual fields) in set1_array with every record in set2_array and print the record if the record does not appear in set2. The same process should be done to compare set2 to set1. During these processes the data_set field should be ignored as it will obviously differ.

like this...

TABLE
data_set | field1 | field2 | field3 
------------------------------------
   set1  |  yes   |   0    | HELLO  
   set1  |  yes   |   1    | HELLO  
   set1  |  no    |   1    | HI    
   set2  |  yes   |   0    | HELLO  
   set2  |  yes   |   0    | HELLO  
   set2  |  no    |   0    | HEY    
   set2  |  no    |   1    | HELLO  
   set3  |  no    |   1    | HALO   

SET1
 field1 | field2 | field3 
-------------------------
  yes   |   0    | HELLO  
  yes   |   1    | HELLO  
  no    |   1    | HI

SET2
 field1 | field2 | field3 
-------------------------
  yes   |   0    | HELLO  
  yes   |   0    | HELLO  
  no    |   0    | HEY    
  no    |   1    | HELLO
[download]

So here, compare (set1, row1) to (set2, row1), (set2, row2), (set2, row3) and (set2, row4), then compare the next row in set1 to every row in set2 and so on...
then go through the exact same process but comparing set2 to set1.
Eventually (set1, row2), (set1, row3), (set2, row3) and (set2, row4) will be printed.

Heres my code:

    my ($data_set_01, $data_set_02, $format, @data_cat) = @_;
    my ($dbh, $sth, $sql_01, $sql_02);
    my ($row1, $row2);
    my @ds1_rows = ();
    my @ds2_rows = ();

        # Process each table
        foreach (@data_cat)
        {
            #------------------
            # Data Set 1
            #------------------

            # Create the query statment
            $sql_01 = "SELECT * FROM " . $_;

            # Insert the where clause to the query.
            # This narrows the data chosen to the data_set the user is
+ currently looking at.
            insert_where_clause(\$sql_01, $where_clause_ds1);    

            # Connect to the database.
            if (! open_database_connection 
                    (    \$dbh,
                        $program_config{database},
                        $program_config{admin_login},
                        $program_config{admin_pass}
                    )
                )
            {
                # Indicate that a problem occured.
                return 0;
            }

            # Execute the query
            if (! execute_query(\$dbh, \$sth, $sql_01))
            {
                return 0;
            }

            # get all records in table for the data set
            while ($row1 = $sth->fetchrow_array)
            {
                push @ds1_rows, $row1;
            }

            # Close the database connection.
            if (! close_database_connection (\$dbh))
            {
                # Indicate that a problem occured.
                return 0;
            }

            #------------------
            # Data Set 2
            #------------------

            # Create the query statment
            $sql_02 = "SELECT * FROM " . $_;

            # Insert the where clause to the query.
            # This narrows the data chosen to the data_set the user is
+ currently looking at.
            insert_where_clause(\$sql_02, $where_clause_ds2);    

            # Connect to the database.
            if (! open_database_connection 
                    (    \$dbh,
                        $program_config{database},
                        $program_config{admin_login},
                        $program_config{admin_pass}
                    )
                )
            {
                # Indicate that a problem occured.
                return 0;
            }

            # Execute the query
            if (! execute_query(\$dbh, \$sth, $sql_02))
            {
                return 0;
            }

            # get all records in table for the data set
            while ($row2 = $sth->fetchrow_array)
            {
                push @ds2_rows, $row2;
            }

                print $ds2_rows[0]."\n";

            # Close the database connection.
            if (! close_database_connection (\$dbh))
            {
                # Indicate that a problem occured.
                return 0;
            }

            #------------------------------------
            # Compare rows & output differences
            #------------------------------------

            # compare each row in the first set of data to every row
            # in the second set of data.  If it does not occur, then 
            # they are different, so print.
            foreach (@ds1_rows) 
            {
                my $record1 = $_;

                foreach (@ds2_rows) 
                {
                    my $record2 = $_;
                    
                    # if the record cannot be found
                    if ($record1 ne $record2) 
                    {
                        print $record1."\n".$record2."\n\n";
                    }
                }
            }

            # empty arrays for next table
            @ds1_rows = ();
            @ds2_rows = ();
        }
[download]

Originally I thought $sth->fetchrow_array would simply retrieve the whole record (row) and i could work from there and do a couple of nested foreach loops and compare the rows. Then repeat the process for the other way round. It seems to only retrieve the last field value?

Sorry for the very long question but I thought i'd better explain it properly to make sure you guys understood what i meant! Also please note, the db functions used here definately work. Hope there is someone who can point me in the right direction!

thanks, Jason

Comment on Compare records from different data sets Select or Download Code

Replies are listed 'Best First'.
Re: Compare records from different data sets by jhourcle (Prior) on Mar 10, 2005 at 03:03 UTC
Perl is ideal for comparing strings. That being said, I would either do the comparisons on strings, or I'd do all of the work in the database, if I wanted to deal with it as records. We assume we have a function `get_records( $sql )` that returns a single scalar for each record, given the sql string, you'd use something like the following general logic: (you may have to adjust to deal with your database's concatenation operator, and select a deliminator that doesn't occur in your fields) `my $fields = join "\|\|'\|'\|\|", qw( field1, field2, field3); my %set1 = map { $_ => 1 } get_records ( "SELECT $fields FROM table WHERE data_set = 'set1'"); my %set2 = map { $_ => 1 } get_records ( "SELECT $fields FROM table WHERE data_set = 'set2'"); my @set2not1 = grep { ! exists($set1{$_}) } keys %set2; my @set1not2 = grep { ! exists($set2{$_}) } keys %set1;` [download] You can then split the records back out into their individual fields. But there's no reason to go and load all of the records into memory, when you already have them in the database, which is made specifically to handle dealing with records. If you're using a database that can take subqueries, you can do the relationship in one statement. If you're not, you need some temporary tables or views -- `CREATE VIEW set1 AS SELECT field1,field2,field3 FROM table WHERE data_set = 'set1'; CREATE VIEW set2 AS SELECT field1,field2,field3 FROM table WHERE data_set = 'set2';` [download] You can then use a non-equijoin in Oracle, or a right join in mySQL, or whatever you have in your particular database to get a list of records that don't have a correlation. Oh -- and you might want to look over the documentation for DBI, because `fetchrow_array()` returns a list for each element, which you're then evaluating in scalar context, and would get the count of the fields for each item. You probably wanted `fetchrow_arrayref()`	[reply] [d/l] [select]
Re^2: Compare records from different data sets by Anonymous Monk on Mar 10, 2005 at 14:24 UTC
Thats an awsome response, thanks for your help!	[reply]