dbmathis has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that contains relational information for a database schema that I am working with and the file is in the following format. Each line has two tables and the keys they join on. (table 1 contains key 1 and table 2 contains key 2, table 1 and table 2 join on key 1 and key 2)

Example file table 1 table 2 key 1 key 2 ------- ------- ----- ----- apple bug unid1 unid2 apple cat unid1 unid3 bug dog unid2 unid4 apple dog unid1 unid4

The above is a very generic example of the format i am dealing with and my goal is the create a perl script the will allow me to specify two table names and then the script will output how they join.

Example 1: [dmathis@wildman dmathis]$ ./script apple bug The output: apple a join bug b on a.unid1 = b.unid2
Example 2: [dmathis@wildman dmathis]$ ./script apple dog The output: Option 1: apple a join bug b on a.unid1 = b.unid2 join dog d on b.unid2 = d.unid4 Option 2: apple a join dog d on a.unid1 = d.unid4

As you can see apple and dog don't join directly to each other in one scenario, but instead through bug. This has the potential to become very complex as with a bigger more complex example there would be multiple paths from table to table and I have no idea where to start. I have been brainstorming for a couple of weeks now but have come up fruitless so far.

Does anyone have any suggestions or modules that might help me?

Best Regards

David

After all this is over, all that will really have mattered is how we treated each other.

Replies are listed 'Best First'.
Re: Advise on sql table mapping.
by Corion (Patriarch) on May 17, 2008 at 21:53 UTC

    Basically, you want to find the path connecting two tables. As the total number of tables is likely very finite (I suspect less than 4000, to pull a number out of my ass), and the number of joins likely also is very finite (let's say, each table links to two other tables, so roughly 8000 links) you can use brute force to enumerate all paths between two tables.

    As you likely want to find a shortest connection according to some metric, you will have to enumerate them all and then select one of the shortest paths. There can be obviously more than one shortest path.

    For generic and relatively efficient path finding, see the A_Star algorithm.

    If you want to get ambitious, you can use AI::Prolog or just Prolog directly to let it tell you the links between two tables.

    Of course, you can also program the search directly, but this is untested code I write off my cuff, so caveat emptor:

    use strict; use Data::Dumper; my @links = map { [/^(\w+)\s+(\w+)/] } split /\n/, <<LINKS; apple bug unid1 unid2 apple cat unid1 unid3 bug dog unid2 unid4 apple dog unid1 unid4 LINKS # All links work both ways: my %links; for (@links) { my ($left,$right) = @$_; # We store all links in arrays $links{$left} ||= []; $links{$right} ||= []; # Links work both ways push @{$links{$left}}, $right; push @{$links{$right}}, $left; }; sub find_joins { my ($start, $goal, $prefix, $visited) = @_; $visited ||= {}; # our breadcrumb trail so we don't run around in +circles $prefix ||= []; # the path so far my $curr = $start; warn "[@$prefix] $start => $goal"; my @results; # All the paths that lead from $start to $end $visited->{$start} = 1; for my $link (grep { ! $visited->{$_} } @{$links{$start}}) { if ($link eq $goal) { push @results, [@$prefix,$start,$goal]; } else { my %v = %$visited; push @results, find_joins($link,$goal,[@$prefix,$start],\% +v); }; }; # Sort by length, then asciibetically @results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0] + } @results; return @results; }; print Dumper find_joins( apple => 'bug' ); print Dumper find_joins( apple => 'dog' ); print Dumper find_joins( cat => 'dog' ); __END__ [] apple => bug at tmp.pl line 31. [apple] cat => bug at tmp.pl line 31. [apple] dog => bug at tmp.pl line 31. $VAR1 = [ 'apple', 'bug' ]; $VAR2 = [ 'apple', 'dog', 'bug' ]; [] apple => dog at tmp.pl line 31. [apple] bug => dog at tmp.pl line 31. [apple] cat => dog at tmp.pl line 31. $VAR1 = [ 'apple', 'dog' ]; $VAR2 = [ 'apple', 'bug', 'dog' ]; [] cat => dog at tmp.pl line 31. [cat] apple => dog at tmp.pl line 31. [cat apple] bug => dog at tmp.pl line 31. $VAR1 = [ 'cat', 'apple', 'dog' ]; $VAR2 = [ 'cat', 'apple', 'bug', 'dog' ];

    An interesting invariant that you should maybe check in the code would be that the joins must be symmetric, that is, all paths which are found for getting from $x to $y need also be found when getting from $y to $x. I haven't checked that my brute force solution returns symmetric results.

    Conversion of the output into SQL statements is left as an exercise.

      Ok, so I don't follow the output of:

      print Dumper find_joins( apple => 'dog' ); [] apple => dog at tmp.pl line 31. [apple] bug => dog at tmp.pl line 31. [apple] cat => dog at tmp.pl line 31. $VAR1 = [ 'apple', 'dog' ]; $VAR2 = [ 'apple', 'bug', 'dog' ];


      After all this is over, all that will really have mattered is how we treated each other.

        This is the data structure that is returned from find_joins. The first line seems to be pasted from the source code. The next three lines beginning with [...] are debugging output of find_joins as it progresses its work. See line 31 for that. It outputs the current path, the current start and the current goal. The last lines are the output of Data::Dumper which displays the data structure returned.

        Maybe you can tell me in more detail where you have problems. If it is with complex data structures, maybe perldsc or Quick References Reference help you.

      I have been playing with the code you posted above. It's outputting the correct data but I am not sure why i am seeing some warnings and a few blank line.

      Modified:

      #!/usr/bin/perl -w my @links = map { [/^(\w+)\s+(\w+)/] } split /\n/, <<LINKS; apple bug unid1 unid2 apple cat unid1 unid3 bug dog unid2 unid4 apple dog unid1 unid4 dog owl unid4 unid5 LINKS # All links work both ways: my %links; for (@links) { my ($left,$right) = @$_; # We store all links in arrays $links{$left} ||= []; $links{$right} ||= []; # Links work both ways push @{$links{$left}}, $right; push @{$links{$right}}, $left; } sub find_joins { my ($start, $goal, $prefix, $visited) = @_; $visited ||= {}; # our breadcrumb trail so we dont $prefix ||= []; # the path so far my $curr = $start; # warn "[@$prefix] $start => $goal"; my @results; # All the paths that lead from $start to $end $visited->{$start} = 1; for my $link (grep { ! $visited->{$_} } @{$links{$start}}) { if ($link eq $goal) { push @results, [@$prefix,$start,$goal]; } else { my %v = %$visited; push @results, find_joins($link,$goal,[@$prefix,$start],\% +v); } } # Sort by length, then asciibetically @results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0] + } @results; # print the whole thing with indices for $i ( 0 .. $#results ) { print "@{$results[$i]}\n"; } } find_joins( 'apple' => 'owl' );
      Output:
      dbmathis@dbmathis-linux:~$ ./tmp.pl apple bug dog owl apple dog owl Use of uninitialized value in string comparison (cmp) at ./tmp.pl line + 46. Use of uninitialized value in string comparison (cmp) at ./tmp.pl line + 46. dbmathis@dbmathis-linux:~$
      After all this is over, all that will really have mattered is how we treated each other.

        The print statements will also try out to print stuff when no path can be found. You shouldn't print out stuff within find_joins, you should print out the results of the top-level call to it. Leaving the warn statement active would have helped you see the flow of control better.

        Why did you remove the use strict; line? Adding it again points to a potential error.

      Ok, here is my first attempt to converting your output into SQL. I have not added the table aliases into the code yet, and I am sure you can do this with far less code so let me know what you think.

      I think the function calling itself makes sense now. I assume you are doing this to populate prefix. I may be wrong though.

      #!/usr/bin/perl -w use strict; my @links = map { [/^(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/] } split /\n/, <<L +INKS; apple bug unid1 unid2 apple cat unid1 unid3 bug dog unid2 unid4 apple dog unid1 unid4 dog owl unid4 unid5 LINKS # All links work both ways: my %links; for (@links) { my ($left,$right) = @$_; # We store all links in arrays $links{$left} ||= []; $links{$right} ||= []; # Links work both ways push @{$links{$left}}, $right; push @{$links{$right}}, $left; } sub find_joins { my ($start, $goal, $prefix, $visited) = @_; $visited ||= {}; # our breadcrumb trail so we dont go in circles $prefix ||= []; # the path so far my $curr = $start; #warn "[@$prefix] $start => $goal"; my @results; # All the paths that lead from $start to $end $visited->{$start} = 1; for my $link (grep { ! $visited->{$_} } @{$links{$start}}) { if ($link eq $goal) { push @results, [@$prefix,$start,$goal]; for my $i ( 0 .. $#results ) { for my $ii ( 0 .. $#{$results[$i]} ) { if ( $ii == 0 ) { print "$results[$i][$ii]\n"; } else { my $keys; for ( @links ) { my ( $left, $right, $lid, $rid ) = @$_; if ( $left eq $results[$i][$ii - 1] && $ri +ght eq $results[$i][$ii] ) { $keys = "on $lid = $rid"; } } print "inner join $results[$i][$ii] $keys\n"; } } print "\n"; } } else { my %v = %$visited; push @results, find_joins($link,$goal,[@$prefix,$start],\% +v); } } # Sort by length, then asciibetically @results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0] + } @results; } find_joins( 'apple' => 'owl' );

      apple inner join bug on unid1 = unid2 inner join dog on unid2 = unid4 inner join owl on unid4 = unid5 apple inner join dog on unid1 = unid4 inner join owl on unid4 = unid5
      After all this is over, all that will really have mattered is how we treated each other.

        Exactly - $prefix contains the path taken so far, that's what the recursion does.

        Again, why are you trying to print out the stuff from within find_joins? Please, don't do that. Just don't. Take the results of find_joins and then, convert them to the output format. It's much easier that way instead of trying to make a recursive function output the stuff you want at the end.

        Your way of iterating over arrays is cumbersome, as you never need the array index anywhere:

        for my $i ( 0 .. $#results ) {

        is easier written as

        for my $element ( @results ) {

        But other than that, your code is close once you just take the data structure returned by find_joins and work on that:

        # Build a lookup table for "left_table;right_table" => [ join columns +] my %link_columns = map { $_->[0] . ";" . $_->[1] => [ $_->[2], $_->[3] ], $_->[1] . ";" . $_->[0] => [ $_->[3], $_->[2] ] } @links; my @results = find_joins( 'apple' => 'owl' ); die "No path from 'apple' to 'owl'" unless @results; my $shortest_path = $results[0]; my $start = shift @$shortest_path; print "SELECT * FROM "; print "$start\n"; my $left = $start; for my $table (@$shortest_path) { my $key = "$left;$table"; print "-- $key\n"; my $join_cols = $link_columns{$key}; print " inner join $table on $join_cols->[0] = $join_cols->[1]\n"; $left = $table; };
Re: Advise on sql table mapping.
by CountZero (Bishop) on May 18, 2008 at 17:16 UTC
    I think you are missing som important info in your file: the type of the join.

    Are they all inner equi-joins? Or are some them left or right joins?

    It will make a hell of a difference as you could loose a lot of relevant records by assuming it are all inner equi-joins

    Also as soon as left or right joins are involved the solution is not symmetrical anymore as operations with outer joins (left or right) are not commutative. I'm not sure if they are associative, but a combination of inner and outer joins is certainly not associative.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      At this point left / right joins are not so important, however there will be situations where I will have "join table on a = b and b = c". I think for now I will work with getting a script functioning for basic joins and then move to something more complex.

      I already feel like I have jumped into the deep end without knowing how to swim :)

      After all this is over, all that will really have mattered is how we treated each other.
Re: Advise on sql table mapping.
by EvanCarroll (Chaplain) on May 17, 2008 at 21:55 UTC
    For parsing the format you can use my DataExtract::FixedWidth to help you with this project, I should have the next version up today.
    my $de = DataExtract::FixedWidth->new( cols => ['table 1', 'table 2' , 'key 1', 'key 2'], header_row => 'table 1 table 2 key 1 key 2' )
    The next version would be slightly easier for this format.
    my $de = DataExtract::FixedWidth->new( heuristic => \@lines ); for ( @lines ) { next unless m/\w/; ## skips the definition row of ---- $de->parse_hash( $_ ) }


    Evan Carroll
    I hack for the ladies.
    www.EvanCarroll.com
Re: Advise on sql table mapping.
by dragonchild (Archbishop) on May 18, 2008 at 14:25 UTC
    When I did something like this five years ago, I used Graph and the shortest path algorithms it provides. You can have the vertices be your tables and the edges describe the joins. That will preserve all the information you need.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?