Re: Advise on sql table mapping.
by Corion (Patriarch) on May 17, 2008 at 21:53 UTC
|
Basically, you want to find the path connecting two tables. As the total number of tables is likely very finite (I suspect less than 4000, to pull a number out of my ass), and the number of joins likely also is very finite (let's say, each table links to two other tables, so roughly 8000 links) you can use brute force to enumerate all paths between two tables.
As you likely want to find a shortest connection according to some metric, you will have to enumerate them all and then select one of the shortest paths. There can be obviously more than one shortest path.
For generic and relatively efficient path finding, see the A_Star algorithm.
If you want to get ambitious, you can use AI::Prolog or just Prolog directly to let it tell you the links between two tables.
Of course, you can also program the search directly, but this is untested code I write off my cuff, so caveat emptor:
use strict;
use Data::Dumper;
my @links = map { [/^(\w+)\s+(\w+)/] } split /\n/, <<LINKS;
apple bug unid1 unid2
apple cat unid1 unid3
bug dog unid2 unid4
apple dog unid1 unid4
LINKS
# All links work both ways:
my %links;
for (@links) {
my ($left,$right) = @$_;
# We store all links in arrays
$links{$left} ||= [];
$links{$right} ||= [];
# Links work both ways
push @{$links{$left}}, $right;
push @{$links{$right}}, $left;
};
sub find_joins {
my ($start, $goal, $prefix, $visited) = @_;
$visited ||= {}; # our breadcrumb trail so we don't run around in
+circles
$prefix ||= []; # the path so far
my $curr = $start;
warn "[@$prefix] $start => $goal";
my @results; # All the paths that lead from $start to $end
$visited->{$start} = 1;
for my $link (grep { ! $visited->{$_} } @{$links{$start}}) {
if ($link eq $goal) {
push @results, [@$prefix,$start,$goal];
} else {
my %v = %$visited;
push @results, find_joins($link,$goal,[@$prefix,$start],\%
+v);
};
};
# Sort by length, then asciibetically
@results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0]
+ } @results;
return @results;
};
print Dumper find_joins( apple => 'bug' );
print Dumper find_joins( apple => 'dog' );
print Dumper find_joins( cat => 'dog' );
__END__
[] apple => bug at tmp.pl line 31.
[apple] cat => bug at tmp.pl line 31.
[apple] dog => bug at tmp.pl line 31.
$VAR1 = [
'apple',
'bug'
];
$VAR2 = [
'apple',
'dog',
'bug'
];
[] apple => dog at tmp.pl line 31.
[apple] bug => dog at tmp.pl line 31.
[apple] cat => dog at tmp.pl line 31.
$VAR1 = [
'apple',
'dog'
];
$VAR2 = [
'apple',
'bug',
'dog'
];
[] cat => dog at tmp.pl line 31.
[cat] apple => dog at tmp.pl line 31.
[cat apple] bug => dog at tmp.pl line 31.
$VAR1 = [
'cat',
'apple',
'dog'
];
$VAR2 = [
'cat',
'apple',
'bug',
'dog'
];
An interesting invariant that you should maybe check in the code would be that the joins must be symmetric, that is, all paths which are found for getting from $x to $y need
also be found when getting from $y to $x. I haven't checked that my brute force solution returns symmetric results.
Conversion of the output into SQL statements is left as an exercise. | [reply] [d/l] [select] |
|
|
print Dumper find_joins( apple => 'dog' );
[] apple => dog at tmp.pl line 31.
[apple] bug => dog at tmp.pl line 31.
[apple] cat => dog at tmp.pl line 31.
$VAR1 = [
'apple',
'dog'
];
$VAR2 = [
'apple',
'bug',
'dog'
];
After all this is over, all that will really have mattered is how we treated each other.
| [reply] [d/l] |
|
|
This is the data structure that is returned from find_joins. The first line seems to be pasted from the source code. The next three lines beginning with [...] are debugging output of find_joins as it progresses its work. See line 31 for that. It outputs the current path, the current start and the current goal. The last lines are the output of Data::Dumper which displays the data structure returned. Maybe you can tell me in more detail where you have problems. If it is with complex data structures, maybe perldsc or Quick References Reference help you.
| [reply] [d/l] [select] |
|
|
|
|
|
|
#!/usr/bin/perl -w
my @links = map { [/^(\w+)\s+(\w+)/] } split /\n/, <<LINKS;
apple bug unid1 unid2
apple cat unid1 unid3
bug dog unid2 unid4
apple dog unid1 unid4
dog owl unid4 unid5
LINKS
# All links work both ways:
my %links;
for (@links) {
my ($left,$right) = @$_;
# We store all links in arrays
$links{$left} ||= [];
$links{$right} ||= [];
# Links work both ways
push @{$links{$left}}, $right;
push @{$links{$right}}, $left;
}
sub find_joins {
my ($start, $goal, $prefix, $visited) = @_;
$visited ||= {}; # our breadcrumb trail so we dont
$prefix ||= []; # the path so far
my $curr = $start;
# warn "[@$prefix] $start => $goal";
my @results; # All the paths that lead from $start to $end
$visited->{$start} = 1;
for my $link (grep { ! $visited->{$_} } @{$links{$start}}) {
if ($link eq $goal) {
push @results, [@$prefix,$start,$goal];
} else {
my %v = %$visited;
push @results, find_joins($link,$goal,[@$prefix,$start],\%
+v);
}
}
# Sort by length, then asciibetically
@results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0]
+ } @results;
# print the whole thing with indices
for $i ( 0 .. $#results ) {
print "@{$results[$i]}\n";
}
}
find_joins( 'apple' => 'owl' );
Output:
dbmathis@dbmathis-linux:~$ ./tmp.pl
apple bug dog owl
apple dog owl
Use of uninitialized value in string comparison (cmp) at ./tmp.pl line
+ 46.
Use of uninitialized value in string comparison (cmp) at ./tmp.pl line
+ 46.
dbmathis@dbmathis-linux:~$
After all this is over, all that will really have mattered is how we treated each other.
| [reply] [d/l] [select] |
|
|
The print statements will also try out to print stuff when no path can be found. You shouldn't print out stuff within find_joins, you should print out the results of the top-level call to it. Leaving the warn statement active would have helped you see the flow of control better.
Why did you remove the use strict; line? Adding it again points to a potential error.
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
Ok, here is my first attempt to converting your output into SQL. I have not added the table aliases into the code yet, and I am sure you can do this with far less code so let me know what you think.
I think the function calling itself makes sense now. I assume you are doing this to populate prefix. I may be wrong though.
#!/usr/bin/perl -w
use strict;
my @links = map { [/^(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/] } split /\n/, <<L
+INKS;
apple bug unid1 unid2
apple cat unid1 unid3
bug dog unid2 unid4
apple dog unid1 unid4
dog owl unid4 unid5
LINKS
# All links work both ways:
my %links;
for (@links) {
my ($left,$right) = @$_;
# We store all links in arrays
$links{$left} ||= [];
$links{$right} ||= [];
# Links work both ways
push @{$links{$left}}, $right;
push @{$links{$right}}, $left;
}
sub find_joins {
my ($start, $goal, $prefix, $visited) = @_;
$visited ||= {}; # our breadcrumb trail so we dont go in circles
$prefix ||= []; # the path so far
my $curr = $start;
#warn "[@$prefix] $start => $goal";
my @results; # All the paths that lead from $start to $end
$visited->{$start} = 1;
for my $link (grep { ! $visited->{$_} } @{$links{$start}}) {
if ($link eq $goal) {
push @results, [@$prefix,$start,$goal];
for my $i ( 0 .. $#results ) {
for my $ii ( 0 .. $#{$results[$i]} ) {
if ( $ii == 0 ) {
print "$results[$i][$ii]\n";
} else {
my $keys;
for ( @links ) {
my ( $left, $right, $lid, $rid ) = @$_;
if ( $left eq $results[$i][$ii - 1] && $ri
+ght eq $results[$i][$ii] ) {
$keys = "on $lid = $rid";
}
}
print "inner join $results[$i][$ii] $keys\n";
}
}
print "\n";
}
} else {
my %v = %$visited;
push @results, find_joins($link,$goal,[@$prefix,$start],\%
+v);
}
}
# Sort by length, then asciibetically
@results = sort { scalar @$a <=> scalar @$b or $a->[0] cmp $b->[0]
+ } @results;
}
find_joins( 'apple' => 'owl' );
apple
inner join bug on unid1 = unid2
inner join dog on unid2 = unid4
inner join owl on unid4 = unid5
apple
inner join dog on unid1 = unid4
inner join owl on unid4 = unid5
After all this is over, all that will really have mattered is how we treated each other.
| [reply] [d/l] [select] |
|
|
Exactly - $prefix contains the path taken so far, that's what the recursion does.
Again, why are you trying to print out the stuff from within find_joins? Please, don't do that. Just don't. Take the results of find_joins and then, convert them to the output format. It's much easier that way instead of trying to make a recursive function output the stuff you want at the end.
Your way of iterating over arrays is cumbersome, as you never need the array index anywhere:
for my $i ( 0 .. $#results ) {
is easier written as
for my $element ( @results ) {
But other than that, your code is close once you just take the data structure returned by find_joins and work on that:
# Build a lookup table for "left_table;right_table" => [ join columns
+]
my %link_columns = map {
$_->[0] . ";" . $_->[1] => [ $_->[2], $_->[3] ],
$_->[1] . ";" . $_->[0] => [ $_->[3], $_->[2] ]
} @links;
my @results = find_joins( 'apple' => 'owl' );
die "No path from 'apple' to 'owl'"
unless @results;
my $shortest_path = $results[0];
my $start = shift @$shortest_path;
print "SELECT * FROM ";
print "$start\n";
my $left = $start;
for my $table (@$shortest_path) {
my $key = "$left;$table";
print "-- $key\n";
my $join_cols = $link_columns{$key};
print " inner join $table on $join_cols->[0] = $join_cols->[1]\n";
$left = $table;
};
| [reply] [d/l] [select] |
Re: Advise on sql table mapping.
by CountZero (Bishop) on May 18, 2008 at 17:16 UTC
|
I think you are missing som important info in your file: the type of the join.Are they all inner equi-joins? Or are some them left or right joins? It will make a hell of a difference as you could loose a lot of relevant records by assuming it are all inner equi-joins Also as soon as left or right joins are involved the solution is not symmetrical anymore as operations with outer joins (left or right) are not commutative. I'm not sure if they are associative, but a combination of inner and outer joins is certainly not associative.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] |
|
|
At this point left / right joins are not so important, however there will be situations where I will have "join table on a = b and b = c". I think for now I will work with getting a script functioning for basic joins and then move to something more complex.
I already feel like I have jumped into the deep end without knowing how to swim :)
After all this is over, all that will really have mattered is how we treated each other.
| [reply] |
Re: Advise on sql table mapping.
by EvanCarroll (Chaplain) on May 17, 2008 at 21:55 UTC
|
For parsing the format you can use my DataExtract::FixedWidth to help you with this project, I should have the next version up today.
my $de = DataExtract::FixedWidth->new(
cols => ['table 1', 'table 2' , 'key 1', 'key 2'],
header_row => 'table 1 table 2 key 1 key 2'
)
The next version would be slightly easier for this format.
my $de = DataExtract::FixedWidth->new( heuristic => \@lines );
for ( @lines ) {
next unless m/\w/; ## skips the definition row of ----
$de->parse_hash( $_ )
}
| [reply] [d/l] [select] |
Re: Advise on sql table mapping.
by dragonchild (Archbishop) on May 18, 2008 at 14:25 UTC
|
When I did something like this five years ago, I used Graph and the shortest path algorithms it provides. You can have the vertices be your tables and the edges describe the joins. That will preserve all the information you need.
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
| [reply] |