zing has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have this file
__DATA__ child, Parent, probability, M7, Q, P, M7, M28, E, M28, M6, E, M6, Q, Pl, & several hundred lines.....
Legends: P(= Probable) > Pl(=Plausible) > E(=Equivocal). What I want is for each child I want to trace it back to Q(Query), but I need the shortest path which leads it to the Q(Query) along with their probabilities. For example for the input data shown above the output should be :-
__OUTPUT_1__ M7: M7<-Q = P M28: M28<-M6<-Q = Pl.E M6: M6<-Q = Pl
But as we can see from second row of input data M7 has another longer path tracing to Q : M7<-M28<-M6<-Q = Pl.E.E. But the code should have an option to neglect the largest path and thus show only the shortest path OR to show all of them. i.e.
__OUTPUT_2__ M7: M7<-Q = P M7: M7<-M28<-M6<-Q = Pl.E.E M28: M28<-M6<-Q = Pl.E M6: M6<-Q = Pl
Thus this second output prints a path tracing back to Q for each of the rows, so if we have N input rows to the program , we will have N corresponding output rows.

Replies are listed 'Best First'.
Re: find shortest path for each query from a CSV file
by marinersk (Priest) on Nov 22, 2013 at 06:37 UTC
    I'd recommend Text::CSV_XS for reading the CSV file, though the structure seems reliable enough to simply use split if you wish.

    I'd start by storing the data in a hash of hashes, but perhaps hash of arrays will be needed if multiple children nodes are possible.

    There might be some useful tools in List::Utils, but if not, I'd probably store a sister element with depth as I walked through the tree (more efficient to build that whilst assembling the master tree, I should think).

    From there it is likely to be straight subtraction of the depth values and select the smallest one(s). Would likely expect a tree walk for that part.

    Probably build a reverse path hash on the way in, report on it on the way out.

    For more details, post some code and show what isn't working as expected.

      The code as requested. But its printing only the longest path.

Re: find shortest path for each query from a CSV file
by Tux (Canon) on Nov 22, 2013 at 07:18 UTC

    Read the data with Text::CSV_XS (or Text::CSV) and then model into a Graph where you can use methods like SPT_Dijkstra, SPT_Bellman_Ford, APSP_Floyd_Warshall and their variants.

    Searching CPAN might help too with various solutions to shortest-path problems.


    Enjoy, Have FUN! H.Merijn
Re: find shortest path for each query from a CSV file
by GrandFather (Saint) on Nov 22, 2013 at 11:53 UTC

    The following looks to do the shortest path variant:

    #!/usr/bin/perl use warnings; use strict; my %group = ( # Hash table/dictionary for all the groups 'P' => 'I_1', 'Pl' => 'I_2', 'P.P' => 'I_3', 'P.Pl' => 'I_4', 'Pl.P' => 'I_5', 'Pl.Pl' => 'I_6', 'P.P.P' => 'I_7', 'P.P.Pl' => 'I_8', 'P.Pl.P' => 'I_9', 'P.Pl.Pl' => 'I_10', 'Pl.P.P' => 'I_11', 'Pl.P.Pl' => 'I_12', 'Pl.Pl.P' => 'I_13', 'Pl.Pl.Pl' => 'I_14', 'E' => 'II_15', 'P.E' => 'II_16', 'Pl.E' => 'II_17', 'P.P.E' => 'II_18', 'P.Pl.E' => 'II_19', 'Pl.P.E' => 'II_20', 'Pl.Pl.E' => 'II_21', 'E.P' => 'III_22', 'E.Pl' => 'III_23', 'P.E.P' => 'III_24', 'P.E.Pl' => 'III_25', 'Pl.E.P' => 'III_26', 'Pl.E.Pl' => 'III_27', 'E.P.P' => 'III_28', 'E.P.Pl' => 'III_29', 'E.Pl.P' => 'III_30', 'E.Pl.Pl' => 'III_31', 'E.E' => 'IV_32', 'P.E.E' => 'IV_33', 'Pl.E.E' => 'IV_34', 'E.P.E' => 'IV_35', 'E.Pl.E' => 'IV_36', 'E.E.P' => 'IV_37', 'E.E.Pl' => 'IV_38', 'E.E.E' => 'IV_39', ); <DATA>; # Skip the headers (first row). my %tree; while (<DATA>) { # parse through the input data and fill in our tree data structure chomp; my ($child, $parent, $prob) = split /\t/; if ($child eq 'Q') { push @{$tree{$child}}, {parent => '', prob => $prob, dist => 0 +}; next; } if ($parent eq 'Q') { push @{$tree{$child}}, {parent => $parent, prob => $prob, dist + => 1}; next; } for my $opt (@{$tree{$parent}}) { my $dist = $opt->{dist} + 1; push @{$tree{$child}}, {parent => $parent, prob => $prob, dist => $dist}; } } for my $child (sort {length $a <=> length $b or $a cmp $b} keys %tree) + { my @bestPath = findBestPath($child, \%tree); my $probs = join '.', map {$_->{prob}} @bestPath; printf "%-5s ", "$child:"; # Join the likelihood path. Then if group is found for a likelihoo +d #from the group hash table then print it, else quit print join '<-', $child, grep {$_} map {$_->{parent}} @bestPath; print ", $probs"; print ", $group{$probs}" if exists $group{$probs}; print "\n"; } sub findBestPath { my ($child, $tree) = @_; return $tree->{Q}[0] if $child eq 'Q'; my @alts = sort {$a->{dist} <=> $b->{dist}} @{$tree->{$child}}; return $alts[0], findBestPath($alts[0]{parent}, $tree); } __DATA__ child, Parent, likelihood M7 Q P M54 M7 Pl M213 M54 E M206 M54 E M194 M54 E ...

    Prints (in part):

    Q: Q, E, II_15 M6: M6<-Q, E.E, IV_32 M7: M7<-Q, P.E, II_16 M10: M10<-Q, E.E, IV_32 M13: M13<-M7<-Q, E.P.E, IV_35 M17: M17<-Q, P.E, II_16 M18: M18<-Q, E .E M22: M22<-Q, E.E, IV_32 M23: M23<-Q, E.E, IV_32 M28: M28<-M6<-Q, P.E.E, IV_33 M33: M33<-M28<-M6<-Q, E.P.E.E
    True laziness is hard work
      Sorry but Im getting this error (even though I have tried download link under your code) :-
      Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M7: M7<-Q, P. Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M54: M54<-M7<-Q, Pl.P. Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M194: M194<-M54<-M7<-Q, E.Pl.P. Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M206: M206<-M54<-M7<-Q, E.Pl.P. Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M213: M213<-M54<-M7<-Q, E.Pl.P.

        I truncated the data in the code I posted to reduce the number of uninteresting lines. The '...' is an ellipsis and is used to indicate missing data. If you substitute the data from Re^2: find shortest path for each query from a CSV file the code runs correctly without warnings.

        True laziness is hard work
      Please help Im still getting this error :-
      Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M7: M7<-Q, P. Use of uninitialized value in join or string at check_22nov_metabolite +_pred_2.pl line 74, <DATA> line 6. M54: M54<-M7<-Q, Pl.P.

        Just glancing at the code GrandFather posted I see a couple of spots where term 'prob' appears to be a bare word. At some point should it be a variable? I could be wrong here. There are also three dots in the __DATA__ portion that should probably be deleted too.

      Sorry but theres a problem with the code. For example consider 2nd line of you output. Its giving double probablity in third column (E.E) M6:   M6<-Q, E.E, IV_32 Whereas according to the __DATA__ M6 is coming directly from Q

      M6    Q    E

      Thus correct output should be :-  M6:   M6<-Q, E, II_15

      To give you an intuition this line in data means that the probability of M6 coming from Q is 'E'. That is what I want in third column. If suppose M6 were coming from M76 which in turn comes from Q  M6<-M76<-Q and the input data for these were

      __DATA__ M6 M76 E M76 Q E
      Then in this case M6: M6<-M76<-Q, E.E, IV_32 would have been a correct output.

      So basically the third column is giving wront output, due to which fourth column is also giving incorrect results as it is based on third for its input.

        Sorry, but any problem with the code is now your problem. A trivial examination of the output and thinking about it will tell you why it is as it as. Feel free to correct the code as you see fit.

        True laziness is hard work
Re: find shortest path for each query from a CSV file
by hdb (Monsignor) on Nov 22, 2013 at 10:11 UTC

    I could not resist but I will leave the testing to you. I also suspect you have circles in your data. Being lazy, I have also used some global variables.

    use warnings; no warnings 'recursion'; use strict; <DATA>; # Skip the headers (first row). my %edges; my %kids; while (<DATA>) { chomp; my ($child, $parent, $likelihood) = split /\t/; $edges{$parent}{$child} = $likelihood; $kids{$child} = 1; } my $nkids = scalar keys %kids; my %pathes = ( Q => { path => [ 'Q' ], lh => [] } ); sub shortest_pathes { my( $end, $level ) = @_; return if $level > $nkids; return unless $edges{$end}; my @lh = @{$pathes{$end}{lh}}; my @path = @{$pathes{$end}{path}}; my @pre = keys %{$edges{$end}}; for( @pre ) { if( ( !exists $pathes{$_} ) or ( scalar @{$pathes{$_}{ +path}} > 1 + scalar @path ) ) { $pathes{$_} = { path => [ $_, @path ], lh => [ + $edges{$end}{$_}, @lh ] }; } } shortest_pathes( $_, $level+1 ) for @pre; } shortest_pathes 'Q', 0; delete $pathes{Q}; for (sort {substr($a,1) <=> substr($b,1)} keys %pathes) { print "$_\t"; print join "<-", @{$pathes{$_}{path}}; print "\t"; print join ".", @{$pathes{$_}{lh}}; print "\n"; } __DATA__ child, Parent, likelihood M7 Q P M54 M7 Pl M213 M54 E M206 M54 E M194 M54 E M53 M7 Pl M186 M53 Pl M194 M53 Pl M187 M53 E M204 M53 E M201 M53 E M202 M53 E M179 M53 E M13 M53 E M157 M53 E M173 M53 E M205 M53 E M195 M53 E M196 M53 E M197 M53 E M198 M53 E M57 M7 E M44 M7 E M61 M7 E M13 M7 E M50 M7 E M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M51 M7 E M174 M51 Pl M50 M51 Pl M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M173 M51 Pl M175 M51 E M176 M51 E M62 M7 E M55 M7 E M17 Q P M83 M17 Pl M54 M17 Pl M213 M54 E M206 M54 E M194 M54 E M84 M17 E M101 M17 E M98 M17 E M99 M17 E M76 M17 E M206 M76 P M278 M76 E M289 M76 E M13 M76 E M282 M76 E M283 M76 E M294 M76 E M285 M76 E M13 M17 E M88 M17 E M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M89 M17 E M174 M89 P M88 M89 P M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M326 M89 Pl M333 M89 E M335 M89 E M336 M89 E M283 M89 E M330 M89 E M331 M89 E M332 M89 E M102 M17 E M92 M17 E M93 M17 E M94 M17 E M95 M17 E M22 Q E M6 Q E M28 M6 P M115 M28 Pl M114 M28 Pl M112 M28 E M117 M28 E M113 M28 E M107 M28 E M33 M28 E M50 M28 E M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M51 M28 E M174 M51 Pl M50 M51 Pl M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M173 M51 Pl M175 M51 E M176 M51 E M7 M28 E M54 M7 Pl M213 M54 E M206 M54 E M194 M54 E M53 M7 Pl M186 M53 Pl M194 M53 Pl M187 M53 E M204 M53 E M201 M53 E M202 M53 E M179 M53 E M13 M53 E M157 M53 E M173 M53 E M205 M53 E M195 M53 E M196 M53 E M197 M53 E M198 M53 E M57 M7 E M44 M7 E M61 M7 E M13 M7 E M50 M7 E M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M51 M7 E M174 M51 Pl M50 M51 Pl M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E M160 M50 E M161 M50 E M173 M51 Pl M175 M51 E M176 M51 E M62 M7 E M55 M7 E M36 M6 P M126 M36 Pl M115 M36 Pl M131 M36 E M132 M36 E M127 M36 E M138 M36 E M139 M36 E M120 M36 E M33 M36 E M88 M36 E M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M89 M36 E M174 M89 P M88 M89 P M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M326 M89 Pl M333 M89 E M335 M89 E M336 M89 E M283 M89 E M330 M89 E M331 M89 E M332 M89 E M134 M36 E M135 M36 E M136 M36 E M17 M36 E M83 M17 Pl M54 M17 Pl M213 M54 E M206 M54 E M194 M54 E M84 M17 E M101 M17 E M98 M17 E M99 M17 E M76 M17 E M206 M76 P M278 M76 E M289 M76 E M13 M76 E M282 M76 E M283 M76 E M294 M76 E M285 M76 E M13 M17 E M88 M17 E M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M89 M17 E M174 M89 P M88 M89 P M158 M88 P M306 M88 Pl M307 M88 Pl M317 M88 E M318 M88 E M319 M88 E M320 M88 E M282 M88 E M299 M88 E M311 M88 E M312 M88 E M313 M88 E M314 M88 E M315 M88 E M316 M88 E M326 M89 Pl M333 M89 E M335 M89 E M336 M89 E M283 M89 E M330 M89 E M331 M89 E M332 M89 E M102 M17 E M92 M17 E M93 M17 E M94 M17 E M95 M17 E M34 M6 E M35 M6 E Q M6 E M10 Q E M46 M10 P M216 M46 P M189 M46 P M244 M46 E M246 M46 E M247 M46 E M248 M46 E M249 M46 E M243 M46 E M70 M10 P M257 M70 Pl M216 M70 Pl M267 M70 E M265 M70 E M266 M70 E M250 M70 E M268 M70 E M269 M70 E M270 M70 E M261 M70 E M262 M70 E M263 M70 E M264 M70 E M72 M10 E M75 M10 E M71 M10 E M23 Q E M18 Q E
      Thanks pal. What if I also wanted a 4th column for corresponding groups for each likelihood combination in the output. For example , which would then take values from this hash :-
      my %group = ( # Hash table/dictionary for all the group +s 'P'=> 'I_1', 'Pl'=>'I_2', 'P.P'=>'I_3', 'P.Pl'=>'I_4', 'Pl.P'=>'I_5', 'Pl.Pl'=>'I_6', 'P.P.P'=>'I_7', 'P.P.Pl'=>'I_8', 'P.Pl.P'=>'I_9', 'P.Pl.Pl'=>'I_10', 'Pl.P.P'=>'I_11', 'Pl.P.Pl'=>'I_12', 'Pl.Pl.P'=>'I_13', 'Pl.Pl.Pl'=>'I_14', 'E'=> 'II_15', 'P.E' => 'II_16', 'Pl.E' => 'II_17', 'P.P.E'=>'II_18', 'P.Pl.E'=>'II_19', 'Pl.P.E'=>'II_20', 'Pl.Pl.E'=>'II_21', 'E.P'=> 'III_22', 'E.Pl'=>'III_23', 'P.E.P'=>'III_24', 'P.E.Pl'=>'III_25', 'Pl.E.P'=>'III_26', 'Pl.E.Pl'=>'III_27', 'E.P.P'=>'III_28', 'E.P.Pl'=>'III_29', 'E.Pl.P'=>'III_30', 'E.Pl.Pl'=>'III_31', 'E.E'=>'IV_32', 'P.E.E'=>'IV_33', 'Pl.E.E'=>'IV_34', 'E.P.E'=>'IV_35', 'E.Pl.E'=>'IV_36', 'E.E.P'=>'IV_37', 'E.E.Pl'=>'IV_38', 'E.E.E'=>'IV_39', );
      So if my third column "in the output" has value P.P.P, fourth column will have the group_id (taken from the "group" hash) as I_7, and so on.
Re: find shortest path for each query from a CSV file
by hdb (Monsignor) on Nov 22, 2013 at 07:12 UTC
      Actually that solution is showing only the longest path. I dont know how to tweak it now !!!!!
Re: find shortest path for each query from a CSV file
by LanX (Saint) on Nov 22, 2013 at 23:31 UTC
    Did you try good old Dijkstra's algorithm ?

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: find shortest path for each query from a CSV file
by oiskuu (Hermit) on Nov 23, 2013 at 16:57 UTC
    Probably rather inefficient, but it might do.
    #! /usr/bin/perl my %DEF = ( I => [qw( P Pl P.P P.Pl Pl.P Pl.Pl P.P.P P.P.Pl P.Pl.P P.Pl.Pl Pl. +P.P Pl.P.Pl Pl.Pl.P Pl.Pl.Pl )], II => [qw( E P.E Pl.E P.P.E P.Pl.E Pl.P.E Pl.Pl.E )], III => [qw( E.P E.Pl P.E.P P.E.Pl Pl.E.P Pl.E.Pl E.P.P E.P.Pl E.Pl.P + E.Pl.Pl )], IV => [qw( E.E P.E.E Pl.E.E E.P.E E.Pl.E E.E.P E.E.Pl E.E.E )] ); my @rank = map @$_, @DEF{qw(I II III IV)}; my %rank = map {$rank[$_-1] => $_} 1..@rank; my @group = map {($_) x @{$DEF{$_}}} qw(I II III IV); my %group = map {$rank[$_-1] => $group[$_-1]."_".$_} 1..@group; sub rank { $rank{$a->[2]} <=> $rank{$b->[2]} } my %T; sub oh { map values %$_, @_ } sub ab { my ($b, $a) = @_; [$b->[0], $a->[1], qq($a->[2].$b->[2]), qq($b->[3]<-$a->[3])] } sub xtend { my $a = shift; map {ab $_, $a} oh @{$T{$a->[0]}}{@_} } sub ins { $T{$_[3] //= $_[1]}{$_[2]}{$_[0]} = \@_ } ins split /,\s*/ for <DATA>; ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T); ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T); for (sort {rank} grep {$_->[1] eq 'Q'} (oh oh oh \%T)) { printf "%-4s: %20s, %-8s %6s\n", $_->[0], qq($_->[0]<-$_->[3]), $_->[2], $group{$_->[2]}; } __DATA__ M7, Q, P, M7, M28, E, M28, M6, E, M6, Q, Pl,
      Thanks but your code doesnt print anything on this data
      __DATA__ M7 Q P M54 M7 Pl M213 M54 E M206 M54 E M194 M54 E M53 M7 Pl M186 M53 Pl M194 M53 Pl M187 M53 E M204 M53 E M201 M53 E M202 M53 E M179 M53 E M13 M53 E M157 M53 E M173 M53 E M205 M53 E M195 M53 E M196 M53 E M197 M53 E M198 M53 E M57 M7 E M44 M7 E M61 M7 E M13 M7 E M50 M7 E M158 M50 P M157 M50 P M153 M50 Pl M162 M50 E M164 M50 E M165 M50 E M147 M50 E M159 M50 E