Re: find shortest path for each query from a CSV file

The following appears to handle the shortest-path variant:

#!/usr/bin/perl
use warnings;
use strict;

my %group = (    # Hash table/dictionary for all the groups
    'P'        => 'I_1',
    'Pl'       => 'I_2',
    'P.P'      => 'I_3',
    'P.Pl'     => 'I_4',
    'Pl.P'     => 'I_5',
    'Pl.Pl'    => 'I_6',
    'P.P.P'    => 'I_7',
    'P.P.Pl'   => 'I_8',
    'P.Pl.P'   => 'I_9',
    'P.Pl.Pl'  => 'I_10',
    'Pl.P.P'   => 'I_11',
    'Pl.P.Pl'  => 'I_12',
    'Pl.Pl.P'  => 'I_13',
    'Pl.Pl.Pl' => 'I_14',
    'E'        => 'II_15',
    'P.E'      => 'II_16',
    'Pl.E'     => 'II_17',
    'P.P.E'    => 'II_18',
    'P.Pl.E'   => 'II_19',
    'Pl.P.E'   => 'II_20',
    'Pl.Pl.E'  => 'II_21',
    'E.P'      => 'III_22',
    'E.Pl'     => 'III_23',
    'P.E.P'    => 'III_24',
    'P.E.Pl'   => 'III_25',
    'Pl.E.P'   => 'III_26',
    'Pl.E.Pl'  => 'III_27',
    'E.P.P'    => 'III_28',
    'E.P.Pl'   => 'III_29',
    'E.Pl.P'   => 'III_30',
    'E.Pl.Pl'  => 'III_31',
    'E.E'      => 'IV_32',
    'P.E.E'    => 'IV_33',
    'Pl.E.E'   => 'IV_34',
    'E.P.E'    => 'IV_35',
    'E.Pl.E'   => 'IV_36',
    'E.E.P'    => 'IV_37',
    'E.E.Pl'   => 'IV_38',
    'E.E.E'    => 'IV_39',
);

<DATA>;    # Skip the headers (first row).

my %tree;
while (<DATA>) {
    # parse through the input data and fill in our tree data structure
    chomp;
    my ($child, $parent, $prob) = split /\t/;

    if ($child eq 'Q') {
        push @{$tree{$child}}, {parent => '', prob => $prob, dist => 0};
        next;
    }

    if ($parent eq 'Q') {
        push @{$tree{$child}}, {parent => $parent, prob => $prob, dist => 1};
        next;
    }

    for my $opt (@{$tree{$parent}}) {
        my $dist = $opt->{dist} + 1;
        push @{$tree{$child}},
            {parent => $parent, prob => $prob, dist => $dist};
    }
}

for my $child (sort {length $a <=> length $b or $a cmp $b} keys %tree) {
    my @bestPath = findBestPath($child, \%tree);
    my $probs = join '.', map {$_->{prob}} @bestPath;
    printf "%-5s ", "$child:";

    # Join the likelihood path. If the group hash has an entry for that
    # likelihood string, print the group too; otherwise print nothing more.
    print join '<-', $child, grep {$_} map {$_->{parent}} @bestPath;
    print ", $probs";
    print ", $group{$probs}" if exists $group{$probs};
    print "\n";
}

sub findBestPath {
    my ($child, $tree) = @_;

    return $tree->{Q}[0] if $child eq 'Q';

    my @alts = sort {$a->{dist} <=> $b->{dist}} @{$tree->{$child}};
    return $alts[0], findBestPath($alts[0]{parent}, $tree);
}

__DATA__
child,    Parent,    likelihood
M7    Q    P
M54    M7    Pl
M213    M54    E
M206    M54    E
M194    M54    E
...

Prints (in part):

Q:    Q, E, II_15
M6:   M6<-Q, E.E, IV_32
M7:   M7<-Q, P.E, II_16
M10:  M10<-Q, E.E, IV_32
M13:  M13<-M7<-Q, E.P.E, IV_35
M17:  M17<-Q, P.E, II_16
M18:  M18<-Q, E .E
M22:  M22<-Q, E.E, IV_32
M23:  M23<-Q, E.E, IV_32
M28:  M28<-M6<-Q, P.E.E, IV_33
M33:  M33<-M28<-M6<-Q, E.P.E.E

True laziness is hard work


Replies are listed 'Best First'.
Re^2: find shortest path for each query from a CSV file by zing (Beadle) on Nov 22, 2013 at 12:07 UTC
Sorry, but I'm getting this error (even though I have tried the download link under your code):

Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M7: M7<-Q, P.
Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M54: M54<-M7<-Q, Pl.P.
Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M194: M194<-M54<-M7<-Q, E.Pl.P.
Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M206: M206<-M54<-M7<-Q, E.Pl.P.
Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M213: M213<-M54<-M7<-Q, E.Pl.P.
Re^3: find shortest path for each query from a CSV file by GrandFather (Saint) on Nov 22, 2013 at 23:15 UTC
I truncated the data in the code I posted to reduce the number of uninteresting lines. The '...' is an ellipsis and is used to indicate missing data. If you substitute the data from Re^2: find shortest path for each query from a CSV file, the code runs correctly without warnings.

True laziness is hard work
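The effect of the missing rows can be reproduced in miniature. The sketch below assumes a hand-built %tree with no 'Q' entry, which is what the truncated __DATA__ section yields. find_best_path_guarded is a hypothetical variant of findBestPath, not part of the posted code, that returns an empty list when the root record is absent, so the path no longer ends in an undefined element. That undefined element appears to be the source of both the warning and the trailing dot in the reported output.

```perl
use strict;
use warnings;

# Hypothetical miniature of the posted %tree. There is deliberately no 'Q'
# entry, mirroring the truncated __DATA__ section.
my %tree = (
    M7  => [{parent => 'Q',  prob => 'P',  dist => 1}],
    M54 => [{parent => 'M7', prob => 'Pl', dist => 2}],
);

# Same recursion as findBestPath, but the root case returns an empty list
# when no record for 'Q' exists instead of an undefined element.
sub find_best_path_guarded {
    my ($child, $tree) = @_;

    if ($child eq 'Q') {
        return exists $tree->{Q} ? $tree->{Q}[0] : ();
    }

    my @alts = sort {$a->{dist} <=> $b->{dist}} @{$tree->{$child}};
    return $alts[0], find_best_path_guarded($alts[0]{parent}, $tree);
}

my @path  = find_best_path_guarded('M54', \%tree);
my $probs = join '.', map {$_->{prob}} @path;
print "M54: $probs\n";    # prints "M54: Pl.P"
```

With the full data set the guard is never triggered, so this only matters when the 'Q' row is absent.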
Re^2: find shortest path for each query from a CSV file by Anonymous Monk on Nov 22, 2013 at 19:52 UTC
Please help, I'm still getting this error:

Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M7: M7<-Q, P.
Use of uninitialized value in join or string at check_22nov_metabolite_pred_2.pl line 74, <DATA> line 6.
M54: M54<-M7<-Q, Pl.P.
Re^3: find shortest path for each query from a CSV file by VincentK (Beadle) on Nov 22, 2013 at 21:33 UTC
Just glancing at the code GrandFather posted, I see a couple of spots where the term 'prob' appears to be a bare word. At some point should it be a variable? I could be wrong here. There are also three dots in the __DATA__ portion that should probably be deleted too.
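On the bareword question: Perl's => operator quotes a simple identifier on its left, and a simple identifier inside hash braces is treated as a string key. So prob in the posted code is a hash key, while $prob is the variable holding the value. A minimal sketch, with hypothetical values, that runs cleanly under strict and warnings:

```perl
use strict;
use warnings;

my $prob = 'Pl';    # hypothetical value standing in for a parsed field

# 'parent' and 'prob' on the left of => are quoted automatically, so this
# is the same as ('parent', 'M7', 'prob', $prob).
my %rec = (parent => 'M7', prob => $prob);

print "$rec{prob}\n";      # {prob} is the string key 'prob'
print "$rec{'prob'}\n";    # explicit quoting reads the same element
```

Both print statements output Pl.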
Re^2: find shortest path for each query from a CSV file by zing (Beadle) on Nov 25, 2013 at 06:05 UTC
Sorry, but there's a problem with the code. For example, consider the second line of your output. It gives a doubled probability in the third column (E.E):

M6: M6<-Q, E.E, IV_32

whereas according to the __DATA__, M6 comes directly from Q:

M6 Q E

Thus the correct output should be:

M6: M6<-Q, E, II_15

To give you an intuition, this line in the data means that the probability of M6 coming from Q is 'E'. That is what I want in the third column. Suppose M6 came from M76, which in turn comes from Q (M6<-M76<-Q), and the input data for these were:

__DATA__
M6 M76 E
M76 Q E

Then in this case M6: M6<-M76<-Q, E.E, IV_32 would have been the correct output. So basically the third column is giving the wrong output, and because the fourth column is derived from the third, it is incorrect as well.
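One possible reading of this request is that only the edges of the path should contribute a likelihood, so the record stored for Q itself is left out. The sketch below is an assumption about the intent, not the original author's method. find_edge_path and describe are hypothetical names, and the two small trees are hand-built from the example rows quoted above.

```perl
use strict;
use warnings;

# Hypothetical trees built from the two cases described above.
my %direct  = (M6 => [{parent => 'Q', prob => 'E', dist => 1}]);
my %via_m76 = (
    M76 => [{parent => 'Q',   prob => 'E', dist => 1}],
    M6  => [{parent => 'M76', prob => 'E', dist => 2}],
);

# Like findBestPath, except that reaching 'Q' ends the path without adding
# a record for the root, so the root's own likelihood is never joined in.
sub find_edge_path {
    my ($child, $tree) = @_;
    return if $child eq 'Q';

    my @alts = sort {$a->{dist} <=> $b->{dist}} @{$tree->{$child}};
    return $alts[0], find_edge_path($alts[0]{parent}, $tree);
}

# Format one output line: the node chain, then the joined likelihoods.
sub describe {
    my ($child, $tree) = @_;
    my @path  = find_edge_path($child, $tree);
    my $probs = join '.', map {$_->{prob}} @path;
    return join('<-', $child, map {$_->{parent}} @path) . ", $probs";
}

print describe('M6', \%direct),  "\n";    # M6<-Q, E
print describe('M6', \%via_m76), "\n";    # M6<-M76<-Q, E.E
```

Under this variant a query for Q itself produces an empty path, so that case would need separate handling if it should still be printed.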
Re^3: find shortest path for each query from a CSV file by GrandFather (Saint) on Nov 25, 2013 at 06:29 UTC
Sorry, but any problem with the code is now your problem. A trivial examination of the output and thinking about it will tell you why it is as it is. Feel free to correct the code as you see fit.

True laziness is hard work