zing has asked for the wisdom of the Perl Monks concerning the following question:

I have this code which takes in an input file and prints out shortest path from each child to the parent "Q". Each child has to find a shortest path to "Q" only, if that path exists then print else print blank.

perl short_path.pl input_fie output_file

#! /usr/bin/perl #=======================Processing the input file===================== +===# print "Enter file name:\n"; chomp(my $file = <STDIN>); open(DATA,$file) or die "failed opening file!!"; my %DEF = ( I => [qw( P Pl P.P P.Pl Pl.P Pl.Pl P.P.P P.P.Pl P.Pl.P P.Pl.Pl Pl. +P.P Pl.P.Pl Pl.Pl.P Pl.Pl.Pl )], II => [qw( E P.E Pl.E P.P.E P.Pl.E Pl.P.E Pl.Pl.E )], III => [qw( E.P E.Pl P.E.P P.E.Pl Pl.E.P Pl.E.Pl E.P.P E.P.Pl E.Pl.P + E.Pl.Pl )], IV => [qw( E.E P.E.E Pl.E.E E.P.E E.Pl.E E.E.P E.E.Pl E.E.E )] ); # Hash table/dictionary for all the groups my @rank = map @$_, @DEF{qw(I II III IV)}; my %rank = map {$rank[$_-1] => $_} 1..@rank; my @group = map {($_) x @{$DEF{$_}}} qw(I II III IV); my %group = map {$rank[$_-1] => $group[$_-1]."_".$_} 1..@group; sub rank { $rank{$a->[2]} <=> $rank{$b->[2]} } my %T; sub oh { map values %$_, @_ } sub ab { my ($b, $a) = @_; [$b->[0], $a->[1], qq($a->[2].$b->[2]), qq($b->[3]<-$a->[3])] } sub xtend { my $a = shift; map {ab $_, $a} oh @{$T{$a->[0]}}{@_} } sub ins { $T{$_[3] //= $_[1]}{$_[2]}{$_[0]} = \@_ } ins split /,\s*/ for <DATA>; #ins split /,\s*/ for $filename; ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T); ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T); for (sort {rank} grep {$_->[1] eq 'Q'} (oh oh oh \%T)) { printf "%-4s: %20s, %-8s %6s\n", $_->[0], qq($_->[0]<-$_->[3]), $_->[2], $group{$_->[2]}; }
The program works fine on the undermentioned small sample file.

SAMPLE_INPUT ( 1st column is child, 2nd column is its parent, 3rd column is relation). So for example 1st row means, M19 is child of Q and relation is P.

M19,Q,P, M31,M19,Pl, M420,M31,E, M421,M31,E, M33,M31,E, M438,M33,Pl, M445,M33,E, M437,M33,E, M444,M33,E, M73,M33,E, M552,M73,Pl, M553,M73,Pl, M569,M73,E, M549,M73,E, M550,M73,E, M73,Q,P,
OUTPUT (column 1 represents child, column 2 is shortest path from that child to parent "Q" if it exists, column 3 is the concatenated relation for each individual path, column 4 is the group_id from DEF.)
M19 : M19<-Q, P I_1 M73 : M73<-Q, P I_1 M31 : M31<-M19<-Q, P.Pl I_4 M552: M552<-M73<-Q, P.Pl I_4 M553: M553<-M73<-Q, P.Pl I_4 M549: M549<-M73<-Q, P.E II_16 M569: M569<-M73<-Q, P.E II_16 M550: M550<-M73<-Q, P.E II_16 M33 : M33<-M31<-M19<-Q, P.Pl.E II_19 M420: M420<-M31<-M19<-Q, P.Pl.E II_19 M421: M421<-M31<-M19<-Q, P.Pl.E II_19

But it fails on a input file of say 2000 lines the download link for this file http://qfs.mobi/f1508588

Please help me on this.

Replies are listed 'Best First'.
Re: Perl code for finding shortest path not working on large files
by QM (Parson) on Jul 25, 2014 at 08:39 UTC
    This is dense code. I don't really want to untangle it to discover what it does. (It reminds me a bit of golf code.) Here are some random comments:

    There is 1 almost useful comment:

    # Hash table/dictionary for all the groups

    ...which doesn't explain much.

    There are peculiar things, like:

    sub oh { map values %$_, @_ }

    which pulls the values out of a list of hashrefs. But then is used as:

    ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);

    which assumes that there is a 3-level hash. Why not do all of that in the sub, and pass in a depth, or detect the depth?

    Is the sample input the DATA?

    ins split /,\s*/ for <DATA>;

    How does it fail on the big file? Give a sample output where it's wrong (and right, if it's ever right).

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Yes sample input is read in as <DATA>
Re: Perl code for finding shortest path not working on large files (chomp)
by Athanasius (Cardinal) on Jul 25, 2014 at 12:56 UTC

    Hello zing,

    A little debugging reveals that the difference in behaviour has nothing to do with the length of the two input files, and everything to do with the fact that in the file which works, each line has a comma immediately before the newline, whereas in the file which fails (p17226.csv), the comma is missing. When this terminal comma is omitted, split leaves the newline in the final field. So the call to rank() accesses, e.g., $rank{"E\n"}, which is different to $rank{"E"}, and since the former key does not exist in the hash, the comparison generates a warning and fails to work as desired.

    Like QM, I’m largely in the dark as to what this code is doing and how it is supposed to work. However, by changing this:

    ins split /,\s*/ for <DATA>;

    to this:

    for (<DATA>) { chomp; ins split /,\s*/; }

    I managed to run the script on file p17226.csv (2794 lines) with apparent success.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      ======The much needed explanation of the code====

      Im sorry for the delay but here it is, and hope Im clear enough.

      Consider the input file:
      Eve,BigDaddy,Father John,Eve,Son John,Chang,Son Chang,Eve,Mother
      So in this case consider the first column,we have 3 children : Eve, John, Chang.

      The third column is the relation of second column to first column.

      For each of them we need to find their shortest link to the BigDaddy(or "Q" as I have shown in the sample input file. "Q" is the BigDaddy in the sample input case. So for Eve we already have the shortest path to BigDaddy which is

      Eve :    Eve<-BigDaddy,  Father , Male_relations_1

      While for John, John is related to eve as (Son) and Eve is BigDaddy as (Father). Thus the third column will have their relationships concatenated :

      John:    John<-Eve<-BigDaddy, Son.Father , Male_relations_5

      The fourth column is the Relation id for these concatenated relations. These sets of relation ids will already be given (as you can see in the code its inside the hash %DEF). Eg

      DEF = ( Male_relations => [qw(Father Father.Son Son Brother Son.Father +.....)], Female_relations => [qw(Mother Mother.Son Aunt Aunt.Father ....)])

      The concatenated relation between John and BigDaddy is Son.Father, which is number 5 in the Male_relations, hence we denote Male_relations_5

      Hi Athanasius, I tried your suggestion but Im getting a partial output of only 25 lines (while the input file P17226.csv as provided is 2794 lines long).

      This is what Im getting after trying your suggestion

      Please check and suggest.
      M3<-Q, Pl M5<-Q, Pl M22<-Q, Pl M24<-Q, Pl M12<-Q, Pl M23<-Q, Pl M14<-Q, Pl M15<-Q, Pl M31<-Q, Pl M11<-Q, Pl M25<-Q, Pl M27<-Q, Pl M30<-Q, Pl M33<-Q, E M34<-Q, E M10<-Q, E M32<-Q, E M29<-Q, E M18<-Q, E M7<-Q, E M1<-Q, E M6<-Q, E M28<-Q, E M9<-Q, E M19<-Q, P

        Hello zing,

        When I take your original script, make the change I described, and run it on the file “p17226.csv” (which I downloaded from http://qfs.mobi/f1508588), I get 848 lines of output which begins and ends as follows:

        16:31 >perl orig.pl Enter file name: p17226.csv M19 : M19<-Q, P I_1 M15 : M15<-Q, Pl I_2 M23 : M23<-Q, Pl I_2 M11 : M11<-Q, Pl I_2 M22 : M22<-Q, Pl I_2 M31 : M31<-Q, Pl I_2 M27 : M27<-Q, Pl I_2 M24 : M24<-Q, Pl I_2 M5 : M5<-Q, Pl I_2 M12 : M12<-Q, Pl I_2 ... M567: M567<-M73<-M1<-Q, E.E.E IV_39 M560: M560<-M73<-M1<-Q, E.E.E IV_39 M559: M559<-M73<-M1<-Q, E.E.E IV_39 M561: M561<-M73<-M1<-Q, E.E.E IV_39 M550: M550<-M73<-M1<-Q, E.E.E IV_39 M549: M549<-M73<-M1<-Q, E.E.E IV_39 M569: M569<-M73<-M1<-Q, E.E.E IV_39 M563: M563<-M73<-M1<-Q, E.E.E IV_39 M568: M568<-M73<-M1<-Q, E.E.E IV_39 M562: M562<-M73<-M1<-Q, E.E.E IV_39

        Your output format has changed, have you altered the script since first posting? If not, it looks as though you’re going to have to debug the script yourself, on your own machine, to find out exactly what’s going on. The tutorial perldebtut will help you to get started.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Perl code for finding shortest path not working on large files
by Anonymous Monk on Jul 25, 2014 at 07:40 UTC

    But it fails on a input file of say 2000 lines the download link for this file http://qfs.mobi/f1508588

    Fails how?

      Fails as in the output is only 20 lines of incorrect associations and paths. The input at the link is somewhat 2500 lines and has more than 400 unique childs, so atleast 400 paths is what I expect.

        So, do you think the problem is with reading the data into a datastructure?

        What happens if you add the raw data present in the incorrect associations, and add it to your sample input, does the program produce the wrong associations?

        What if you repeat the sample input" so it adds up to about 2000 lines, does the program produce incorrect associations then?

        There is a bad assumption somewhere in the program, you just have to find it ...

        Maybe you can spot the problem if you dump the vars at various points, dd(\%T) ... knowing what should be in your vars at each point in your program , is how you're going to solve this :) its easier to track with a minimal amount of raw input

        sub dd { use Data::Dumper; print Data::Dumper->new([@_])->Sortkeys(1) ->Indent(1)->Useqq(1)->Dump . "\n"; }