Perl code for finding shortest path not working on large files

zing has asked for the wisdom of the Perl Monks concerning the following question:

I have this code which takes in an input file and prints out shortest path from each child to the parent "Q". Each child has to find a shortest path to "Q" only, if that path exists then print else print blank.

perl short_path.pl input_fie output_file

#! /usr/bin/perl



#=======================Processing the input file=====================
+===#
print "Enter file name:\n";
chomp(my $file = <STDIN>);
open(DATA,$file) or die "failed opening file!!";
my %DEF = (
  I   => [qw( P Pl P.P P.Pl Pl.P Pl.Pl P.P.P P.P.Pl P.Pl.P P.Pl.Pl Pl.
+P.P Pl.P.Pl Pl.Pl.P Pl.Pl.Pl )],
  II  => [qw( E P.E Pl.E P.P.E P.Pl.E Pl.P.E Pl.Pl.E )],
  III => [qw( E.P E.Pl P.E.P P.E.Pl Pl.E.P Pl.E.Pl E.P.P E.P.Pl E.Pl.P
+ E.Pl.Pl )],
  IV  => [qw( E.E P.E.E Pl.E.E E.P.E E.Pl.E E.E.P E.E.Pl E.E.E )]
); # Hash table/dictionary for all the  groups
my @rank = map @$_, @DEF{qw(I II III IV)};
my %rank = map {$rank[$_-1] => $_} 1..@rank;
my @group = map {($_) x @{$DEF{$_}}} qw(I II III IV);
my %group = map {$rank[$_-1] => $group[$_-1]."_".$_} 1..@group;
sub rank { $rank{$a->[2]} <=> $rank{$b->[2]} }

my %T;

sub oh { map values %$_, @_ }
sub ab {
  my ($b, $a) = @_;
  [$b->[0], $a->[1], qq($a->[2].$b->[2]), qq($b->[3]<-$a->[3])]
}
sub xtend {
  my $a = shift;
  map {ab $_, $a} oh @{$T{$a->[0]}}{@_}
}
sub ins { $T{$_[3] //= $_[1]}{$_[2]}{$_[0]} = \@_ }

ins split /,\s*/ for <DATA>;
#ins split /,\s*/ for $filename;
ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);
ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);

for (sort {rank} grep {$_->[1] eq 'Q'} (oh oh oh \%T)) {
  printf "%-4s: %20s,  %-8s %6s\n",
      $_->[0], qq($_->[0]<-$_->[3]), $_->[2], $group{$_->[2]};
}
[download]

The program works fine on the undermentioned small sample file.

SAMPLE_INPUT ( 1st column is child, 2nd column is its parent, 3rd column is relation). So for example 1st row means, M19 is child of Q and relation is P.

M19,Q,P,
M31,M19,Pl,
M420,M31,E,
M421,M31,E,
M33,M31,E,
M438,M33,Pl,
M445,M33,E,
M437,M33,E,
M444,M33,E,
M73,M33,E,
M552,M73,Pl,
M553,M73,Pl,
M569,M73,E,
M549,M73,E,
M550,M73,E,
M73,Q,P,
[download]

OUTPUT (column 1 represents child, column 2 is shortest path from that child to parent "Q" if it exists, column 3 is the concatenated relation for each individual path, column 4 is the group_id from DEF.)

M19 :               M19<-Q,  P           I_1
M73 :               M73<-Q,  P           I_1
M31 :          M31<-M19<-Q,  P.Pl        I_4
M552:         M552<-M73<-Q,  P.Pl        I_4
M553:         M553<-M73<-Q,  P.Pl        I_4
M549:         M549<-M73<-Q,  P.E       II_16
M569:         M569<-M73<-Q,  P.E       II_16
M550:         M550<-M73<-Q,  P.E       II_16
M33 :     M33<-M31<-M19<-Q,  P.Pl.E    II_19
M420:    M420<-M31<-M19<-Q,  P.Pl.E    II_19
M421:    M421<-M31<-M19<-Q,  P.Pl.E    II_19
[download]

But it fails on a input file of say 2000 lines the download link for this file http://qfs.mobi/f1508588

Please help me on this.

Comment on Perl code for finding shortest path not working on large files Select or Download Code

Replies are listed 'Best First'.

Re: Perl code for finding shortest path not working on large files
by QM (Parson) on Jul 25, 2014 at 08:39 UTC

There is 1 almost useful comment:

# Hash table/dictionary for all the  groups
[download]

...which doesn't explain much.

There are peculiar things, like:

sub oh { map values %$_, @_ }
[download]

which pulls the values out of a list of hashrefs. But then is used as:

ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);
[download]

which assumes that there is a 3-level hash. Why not do all of that in the sub, and pass in a depth, or detect the depth?

Is the sample input the DATA?

ins split /,\s*/ for <DATA>;
[download]

How does it fail on the big file? Give a sample output where it's wrong (and right, if it's ever right).

-QM
--
Quantum Mechanics: The dreams stuff is made of

[reply]
[d/l]
[select]

Re^2: Perl code for finding shortest path not working on large files

by zing (Beadle) on Jul 25, 2014 at 11:24 UTC

Yes sample input is read in as <DATA>

[reply]

Re: Perl code for finding shortest path not working on large files (chomp)
by Athanasius (Cardinal) on Jul 25, 2014 at 12:56 UTC

Hello zing,

A little debugging reveals that the difference in behaviour has nothing to do with the length of the two input files, and everything to do with the fact that in the file which works, each line has a comma immediately before the newline, whereas in the file which fails (p17226.csv), the comma is missing. When this terminal comma is omitted, split leaves the newline in the final field. So the call to rank() accesses, e.g., $rank{"E\n"}, which is different to $rank{"E"}, and since the former key does not exist in the hash, the comparison generates a warning and fails to work as desired.

Like QM, I’m largely in the dark as to what this code is doing and how it is supposed to work. However, by changing this:

ins split /,\s*/ for <DATA>;
[download]

to this:

for (<DATA>)
{
    chomp;
    ins split /,\s*/;
}
[download]

I managed to run the script on file p17226.csv (2794 lines) with apparent success.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Perl code for finding shortest path not working on large files (chomp)

by zing (Beadle) on Jul 26, 2014 at 07:54 UTC

Im sorry for the delay but here it is, and hope Im clear enough.

Eve,BigDaddy,Father
John,Eve,Son
John,Chang,Son
Chang,Eve,Mother
[download]

The third column is the relation of second column to first column.

For each of them we need to find their shortest link to the BigDaddy(or "Q" as I have shown in the sample input file. "Q" is the BigDaddy in the sample input case. So for Eve we already have the shortest path to BigDaddy which is

Eve : Eve<-BigDaddy, Father , Male_relations_1

While for John, John is related to eve as (Son) and Eve is BigDaddy as (Father). Thus the third column will have their relationships concatenated :

John: John<-Eve<-BigDaddy, Son.Father , Male_relations_5

The fourth column is the Relation id for these concatenated relations. These sets of relation ids will already be given (as you can see in the code its inside the hash %DEF). Eg

DEF = ( Male_relations => [qw(Father Father.Son Son Brother Son.Father
+.....)],
Female_relations => [qw(Mother Mother.Son Aunt Aunt.Father ....)])
[download]

The concatenated relation between John and BigDaddy is Son.Father, which is number 5 in the Male_relations, hence we denote Male_relations_5

[reply]
[d/l]
[select]

Re^2: Perl code for finding shortest path not working on large files (chomp)

by zing (Beadle) on Jul 28, 2014 at 06:21 UTC

This is what Im getting after trying your suggestion

                     M3<-Q,  Pl
                     M5<-Q,  Pl
                    M22<-Q,  Pl
                    M24<-Q,  Pl
                    M12<-Q,  Pl
                    M23<-Q,  Pl
                    M14<-Q,  Pl
                    M15<-Q,  Pl
                    M31<-Q,  Pl
                    M11<-Q,  Pl
                    M25<-Q,  Pl
                    M27<-Q,  Pl
                    M30<-Q,  Pl
                    M33<-Q,  E
                    M34<-Q,  E
                    M10<-Q,  E
                    M32<-Q,  E
                    M29<-Q,  E
                    M18<-Q,  E
                     M7<-Q,  E
                     M1<-Q,  E
                     M6<-Q,  E
                    M28<-Q,  E
                     M9<-Q,  E
                    M19<-Q,  P
[download]

[reply]
[d/l]

Re^3: Perl code for finding shortest path not working on large files (chomp)

by Athanasius (Cardinal) on Jul 28, 2014 at 06:52 UTC

Hello zing,

When I take your original script, make the change I described, and run it on the file “p17226.csv” (which I downloaded from http://qfs.mobi/f1508588), I get 848 lines of output which begins and ends as follows:

16:31 >perl orig.pl
Enter file name:
p17226.csv
M19 :               M19<-Q,  P           I_1
M15 :               M15<-Q,  Pl          I_2
M23 :               M23<-Q,  Pl          I_2
M11 :               M11<-Q,  Pl          I_2
M22 :               M22<-Q,  Pl          I_2
M31 :               M31<-Q,  Pl          I_2
M27 :               M27<-Q,  Pl          I_2
M24 :               M24<-Q,  Pl          I_2
M5  :                M5<-Q,  Pl          I_2
M12 :               M12<-Q,  Pl          I_2
...
M567:     M567<-M73<-M1<-Q,  E.E.E     IV_39
M560:     M560<-M73<-M1<-Q,  E.E.E     IV_39
M559:     M559<-M73<-M1<-Q,  E.E.E     IV_39
M561:     M561<-M73<-M1<-Q,  E.E.E     IV_39
M550:     M550<-M73<-M1<-Q,  E.E.E     IV_39
M549:     M549<-M73<-M1<-Q,  E.E.E     IV_39
M569:     M569<-M73<-M1<-Q,  E.E.E     IV_39
M563:     M563<-M73<-M1<-Q,  E.E.E     IV_39
M568:     M568<-M73<-M1<-Q,  E.E.E     IV_39
M562:     M562<-M73<-M1<-Q,  E.E.E     IV_39
[download]

Your output format has changed, have you altered the script since first posting? If not, it looks as though you’re going to have to debug the script yourself, on your own machine, to find out exactly what’s going on. The tutorial perldebtut will help you to get started.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]

Re^4: Perl code for finding shortest path not working on large files (chomp)

by zing (Beadle) on Jul 28, 2014 at 10:53 UTC

Re^5: Perl code for finding shortest path not working on large files (chomp)

by QM (Parson) on Jul 28, 2014 at 12:40 UTC

Re^5: Perl code for finding shortest path not working on large files (chomp)

by Athanasius (Cardinal) on Jul 28, 2014 at 13:50 UTC

Re: Perl code for finding shortest path not working on large files
by Anonymous Monk on Jul 25, 2014 at 07:40 UTC

But it fails on a input file of say 2000 lines the download link for this file http://qfs.mobi/f1508588

Fails how?

[reply]

Re^2: finding shortest path not working on large files

by Anonymous Monk on Jul 25, 2014 at 09:10 UTC

FWIW,

Crossposted at http://stackoverflow.com/questions/24949575/perl-code-for-finding-shortest-path-not-working-on-large-files. I cross-linked this cross-postings for maximum collaboration efficiency

original author appears to be oiskuu Re: find shortest path for each query from a CSV file

[reply]

Re^2: Perl code for finding shortest path not working on large files

by zing (Beadle) on Jul 25, 2014 at 08:13 UTC

Fails as in the output is only 20 lines of incorrect associations and paths. The input at the link is somewhat 2500 lines and has more than 400 unique childs, so atleast 400 paths is what I expect.

[reply]

Re^3: Perl code for finding shortest path not working on large files

by Anonymous Monk on Jul 25, 2014 at 08:37 UTC

So, do you think the problem is with reading the data into a datastructure?

What happens if you add the raw data present in the incorrect associations, and add it to your sample input, does the program produce the wrong associations?

What if you repeat the sample input" so it adds up to about 2000 lines, does the program produce incorrect associations then?

There is a bad assumption somewhere in the program, you just have to find it ...

Maybe you can spot the problem if you dump the vars at various points, dd(\%T) ... knowing what should be in your vars at each point in your program , is how you're going to solve this :) its easier to track with a minimal amount of raw input

sub dd {
  use Data::Dumper;
  print Data::Dumper->new([@_])->Sortkeys(1)
    ->Indent(1)->Useqq(1)->Dump . "\n";
}
[download]

[reply]
[d/l]