johnirl has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,
this is not so much a request for a Monk to supply code as a request for a Monk to use his Perl Wisdom to explain some code to me. The code below is supplied with data of the following type:
-,--,--,--,1.280000e+2,9.930000e+0
--,--,--,--,1.920000e+2,9.950000e+0
--,--,--,--,2.560000e+2,1.013000e+1
--,--,--,--,2.000000e+0,4.370000e+0
--,--,--,--,4.000000e+0,5.300000e+0
--,--,--,--,8.000000e+0,6.590000e+0
--,--,--,--,1.600000e+1,7.830000e+0
--,--,--,--,2.400000e+1,8.710000e+0
--,--,--,--,3.200000e+1,9.160000e+0
--,--,--,--,6.400000e+1,9.510000e+0
--,2.000000e+0,6.500000e+0,--,--,--
--,2.000000e+0,6.450000e+0,--,--,--
--,4.000000e+0,6.650000e+0,--,--,--
--,4.000000e+0,6.570000e+0,--,--,--
--,8.000000e+0,6.550000e+0,--,--,--
--,8.000000e+0,6.600000e+0,--,--,--
--,1.600000e+1,6.570000e+0,--,--,--
--,1.600000e+1,6.570000e+0,--,--,--
--,2.400000e+1,6.650000e+0,--,--,--
--,2.400000e+1,6.680000e+0,--,--,--
--,2.400000e+1,6.640000e+0,--,--,--
--,3.200000e+1,6.720000e+0,--,--,--
It then sorts the data into the following format:
2.000000e+0,4.370000e+0,2.000000e+0,6.500000e+0
4.000000e+0,5.300000e+0,4.000000e+0,6.650000e+0
8.000000e+0,6.590000e+0,8.000000e+0,6.550000e+0
1.600000e+1,7.830000e+0,1.600000e+1,6.570000e+0
2.400000e+1,8.710000e+0,2.400000e+1,6.650000e+0
3.200000e+1,9.160000e+0,3.200000e+1,6.720000e+0
1.280000e+2,9.930000e+0,-,-
1.920000e+2,9.950000e+0,--,--
2.560000e+2,1.013000e+1,--,--
6.400000e+1,9.510000e+0,--,--
--,--,2.000000e+0,6.450000e+0
--,--,4.000000e+0,6.570000e+0
--,--,2.400000e+1,6.640000e+0
--,--,8.000000e+0,6.600000e+0
--,--,1.600000e+1,6.570000e+0
--,--,2.400000e+1,6.680000e+0

What it does is match up any row from the first half whose fifth value equals the second value of a row in the second half. Confused? Imagine that the rows beginning with --,--,--,-- are the first half and the rest are the second. So now we have two separate sets. Now ignore every "--". This leaves you with two sets of two columns. The code matches rows on the first column of each set, i.e. above, the fourth row in the first half matched the first row in the second half.
I need to expand on this code so it can handle more data with a variable number of columns, match different columns, etc. However, I don't know how this code works. Hence the wisdom I seek is: how does this code work?

Thanks in advance Monks

#!/usr/bin/perl -w
my @L = ();
my @R = ();
my $file = "< SqlResults_full";
open(DATA, $file) or die "Can\'t open " . $file . " for output : $!";
while(<DATA>) { # BUILD @LIST
    chomp;
    next unless $_;
    my @Line = split ',', $_;
    if($Line[4] ne '--') { # 5th value real?
        push @R, \@Line;
    }
    elsif($Line[1] ne '--') { #2nd value real?
        push @L, \@Line;
    }
}
$\="\n";
print "R ".scalar(@R);
print "L ".scalar(@L);
print "TOTAL LINES ".( @R + @L );
use Data::Dumper;
COMPARE(\@L,\@R);
sub COMPARE {
    my( $L, $R ) = @_;
    my @Ret = ();
    my %L = map { $_ => $_; } 0..$#$L;
    my %R = map { $_ => $_; } 0..$#$R;
    for my $I(0..$#$R ) {
        for my $J(0..$#$L ) {
            if($R->[$I]->[4] eq $L->[$J]->[1]) {
                next unless exists $L{$J};
                delete $L{$J};
                delete $R{$I};
                print join ',', @{ $R->[$I] }[4,5], @{ $L->[$J] }[1,2];
                last;
            }
        }
    }
    print join ',', @{ $R->[$_] }[4,5,0,0] for keys %R;
    print join ',', @{ $L->[$_] }[0,0,1,2] for keys %L;
}

j o h n i r l .

Sum day soon I'Il lern how 2 spelI (nad tYpe)

Re: Enlightenment
by RMGir (Prior) on Aug 28, 2002 at 12:34 UTC
    It's not as complicated as it looks; it would be much easier to understand if COMPARE didn't use $L as an LoL reference, and %L as a hash.

    The main part of the program builds up 2 LOL (lists of lists, see perldsc). @R contains lists where the 5th element isn't --, and @L contains those where the 5th element IS and 2nd element isn't.
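
A minimal sketch of that split, assuming a few hard-coded rows in place of the SqlResults_full file. The split_rows name is made up for illustration and is not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: sort raw CSV lines into two lists of lists,
# using the same tests as the original while loop.
sub split_rows {
    my @lines = @_;
    my (@L, @R);
    for my $line (@lines) {
        next unless $line;               # skip empty lines
        my @fields = split ',', $line;   # a fresh array for every line
        if ($fields[4] ne '--') {        # 5th value real?
            push @R, \@fields;           # store a reference to the row
        }
        elsif ($fields[1] ne '--') {     # 2nd value real?
            push @L, \@fields;
        }
    }
    return (\@L, \@R);
}

my ($L, $R) = split_rows(
    '--,--,--,--,2.000000e+0,4.370000e+0',
    '--,2.000000e+0,6.500000e+0,--,--,--',
    '--,--,--,--,--,--',   # neither test passes, so this row is dropped
);

print scalar(@$R), " row(s) in R, ", scalar(@$L), " row(s) in L\n";
print "R key: $R->[0][4]\n";     # row 0, field index 4
print "L key: $L->[0]->[1]\n";   # same access with the second arrow spelled out
```

Each element of @L and @R is a reference to one row's array of fields, which is why the rest of the script reaches individual fields with two subscripts.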

    Note that there's no error checking for cases where both the 2nd and 5th elements are --, I don't know if that matters to you.

    The 2 LOL's are then passed into the COMPARE routine. By the way, the

    use Data::Dumper;
    looks like a red herring; I don't see anything there that needs that module.

    In COMPARE, $L and $R are references to the LOLs built up in the main routine. %L and %R are hashes with one key for every index in @$L and @$R respectively. The values don't matter, since key existence is all that counts; the hashes just keep a set of the rows that still need to be processed.
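
Here is a small sketch of the "hash as a set of indexes" idea on its own. The @rows and %todo names are stand-ins I made up, not names from the script:

```perl
use strict;
use warnings;

my @rows = ('a', 'b', 'c');   # stand-in for the rows in @$L

# One key per index: 0, 1, 2. The values are never read.
my %todo = map { $_ => 1 } 0 .. $#rows;

delete $todo{1};              # pretend row 1 has just been matched

print "row 1 still pending\n" if exists $todo{1};   # prints nothing

# keys returns the indexes in no particular order, so sort them here
my @pending = sort { $a <=> $b } keys %todo;
print "pending rows: @pending\n";                   # pending rows: 0 2
```

The original code stores the index as the value too (map { $_ => $_; }), but since only exists and delete are ever applied, any value would do.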

    The nested for loops go over every possible combination of items in @$R and @$L, checking if the 5th element (index 4) in the @$R entry matches the 2nd element (index 1) in the @$L entry. In that case, and if that entry in @$L hasn't already been matched (that's the "next unless" part), those entries are deleted from %R and %L (so they don't get printed out later), and the "match" line is printed out. The "last;" call ends the $J loop, and moves processing on to the next item in @$R.
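
The loop logic can be sketched in isolation like this. It is an illustration under my own assumptions: match_rows is a hypothetical helper that returns index pairs instead of printing, and the rows are tiny made-up ones:

```perl
use strict;
use warnings;

# Hypothetical helper: pair each R row with the first unused L row whose
# field index 1 equals the R row's field index 4.
# Returns a list of [R index, L index] pairs.
sub match_rows {
    my ($L, $R) = @_;
    my %free = map { $_ => 1 } 0 .. $#$L;   # L rows not yet matched
    my @pairs;
    for my $i (0 .. $#$R) {
        for my $j (0 .. $#$L) {
            next unless $R->[$i][4] eq $L->[$j][1];   # keys differ
            next unless exists $free{$j};             # L row taken, keep looking
            delete $free{$j};
            push @pairs, [$i, $j];
            last;                                     # done with this R row
        }
    }
    return @pairs;
}

# Made-up rows: two R rows and two L rows share key '2', one R row has key '9'
my @R = ( [ ('--') x 4, '2', 'r0' ], [ ('--') x 4, '2', 'r1' ], [ ('--') x 4, '9', 'r2' ] );
my @L = ( [ '--', '2', 'l0' ], [ '--', '2', 'l1' ] );

print "R$_->[0] pairs with L$_->[1]\n" for match_rows(\@L, \@R);
```

This prints "R0 pairs with L0" and "R1 pairs with L1": the second R row skips the L row that is already taken and pairs with the next one, and the R row with key 9 is left unmatched.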

    After the 2 loops, all the unmatched entries are printed out.
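
Those last two print statements lean on an array slice taken through a reference. A minimal sketch with one hard-coded row (the $row and $line names are mine):

```perl
use strict;
use warnings;

my $row = [ '--', '2.000000e+0', '6.450000e+0', '--', '--', '--' ];

# @{ $row }[...] is an array slice through the reference: it returns the
# listed elements in the order asked for. Index 0 is requested twice to
# supply the two '--' padding columns.
my $line = join ',', @{ $row }[0, 0, 1, 2];

print "$line\n";   # --,--,2.000000e+0,6.450000e+0
```

Because the leftover rows are visited with "for keys %R" and "for keys %L", and keys returns hash keys in no particular order, the unmatched lines are not guaranteed to come out in input order.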

    Is it clearer now, or did I make it worse? :)
    --
    Mike

      Thanks Mike, it's much clearer.
      That really helped, but it's the syntax in the COMPARE subroutine that's giving me trouble. Is there any chance you could explain what each line in COMPARE is doing?
      Sorry about asking you to explain what may be simple code but I'm still learning. :-) Hopefully quickly.

      j o h n i r l .

      Sum day soon I'Il lern how 2 spelI (nad tYpe)

        Sure, I've added the comments in the code here. This will make a lot more sense if you read the perldsc perldoc page first, though.
        # pass in _references_ to the arrays of lists @L and @R
        # each element is the list of fields on a given line
        COMPARE(\@L,\@R);
        sub COMPARE {
            # $L is a reference to @L, $R is a reference to @R
            my( $L, $R ) = @_;
            # @Ret isn't used, it's just here to confuse you :)
            my @Ret = ();
            # %L is a hash with an entry for every index in @L
            # so we can skip lines already matched, and print
            # unmatched lines at the end
            my %L = map { $_ => $_; } 0..$#$L;
            # %R is a hash with an entry for every index in @R
            my %R = map { $_ => $_; } 0..$#$R;
            for my $I(0..$#$R ) {
                for my $J(0..$#$L ) {
                    # if the @R line matches the @L line, based on
                    # the key fields
                    if($R->[$I]->[4] eq $L->[$J]->[1]) {
                        # skip if we've already processed this @L
                        # line
                        next unless exists $L{$J};
                        # delete these lines from %L and %R so we
                        # don't print them at the end
                        delete $L{$J};
                        delete $R{$I};
                        # print fields 5 and 6 of the R line, and
                        # fields 2 and 3 of the L line
                        print join ',', @{ $R->[$I] }[4,5], @{ $L->[$J] }[1,2];
                        # go to next R line; this R line is already
                        # matched
                        last;
                    }
                }
            }
            # print out the unmatched R lines: fields 5 and 6,
            # then field 1 (a '--' placeholder) twice
            print join ',', @{ $R->[$_] }[4,5,0,0] for keys %R;
            # print out the unmatched L lines: field 1 (a '--'
            # placeholder) twice, then fields 2 and 3
            print join ',', @{ $L->[$_] }[0,0,1,2] for keys %L;
        }

        --
        Mike