snafu has asked for the wisdom of the Perl Monks concerning the following question:

From the top!

I have three files, each a single column of digits, with possibly different rows, i.e.:

file1:     file2:     file3:
header     header     header
1          5          1
2          2          3
3          3          3
4          4          4
5          6          5
Remember, these are three separate files. Now, what I need to do is get the values from each row in each file into one list, with each value delimited by a pipe "|" and stripped of the header. BUT WAIT! It gets more cumbersome. Evaluation needs to be done on a line-by-line basis to determine whether the fields match each other; if they don't match, the field for that column/row combo is left blank.
I am really bad at using arrays in Perl. I am working on it though and this exercise is helpful, however, I am in need of a lil nudge and perhaps some suggestions. Here is the code I am using so far:
#!/usr/local/bin/perl -w
use strict;

my (@list1, @list2, @list3, $c);

open(F1,"f1.lst");
open(F2,"f2.lst");
open(F3,"f3.lst");

# Populate the arrays
$c = 0;
while ( <F1> ) {
    next if ( $c == 0 ) and $c++;  # skip the file header
    chomp;
    $list1[$c++] = "$_ |" ;
};
close(F1);

$c = 0;
while ( <F2> ) {
    next if ( $c == 0 ) and $c++;  # skip the file header
    chomp;
    $list2[$c++] = "$_ |" ;
}
close(F2);

$c = 0;
while ( <F3> ) {
    next if ( $c == 0 ) and $c++;  # skip the file header
    chomp;
    $list3[$c++] = "$_ |" ;
}
close(F3);

I have no clue where to begin...
My biggest problem right now is what I want to do may not work. I'd like to populate the arrays and then somehow compare each element in each array to each other on a row by row basis...if a given two elements don't match between two arrays bump the lower element value of the two element values in the comparison into the next elemental slot in its array (subsequently bumping all elements in that array up one).

The output would be similar to this:

final file:
1 |   | 1
2 | 2 |   
3 | 3 | 3
4 | 4 | 4
5 |   | 5
  | 6 |   
The comparison would be like this:

compare list1 row1 to list2 row1 if match go on, if !match (perform element bump here)
compare list2 row1 to list3 row1 if match go on, if !match (perform element bump here)

compare list1 row2 to list2 row2 if match go on, if !match (perform element bump here)
. . . etc

The input has no need to be sorted as the files are already sorted. I believe I am on the right track, but nowhere near the most efficient track, which is my goal. I have been working on code all day and so my mind is fried right now. With my luck, I will think of a solution on the way home. I apologize if this question is not worthy of attention :)

TIA fellow monks.

----------
- Jim

Replies are listed 'Best First'.
Re: Printing out multiple array lists and more!
by bbfu (Curate) on May 17, 2001 at 03:41 UTC

    A couple of things I noticed with your code...

    •     next if ( $c == 0 ) and $c++;  # skip the file header

      The and $c++ will probably never get called. It's certainly not what you meant, which would be more like this:

          next if ( $c++ == 0 ); # skip the file header

    • You have effectively the same while code repeated three times. Such repetition is almost always a sign that you should be writing a subroutine.
    •     $list1[$c++] = "$_ |" ;

      Firstly, the increment on $c should've been done at the start of the loop, although you'd need to adjust it to account for the header line. Better: read in the header line before the loop. Then you can do away with the if statement entirely and move the increment to the bottom of the loop, so that $c is 0 the first time through.

      Secondly, (a minor point) "$_ |" would give you a trailing |, which you probably don't want. Also, I don't see why you need to format the elements as you read them anyway. Why not format them as you display them?

    • Multiple variables with very similar names are usually a sign that you should use an array, even if those variables are arrays themselves. You can use an array-of-arrays... if you can figure them out; they can be rather confusing. Still, it's probably something you should learn about eventually... :-)

    Well, here's my go at it. It may be a bit obtuse, but it works.

    #!/usr/bin/perl -w
    use strict;

    my @master_list = ();

    readfile("f1.lst", \@master_list);
    readfile("f2.lst", \@master_list);
    readfile("f3.lst", \@master_list);

    printf "%s | %s | %s\n", @{$_}[0..2] for(@master_list);

    sub readfile {
      my $filename = shift or die "Need filename.\n";
      my $listref  = shift; # List pointed to is modified in place.

      open my $file, "< $filename" or die "Can't open $filename: $!\n";

      my $header = <$file>;

      my $c = 0;
      local $_;
      while(<$file>) {
        $listref->[$c] ||= []; # use strict doesn't like auto-viv.
        chomp;
        # Compare the new value with the first value stored in the list.
        # First value to be read in for any row is assumed to be
        # correct.  All subsequent values must match that first one.
        unless(@{$listref->[$c]} and $_ != $listref->[$c][0]) {
          push @{$listref->[$c]}, $_;
        } else {
          push @{$listref->[$c]}, ' ';
        }

        ++$c;
      }

      close $file or die "Can't close $filename: $!\n";
    }

    HTH.

    bbfu
    Seasons don't fear The Reaper.
    Nor do the wind, the sun, and the rain.
    We can be like they are.

      First off I would like to say thank you! This code mostly works but I get some uninitialized value errors toward the end and nothing seems to be coming from file2 in the output. I'll show you that in a minute. First, discussion.

      Believe it or not, the next if ( $c == 0 ) and $c++; # skip the file header worked! :) It would go to the next iteration the first time through and then it would increment $c. After that first hit though it was pretty much ignored for the duration of the script. However, I like your solution better. It is much cleaner.

      You are absolutely right about the usage of a subroutine for the repetitive read()'ing. The "$_ |" was intentional. In my example I was trying to get a list in the form "var | var | var" and so I was placing the delimiter in the array with the value for the output to be correct. However, once again, the way you do it is much cleaner.

      Indeed, I need to polish up my knowledge of arrays. Arrays of arrays are something that make my head hurt just mentioning. There are a ton of things I need to work on in Perl but I feel I am catching on fast considering this is my second month of coding in Perl. So, without further ado...I start my questioning for learning process...

      Reference code...

       1  #!/usr/bin/perl -w
       2  use strict;
       3
       4  my @master_list = ();
       5
       6  readfile("f1.lst", \@master_list);
       7  readfile("f2.lst", \@master_list);
       8  readfile("f3.lst", \@master_list);
       9
      10  printf "%s | %s | %s\n", @{$_}[0..2] for(@master_list);
      11
      12  sub readfile {
      13    my $filename = shift or die "Need filename.\n";
      14    my $listref  = shift; # List pointed to is modified in place.
      15
      16    open my $file, "< $filename" or die "Can't open $filename: $!\n";
      17
      18    my $header = <$file>;
      19
      20    my $c = 0;
      21    local $_;
      22    while(<$file>) {
      23      $listref->[$c] ||= []; # use strict doesn't like auto-viv.
      24      chomp;
      25      # Compare the new value with the first value stored in the list.
      26      # First value to be read in for any row is assumed to be
      27      # correct.  All subsequent values must match that first one.
      28      unless(@{$listref->[$c]} and $_ != $listref->[$c][0]) {
      29        push @{$listref->[$c]}, $_;
      30      } else {
      31        push @{$listref->[$c]}, ' ';
      32      }
      33
      34      ++$c;
      35    }
      36
      37    close $file or die "Can't close $filename: $!\n";
      38  }

    • Question 1:
    • In line 18, my $header = <$file>; -- I assume that this strips the header of each file? If so, how?
    • Question 2:
    • In line 21, why did you local'ize $_? What benefit does this give me?
    • Question 3:
    • I understand what you are doing with the readfile() to an extent. One thing I have not quite grasped is the whole (line 23)  $something_here->[$something_else_here] statement. What is that doing?? I know that the ||= is creating a default value ( $this = "$that" ||= "this" ) but I do not understand what the '[ ]' is doing afterward. I don't understand the open brackets by themselves. I know that brackets denote an element in an array but this eludes me. What is auto-viv? :)

    • Question 4:
    • unless(@{$listref->[$c]} and $_ != $listref->[$c][0]) {
          push @{$listref->[$c]}, $_;
      } else {
          push @{$listref->[$c]}, ' ';
      }

      I have a few questions here.
      Ok, I get the unless statement's purpose. I don't get how it works. This is mostly because @{$listref->[$c]} totally loses me. OTOH, I have pieced together that the ... $_ != $listref->[$c][0] is probably what is actually checking to make sure we skip the 0 element (the header) in the file (which makes my $header = <$file>; even more curious). Now the meat! I see push() and pop() all the time. I know the basics of what they do but I have no practical understanding of their usage. I see that you are push()'ing the value from $_ to whatever @{$listref->[$c]} is :) otherwise you make @{$listref->[$c]} = to nothing (my blank if a value does not exist)?

      So, this script is great! I am still struggling to figure what everything is doing but I am going to figure it out. Now, for the part we all hate...debugging.

      When I used the code it looked like it ran beautifully, however, I started getting errors toward the end of the run and something mysteriously eludes me. Let me give you a snapshot of my output:

      1653 |   | 1653
      1654 |   | 1654
      1655 |   | 1655
      1656 |   | 1656
      Use of uninitialized value in printf at try2.pl line 10.
      1657 | 1657 |
      Use of uninitialized value in printf at try2.pl line 10.
      1658 | 1658 |
      Use of uninitialized value in printf at try2.pl line 10.
      1659 | 1659 |
      
      One thing that seems to be wrong here besides the obvious is that there is nothing being returned in the middle of the list (f2.lst). Therefore, for everybody's coding pleasure I am providing the lists to work with. Aren't I nice?! =P I will try and work with the code you have provided as well to see if I can learn something.

      Again, I appreciate your help!

      ----------
      - Jim

        Well, I'm not sure how you got the next if... line to work. On my machine (Perl 5.6.0), it seems to get executed as:

        next if( $c == 0 and $c++ );

        That is, the next never gets executed because it's basically saying if( $c == 0 and $c != 0 ). Frankly, I expected it to execute the next/if part first and short-cut the ++ so that it went into an infinite loop. Regardless, it seems like a very odd piece of code that is not consistent between systems. *shrug*
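
A minimal sketch of the two forms side by side, for anyone who wants to check this on their own perl. The sub names `keep_with_and` and `keep_with_postinc` are made up for illustration. As far as I can tell, the statement modifier `if` takes the whole expression `( $c == 0 ) and $c++` as its condition, so on the first line `$c++` returns 0 (false) while bumping `$c` to 1, and the `next` never fires.

```perl
use strict;
use warnings;

# Original form: parsed as  next if ( ($c == 0) and $c++ );
# The header line is kept because the condition is never true.
sub keep_with_and {
    my @kept;
    my $c = 0;
    for my $line (@_) {
        next if ( $c == 0 ) and $c++;
        push @kept, $line;
    }
    return @kept;
}

# Suggested form: the post-increment returns the old value, so only
# the first line (the header) is skipped.
sub keep_with_postinc {
    my @kept;
    my $c = 0;
    for my $line (@_) {
        next if $c++ == 0;
        push @kept, $line;
    }
    return @kept;
}

my @lines = ( 'header', 1, 2, 3 );
print join( ',', keep_with_and(@lines) ),     "\n";    # header,1,2,3
print join( ',', keep_with_postinc(@lines) ), "\n";    # 1,2,3
```

In the original script the header would therefore end up stored in element 1 of each list rather than being skipped.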

        My point about "$_ |" was that you added the trailing pipe character to the third file's values (ie, the third column) so you probably would've ended up with something more like: "var | var | var |" instead.

        Answer to Question 1: my $header = <$file>; does, in fact, strip off the header line. It does this by, basically, reading in the first line of the file. You obviously are familiar with the diamond operator for reading from a file. It just reads in a single line of the file (when called in scalar context) and returns it. There's special magic when it's used as the condition of a while loop (as in your code, and later in mine) that it automagically stores the returned line in $_. But that's just special. Here, we're just manually storing the returned line into $header.
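
Here is a small sketch of that header-then-loop pattern. It reads from an in-memory string instead of a real file so that it runs anywhere; the sub name `split_header` and the sample data are assumptions for the example only.

```perl
use strict;
use warnings;

# Split file contents into (header, data lines). The "file" is an
# in-memory string, so the sketch needs nothing on disk.
sub split_header {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";

    my $header = <$fh>;             # scalar context: reads exactly one line
    my @rows;
    while ( my $line = <$fh> ) {    # the loop starts at the second line
        chomp $line;
        push @rows, $line;
    }
    close $fh;

    chomp $header if defined $header;
    return ( $header, @rows );
}

my ( $header, @rows ) = split_header("header\n1\n2\n3\n");
print "header=$header rows=@rows\n";    # header=header rows=1 2 3
```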

        The purpose of localizing $_ is so that, if the code is called by someone who happens to be using $_ (such as within a foreach or while loop), we don't go and clobber their value in it. You should almost always do this whenever you use global built-ins in a subroutine like this (unless you're using $_ as the default iterator for a for(?:each)? loop, since for loops automatically localize $_ for ya). It's just good practice.
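
A sketch of what the clobbering looks like in practice, assuming a caller that loops with the default `$_`. Both sub names are invented for the example. Inside a `for` loop `$_` is an alias for the current array element, so a `while(<$fh>)` in a called sub writes straight into the caller's array unless `$_` is localized first.

```perl
use strict;
use warnings;

# Counts lines, but while(<$fh>) assigns to the caller's $_.
sub count_lines_clobbering {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";
    my $n = 0;
    while (<$fh>) { $n++ }
    close $fh;
    return $n;
}

# Same thing, but $_ is localized, so the caller's value is restored
# when the sub returns.
sub count_lines_safely {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";
    local $_;
    my $n = 0;
    while (<$fh>) { $n++ }
    close $fh;
    return $n;
}

my @clobbered = ( "a\nb\n", "c\n" );
count_lines_clobbering($_) for @clobbered;    # $_ aliases each element

my @safe = ( "a\nb\n", "c\n" );
count_lines_safely($_) for @safe;

printf "clobbered: %d still defined, safe: %d still defined\n",
    scalar( grep { defined } @clobbered ),
    scalar( grep { defined } @safe );
```

Using a named lexical (`while ( my $line = <$fh> )`) sidesteps the issue entirely, which is probably the simpler habit.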

        The $var->[$index] notation means that $var is not actually an array but, rather, a reference to an array (kinda like a pointer in C but safer). You have to use the -> notation to dereference $var so that perl knows you mean to use $var instead of @var. You could also use the curly-brace form: ${$var}[$index] which means the same thing. Luckily, it looks like we won't have to worry about this anymore once Perl 6 comes around. I'm definitely looking forward to that. ;-) Auto-vivification is where Perl tries to DWYM when you use an undefined value as an array (or hash) reference. Basically, it creates the anonymous array for you.
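
The notations can be compared in a short sketch. The data is made up, and the last part is my reading of why the `||= []` on line 23 matters: under `use strict`, auto-vivification happens when the reference is used in a modifying context such as `push`, but merely reading through an undefined value as an array reference dies.

```perl
use strict;
use warnings;

my @rows    = ( [ 1, 5, 1 ], [ 2, 2, 3 ] );    # array of array references
my $listref = \@rows;                          # reference to the outer array

# Three spellings of "row 1, column 2":
my $arrow  = $listref->[1]->[2];
my $short  = $listref->[1][2];        # arrow between subscripts is optional
my $braces = ${ ${$listref}[1] }[2];

# Auto-vivification: row 5 does not exist yet, but pushing onto it
# makes Perl create the anonymous array on the spot.
push @{ $listref->[5] }, 'new';

# Reading is different: counting the elements of a row that is still
# undef is a run-time error under strict refs, so it is trapped here.
my $read_ok = eval { my $count = @{ $listref->[9] }; 1 };

print "$arrow $short $braces\n";      # 3 3 3
print scalar(@rows), "\n";            # 6
print ref( $rows[5] ), "\n";          # ARRAY
print $read_ok ? "read worked\n" : "read died: $@";
```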

        As for unless(@{$listref->[$c]} and $_ != $listref->[$c][0])... Think of $listref as a two dimensional array, with $c as the row and the second index (0, in this case) as the column. So, the $_ != $listref->[$c][0] basically compares the current value ($_) to the value in the first column of the $c'th row. If there are no columns in the current row (ie, @{$listref->[$c]} == 0, or is false), or if the current value is equal to the value in the first column of that row, the push adds another column to the row containing the current value. Otherwise, the other push adds a blank value to that column.

        For more information on array references and multidimensional arrays, check out `perldoc perlreftut` and perlref.

        You're getting the undefined value warnings because the second file has fewer lines in it than the other two files. I didn't put in any checks anywhere to account for files of differing lengths.
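
One possible way to cope with files of differing lengths, sketched under the assumption that each file knows its own column number. The helpers `store_cell` and `format_row` are hypothetical and not part of the script above. With `push`, once the second file runs out, values from the third file land in the second column, which seems to be what the `1657 | 1657 |` rows in the output show. Storing by explicit column index leaves the hole where it belongs, and the print step can turn it into a blank.

```perl
use strict;
use warnings;

# Store a value by explicit row and column instead of push, so a short
# file leaves a hole (undef) in its own column only.
sub store_cell {
    my ( $listref, $row, $col, $value ) = @_;
    $listref->[$row][$col] = $value;
    return;
}

# Join one row with " | ", printing a blank for any missing cell.
sub format_row {
    my ( $row, $columns ) = @_;
    return join ' | ',
        map { ( defined $row->[$_] ) ? $row->[$_] : ' ' } 0 .. $columns - 1;
}

my @master_list;
store_cell( \@master_list, 0, 0, 1656 );    # file 1
store_cell( \@master_list, 1, 0, 1657 );
store_cell( \@master_list, 0, 1, ' ' );     # file 2 ends after row 0
store_cell( \@master_list, 0, 2, 1656 );    # file 3
store_cell( \@master_list, 1, 2, 1657 );

print format_row( $_, 3 ), "\n" for @master_list;
```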

        As for the values not getting printed in the second column, that's because the values in file 2 don't match the values in file 1 or 3. Perhaps there was a misunderstanding of the goal. I understood you to mean that you wanted to compare the files on a line-by-line basis, comparing the lines in the file and printing the ones that were equal. Such that the following files (ignoring header lines):

        File 1  File 2  File 3
        5       1       5
        6       2       6
        7       5       7
        8       6       8

        Would, in fact, produce the following output:

        5 |   | 5
        6 |   | 6
        7 |   | 7
        8 |   | 8

        Perhaps you meant for it to actually compare the files on a value-by-value basis. Such that the same files as above would instead produce this output:

        5 | 5 | 5
        6 | 6 | 6
        7 |   | 7
        8 |   | 8

        If you meant it to be the latter, let me know and I will try to come up with a solution for that. Perhaps I can even make it more understandable. ;-) Also, I'd like to know if the data is guaranteed to be numeric and, if so, what the range is. That would make the solution a bit easier.

        Anyway, I'm glad to help. I'm sorry my code is so confusing. It's definitely not the best I've ever written. :-P At any rate, HTH...

        bbfu
        Seasons don't fear The Reaper.
        Nor do the wind, the sun, and the rain.
        We can be like they are.

Re: Printing out multiple array lists and more!
by mr.nick (Chaplain) on May 17, 2001 at 03:51 UTC
    I'm almost embarrassed to submit this (it's very dirty as far as I can tell), but here is my go:

    #!/usr/bin/perl
    use strict;

    sub readfile {
      my $fn=shift;
      my @ar;

      open IN,"<$fn" || die "Couldn't open '$fn': $!";
      ## snarf the header
      scalar <IN>;
      ## read the rest
      while (<IN>) {
        chomp;
        push @ar,$_;
      }
      close IN;

      @ar;
    }

    ## load the files into arrays
    my @a=readfile '1';
    my @b=readfile '2';
    my @c=readfile '3';

    ## iterate until all arrays are empty
    while (defined @a || defined @b || defined @c) {
      ## find the lowest value
      my $s;
      $s=$a[0] < $b[0] ? $a[0] : $b[0];
      $s=$s < $c[0] ? $s: $c[0];

      ## print them out (if they match)
      printf "%5s", $a[0]==$s ? scalar shift @a : undef;
      print " | ";
      printf "%5s", $b[0]==$s ? scalar shift @b : undef;
      print " | ";
      printf "%5s", $c[0]==$s ? scalar shift @c : undef;

      print "\n";
    }

    What I don't like about this: I don't like the selection of the next value (the assignment to $s). It does, however, allow for non-linear data (you can have 1,2,3,4 in one file, 2,3,10 in another and 66 in the third and it will still work).
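
For comparison, a hedged sketch of the same lowest-value-first idea written so that it runs on current perls. It is not the code posted above: `merge_columns` is an invented name, the lists are passed in as array references rather than read from files, and the column width formatting is dropped. Two things in the posted version are worth knowing about: `defined @a` is deprecated and is a fatal error in newer perls (testing `@a` directly is enough), and in `open IN,"<$fn" || die ...` the `||` binds to the filename string, so a failed open goes unnoticed. Since the script opens '1', '2' and '3' rather than the f1.lst-style names used elsewhere in the thread, that may be one reason a run produces no output at all.

```perl
use strict;
use warnings;

# Merge several sorted numeric lists row by row: each output row holds
# the smallest value still waiting, placed under every list that has
# it, with a blank under the lists that do not.
sub merge_columns {
    my @lists = map { [@$_] } @_;    # copy so the caller's arrays survive
    my @rows;

    while ( grep { @$_ } @lists ) {    # loop while any list has elements
        # smallest head among the non-empty lists
        my ($low) = sort { $a <=> $b } map { $_->[0] } grep { @$_ } @lists;

        my @cells;
        for my $list (@lists) {
            if ( @$list && $list->[0] == $low ) {
                push @cells, shift @$list;    # consume the matching value
            }
            else {
                push @cells, ' ';             # nothing for this column
            }
        }
        push @rows, join ' | ', @cells;
    }
    return @rows;
}

print "$_\n"
    for merge_columns( [ 1, 2, 3, 4, 5 ], [ 2, 3, 4, 6 ], [ 1, 3, 4, 5 ] );
```

With those three sample lists the rows come out in the shape of the "final file" in the original question, ending with a row that has 6 in the middle column only.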
      I tried your code as well. Unfortunately, I got no output =P. If you want to work with the lists that I am using click here.

      ----------
      - Jim

Re: Printing out multiple array lists and more!
by Brovnik (Hermit) on May 17, 2001 at 20:48 UTC
    Since the lists are numeric, sorted and fairly small, a simple for loop works well. (Borrowing from mr. nick's code)
    #!/usr/local/bin/perl

    use strict;

    my $max = 0;

    sub readfile {
      my $fn=shift;
      my @ar;

      open IN,"<$fn" || die "Couldn't open '$fn': $!";
      ## snarf the header
      scalar <IN>;
      ## read the rest
      while (<IN>) {
        chomp;
        push @ar,$_;
        $max = $_ if $_ > $max;
      }
      close IN;

      @ar;
    }


    ## load the files into arrays
    my @a=readfile 'f1.lst';
    my @b=readfile 'f2.lst';
    my @c=readfile 'f3.lst';

    ## iterate until all arrays are empty
    for (my $i = 0; $i <= $max; $i++)
    {
      next unless $a[0] == $i || $b[0] == $i ||$c[0] == $i;

      ## print them out (if they match)
      printf "%5s", $a[0]==$i ? scalar shift @a : undef;
      print " | ";
      printf "%5s", $b[0]==$i ? scalar shift @b : undef;
      print " | ";
      printf "%5s", $c[0]==$i ? scalar shift @c : undef;

      print "\n";
    }

    This works on the lists you supplied. Brovnik.
      This worked perfectly!! I am studying the code right now. I understand the majority of it. I have one question though...

      Code for reference:

       1  #!/usr/local/bin/perl
       2
       3  use strict;
       4
       5  my $max = 0;
       6
       7  sub readfile {
       8    my $fn=shift;
       9    my @ar;
      10
      11    open IN,"<$fn" || die "Couldn't open '$fn': $!";
      12    ## snarf the header
      13    scalar <IN>;
      14    ## read the rest
      15    while (<IN>) {
      16      chomp;
      17      push @ar,$_;
      18      $max = $_ if $_ > $max;
      19    }
      20    close IN;
      21
      22    @ar;
      23  }
      24
      25
      26  ## load the files into arrays
      27  my @a=readfile 'f1.lst';
      28  my @b=readfile 'f2.lst';
      29  my @c=readfile 'f3.lst';
      30
      31  ## iterate until all arrays are empty
      32  for (my $i = 0; $i <= $max; $i++)
      33  {
      34    next unless $a[0] == $i || $b[0] == $i || $c[0] == $i;
      35
      36    ## print them out (if they match)
      37    printf "%5s", $a[0]==$i ? scalar shift @a : undef;
      38    print " | ";
      39    printf "%5s", $b[0]==$i ? scalar shift @b : undef;
      40    print " | ";
      41    printf "%5s", $c[0]==$i ? scalar shift @c : undef;
      42
      43    print "\n";
      44  }

      My only question is: In lines 36-41 you have a snippet  scalar shift @array. What is this doing? :) Thank you very much!!

      ----------
      - Jim

        In my original code (which does not work for the test data :() I detected the end of the looping by testing to see if any of the arrays were still defined (had contents). Since I needed to pull the next number off the list to display it, I just combined a few operations from
        if ($a[0]==$i) { printf "%5s",$a[0]; shift @a; }
        into the code you see. Oh, and I used scalar probably without need, but I wanted to make sure that the shift statement didn't pull off more than one element.
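
A tiny sketch to confirm that point, with made-up data: `shift` removes and returns exactly one element whether it is called in scalar or list context, so the `scalar` is harmless but not needed.

```perl
use strict;
use warnings;

my @queue = ( 10, 20, 30 );

my $first = shift @queue;        # removes and returns one element: 10
my @taken = ( shift @queue );    # list context: still only one element

print "first=$first taken=@taken left=@queue\n";    # first=10 taken=20 left=30
```

One small aside: under `-w` or `use warnings`, passing `undef` to `printf "%5s"` produces an "uninitialized value" warning, so an empty string may be a quieter choice for the blank cell.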