Combining Files (code break down)

This is an explanation of the code I wrote above, at the request of some users. Again, this is not very flexible code, as it depends on always matching the same fields, but it does demonstrate the use of hashes, anonymous arrays, and references.
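
If the data structure is new to you, it may help to see it on its own before the full walkthrough. This is only a minimal sketch: the sample lines in @lines are made up for illustration and stand in for the first data file, but the split and push calls are the same ones the real code uses.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the first data file.
my @lines = ("100|a|b", "200|c|d", "100|e|f");

my %first;
for my $line (@lines) {
    my ($key, @cols) = split(/\|/, $line, -1);
    # On first use of a key, push creates the array reference for us.
    # @cols is a fresh array on every pass, so each reference is distinct.
    push(@{ $first{$key} }, \@cols);
}

# Key 100 appeared on two lines, so its value holds two inner arrays.
for my $row (@{ $first{100} }) {
    print join(",", @$row), "\n";
}
```

Each hash value is a reference to an array, and each element of that array is a reference to the fields of one line.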



use strict;

## "use strict": never leave home without it.

my $first  = shift or die "Need the first file!\n";
my $second = shift or die "Need the second file!\n";
my $third  = shift or die "Need the third file!\n";

## Grab the files we want to parse and create, 
## using the shift trick to grab items off of the 
## @ARGV array.

open(FIRST,  '<', $first)  or die "Could not open $first: $!\n";
open(SECOND, '<', $second) or die "Could not open $second: $!\n";
open(THIRD,  '>', $third)  or die "Could not write $third: $!\n";

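## We try to open all three files first, since reading the 
## first file can be memory/time-intensive, and there is 
## no point in doing that if any of the three files cannot 
## be opened. Hint: use "or", not "||", after an open. "||" 
## binds tighter, so without parentheses around open's 
## arguments it applies to the last argument, not to the open.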


my (%first, %found);

## %first will store the information from the first data 
## file, and %found is used to keep track of whether or 
## not each item in the first data file is used in the 
## second.

while(<FIRST>) {

  my ($key, @cols) = split(m#\|# => $_, -1);

  ## For every line in the first data file, we split it 
  ## on the "pipe" character. We give split an argument 
  ## of -1 to indicate that we do not want trailing null 
  ## fields to be stripped. Once they are split, the first 
  ## item, or field, from the line is put into the variable 
  ## $key, and the rest are put into the array @cols.

  ## Example: for the line:
  ## 12345|8432|FooBar||BAZ|||
  ## $key becomes "12345"
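  ## @cols becomes ("8432", "FooBar", "", "BAZ", "", "", "\n")
  ## (the three trailing pipes give three trailing fields; the 
  ## last one holds the newline, since we never chomp the line)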

  push(@{$first{$key}}, \@cols);

  ## This is a little tricky. We are pushing a reference to 
  ## the cols array into an anonymous array, which is the 
  ## value for the hash key $key.

  ## To break this down a bit:
  ## my $refarray = \@cols; ## reference to cols array
  ## $first{$key} ||= []; ## anonymous array, created only once
  ## push(@{$first{$key}}, $refarray);

  ## We use the anonymous array for the cases where more 
  ## than one line has the same key. If we knew that each 
  ## key was unique and never appeared on more than one 
  ## line, we could use:
  ## $first{$key} = \@cols;
  ## But, since we may have multiples, the value of the hash
  ## points to an array. Each item in that array is another 
  ## array, which contains the information that split grabbed.

  ## This part can be confusing at first, but it's not so bad 
  ## once you get the hang of references. Remember that every 
  ## hash value, and every array element, can contain exactly 
  ## one piece of information, be it a string, 
  ## a number, or a reference. Think of a reference as a piece 
  ## of string that leads to another container, usually a 
  ## hash or an array.

  $found{$key} ||= $.;

  ## Finally moving on. :) This stores the current line number 
  ## of the first data file into the hash %found, at the key 
  ## "$key". We could store this into an array as well, since 
  ## the same key may appear on multiple lines, but in this 
  ## case, grabbing only one of the lines is good enough.
  ## We use the ||= so that only the first line in which the 
  ## key appears is saved, and the rest discarded.
}
close(FIRST);

## At this point, the first data file is completely in memory, 
## stored in the hash %first.


while(<SECOND>) {

  ## Now we read in each line of the second file, similar 
  ## to the way we did the first file. Since the first 
  ## data file is in memory, we do not need to store the 
  ## second file into memory, but can simply parse it 
  ## line by line.

  my ($one, $key, @cols) = split(m#\|# => $_, -1);

  ## In this case, we split the same as before, but use 
  ## the second field as the matching key. The rest of 
  ## the line is stored in $one and @cols.

  if (exists $first{$key}) {
    delete $found{$key};
    for (@{$first{$key}}) {
      print THIRD "$_->[0]|$one|$cols[0]\n";
    }
  }

  ## First we make sure that this key was seen before, by 
  ## checking to see if the key exists in the hash %first. 
  ## Remember to always use exists and not simply a 
  ## if ($first{$key}) when checking hash elements, as 
  ## testing for existence is not the same as testing 
  ## for "truth".

  ## Next, we delete an element from the hash %found. We do 
  ## not need to check for the existence of this one, as 
  ## it is always set when $first{$key} is. Note that if a 
  ## second line with the same key turns up in the second 
  ## file, that key has already been removed from %found. 
  ## This is not a problem: delete on a missing key does 
  ## nothing and raises no error, so there is no need for:
  ## delete $found{$key} if exists $found{$key};

  ## Next, we loop over the array pointed to by $first{$key}. 
  ## Recall that every line from the first data file with 
  ## that key adds another item (an inner array) to it. 
  ## For each one we print (to the third file, open for writing) 
  ## the first element of the inner array, which is the *second* 
  ## "pipe-delimited" field of the first data file. Then we 
  ## print the *first* field of the second data file line 
  ## we are currently reading, followed by the third field 
  ## from that line, which is simply the first element of 
  ## the cols array. (We could even make this a simple scalar 
  ## since we never use the rest of this array).

  ## That's the main part of the program. The rest is just 
  ## cleanup and error checking

  else {
    printf "Line %5d: Record %5d exists in %s but not in %s\n",
           $., $key, $second, $first;

    ## If we find a line in the second data file that does not 
    ## have a match in the first file, we write a message to 
    ## STDOUT giving the line number and key where this occurred.

  }
}
close(SECOND);
close(THIRD);

## Double check the first file:
for (sort {$found{$a} <=> $found{$b}} keys %found) {

  ## We loop through all the remaining items in the %found hash,
  ## sorting them by the line number for easier output. Keys 
  ## that were found in the second data file have been deleted.

  printf "Line %5d: Record %5d exists in %s but not in %s\n",
         $found{$_}, $_, $first, $second;
}
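
Two of the smaller points from the comments are easy to check in isolation. The following is just a sketch and not part of the program: the hash %seen and its keys are invented for the demonstration, and the sample line is the one from the split example, without a trailing newline.

```perl
use strict;
use warnings;

# Trailing empty fields: kept with a limit of -1, dropped without it.
my $line    = "12345|8432|FooBar||BAZ|||";
my @kept    = split(/\|/, $line, -1);
my @dropped = split(/\|/, $line);
print scalar(@kept), " fields with -1, ", scalar(@dropped), " without\n";

# exists versus truth: a key can be present and still hold a false value.
my %seen = (zero => 0);
print "zero exists\n"    if exists $seen{zero};
print "zero is true\n"   if $seen{zero};            # not printed
print "missing exists\n" if exists $seen{missing};  # not printed

# delete on an absent key does nothing and raises no error.
delete $seen{missing};
```

The empty field in the middle of the line survives either way; only the trailing ones depend on the -1.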
Hope this helps. Questions welcome.