in reply to Combining Files
This is an explanation of the code I wrote above, at the request of some users. Again, this is not very flexible code, since it always matches on the same fields, but it does demonstrate hashes, anonymous arrays, and references.
use strict;
## "use strict": never leave home without it.

my $first  = shift or die "Need the first file!\n";
my $second = shift or die "Need the second file!\n";
my $third  = shift or die "Need the third file!\n";
## Grab the files we want to parse and create,
## using the shift trick to grab items off of the
## @ARGV array.

open(FIRST,  "$first")  or die "Could not open $first: $!\n";
open(SECOND, "$second") or die "Could not open $second: $!\n";
open(THIRD,  ">$third") or die "Could not write $third: $!\n";
## We try to open all three files first, since the first
## operation can be memory/time-intensive, and there is
## no point in doing that if any of the three files cannot
## be read or opened. Hint: use or, not || after an open

my (%first, %found);
## %first will store the information from the first data
## file, and %found is used to keep track of whether or
## not each item in the first data file is used in the
## second.

while(<FIRST>) {
  chomp;
  ## Strip the newline first. Because we pass -1 to split
  ## below, the newline would otherwise stay attached to
  ## the last field.

  my ($key, @cols) = split(m#\|# => $_, -1);
  ## For every line in the first data file, we split it
  ## on the "pipe" character. We give split an argument
  ## of -1 to indicate that we do not want trailing null
  ## fields to be stripped. Once they are split, the first
  ## item, or field, from the line is put into the variable
  ## $key, and the rest are put into the array @cols.
  ## Example: for the line:
  ## 12345|8432|FooBar||BAZ|||
  ## $key becomes "12345"
  ## @cols becomes ("8432", "FooBar", "", "BAZ", "", "", "")

  push(@{$first{$key}}, \@cols);
  ## This is a little tricky. We are pushing a reference to
  ## the cols array into an anonymous array, which is the
  ## value for the hash key $key.
  ## To break this down a bit:
  ##   my $refarray = \@cols;  ## reference to cols array
  ##   $first{$key} = [];      ## anonymous array
  ##   push(@{$first{$key}}, $refarray);
  ## (In the real code the anonymous array is only created
  ## the first time a key is seen; push does that for us.)
  ## We use the anonymous array for the cases where more
  ## than one line has the same key. If we knew that each
  ## key was unique and never appeared on more than one
  ## line, we could use:
  ##   $first{$key} = \@cols;
  ## But, since we may have multiples, the value of the hash
  ## points to an array. Each item in that array is another
  ## array, which contains the information that split grabbed.
  ## This part can be confusing at first, but it's not so bad
  ## once you get the hang of references. Remember that every
  ## hash value, and every array element, can contain exactly
  ## one and only one piece of information, be it a string,
  ## a number, or a reference. Think of a reference as a piece
  ## of string that leads to another container, usually a
  ## hash or an array.

  $found{$key} ||= $.;
  ## Finally moving on. :) This stores the current line number
  ## of the first data file into the hash %found, at the key
  ## "$key". We could store this into an array as well, since
  ## the same key may appear on multiple lines, but in this
  ## case, grabbing only one of the lines is good enough.
  ## We use the ||= so that only the first line in which the
  ## key appears is saved, and the rest discarded.
}
close(FIRST);
## At this point, the first data file is completely in memory,
## stored in the hash %first.

while(<SECOND>) {
  ## Now we read in each line of the second file, similar
  ## to the way we did the first file. Since the first
  ## data file is in memory, we do not need to store the
  ## second file into memory, but can simply parse it
  ## line by line.

  chomp;
  my ($one, $key, @cols) = split(m#\|# => $_, -1);
  ## In this case, we split the same as before, but use
  ## the second field as the matching key. The first field
  ## is stored in $one, and the rest of the line in @cols.

  if (exists $first{$key}) {
    delete $found{$key};
    for (@{$first{$key}}) {
      print THIRD "$_->[0]|$one|$cols[0]\n";
    }
  }
  ## First we make sure that this key was seen before, by
  ## checking to see if the key exists in the hash %first.
  ## Remember to always use exists and not simply
  ## if ($first{$key}) when checking hash elements, as
  ## testing for existence is not the same as testing
  ## for "truth".
  ## Next, we delete an element from the hash %found. We do
  ## not need to check for the existence of this one, as
  ## it is automatically set when $first{$key} is. However,
  ## note that if we find a second line with the same key in
  ## the second file, the delete has nothing to remove, as
  ## that key has already been deleted from the hash. This is
  ## harmless (delete on a missing key simply does nothing),
  ## but to be really explicit we could have said:
  ##   delete $found{$key} if exists $found{$key};
  ## Next, we go into the array pointed to by $first{$key}.
  ## Recall that every line from the first data file with
  ## that key will create another item in the array.
  ## Finally, we print out (to the third file, open for writing)
  ## the first element of the array, which is the *second*
  ## "pipe-delimited" field of the first data file. Then we
  ## print the *first* element of the second data file line
  ## we are currently reading, followed by the third field
  ## from the second, which is simply the first element of
  ## the cols array. (We could even make this a simple scalar
  ## since we never use the rest of this array.)
  ## That's the main part of the program. The rest is just
  ## cleanup and error checking.
  else {
    printf "Line %5d: Record %5d exists in %s but not in %s\n",
      $., $key, $second, $first;
    ## If we find a line in the second data file that does not
    ## have a match in the first file, we write a message to
    ## STDOUT telling the line number and key where this occurred.
    ## The file names are passed as %s arguments rather than
    ## interpolated, so a "%" in a file name cannot confuse printf.
  }
}
close(SECOND);
close(THIRD);

## Double check the first file:
for (sort {$found{$a} <=> $found{$b}} keys %found) {
  ## We loop through all the remaining items in the %found hash,
  ## sorting them by the line number for easier output. Keys
  ## that were found in the second data file have been deleted.
  printf "Line %5d: Record %5d exists in %s but not in %s\n",
    $found{$_}, $_, $first, $second;
}
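For anyone who wants to poke at the data structure on its own, here is a small standalone sketch of the three ideas the comments lean on: pushing array references into a hash value, exists versus "truth", and delete on a key that is not there. The sample lines and the %seen hash are made up for illustration and are not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the first data file.
my @lines = (
    "12345|8432|FooBar",
    "12345|9999|Quux",
    "777|1111|Other",
);

my %first;
for my $line (@lines) {
    my ($key, @cols) = split(/\|/, $line, -1);
    # @cols is a fresh lexical on every pass, so each reference
    # pushed here points at a different array.
    push @{ $first{$key} }, \@cols;
}

# Two lines shared the key 12345, so its value holds two array refs.
my $count  = scalar @{ $first{12345} };
my $second = $first{12345}[1][0];    # first column of the second line

# exists versus truth: a key can be present and still hold a false value.
my %seen = (zero => 0);
my $is_true  = $seen{zero}        ? 1 : 0;
my $is_there = exists $seen{zero} ? 1 : 0;

# delete on a missing key does nothing and returns undef.
my $gone = delete $seen{missing};

print "$count rows for 12345; second row starts with $second\n";
print "true=$is_true exists=$is_there\n";
```

Run as-is, it should report two rows for key 12345, and show that the "zero" key exists even though its value is false.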
Hope this helps. Questions welcome.