sophix has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys, I have the following task:
- Read two files into separate hashes
- Find duplicates by checking the common keys (e.g., id)
- Merge those two hashes while keeping only the duplicates and discarding the rest
- Print out the new hash into an output file

Note: The files are not of the same size (both in rows and columns), and furthermore they contain multiple values.

Problems - Well, basically everything. But to begin with, I could not figure out how to read files into hashes while keeping all the multiple values (it appears to catch one key and one value only). Should I read all the values as an array?

I would appreciate your wisdom. Thanks!

UPDATE: Thanks everyone! In one of the messages below, GrandFather provides working code that carries out the task I described in my question.

Re: Hash w/ multiple values + merging
by biohisham (Priest) on Feb 07, 2010 at 22:05 UTC
    To associate more than one value with a hash key you need a more advanced data structure: a hash of anonymous arrays. Here is one way to do that, reading the data from two files, File1.txt and File2.txt:
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    open(FH1, "File1.txt") or die("Error opening File1 $!\n");
    my %hash1;
    while (<FH1>) {
        next if /(trip|value1|value2)/;    # skip the header
        my ($key, $val1, $val2) = split /\s+/;
        push @{$hash1{$key}}, ($val1, $val2);    # hash of anonymous arrays
    }
    close FH1;

    open(FH2, "File2.txt") or die("Error opening File2 $!\n");
    my %hash2;
    while (<FH2>) {
        next if /(trip|value)/;    # skip the header
        my ($key, $val3) = split /\s+/;
        $hash2{$key} = $val3;
    }
    close FH2;

    # convey common keys into one hash..
    my %hash3;
    my ($key1, $key2);
    foreach $key1 (keys %hash1) {
        foreach $key2 (keys %hash2) {
            if ($key2 eq $key1) {
                push @{$hash3{$key1}}, @{$hash1{$key1}}, $hash2{$key2};
            }
        }
    }

    # Print to STDOUT
    print "TRIP\tvalue1\tvalue2\tvalue3\n";
    foreach my $key (keys %hash3) {
        print "$key\t";
        print "@{$hash3{$key}}\n";
    }
    UPDATE: perlref and perlreftut are additional must-reads. To manipulate these data structures you need to understand references and how to dereference them.
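The nested loops in the reply above compare every key of one hash with every key of the other. As a minimal sketch of the same "keep only common keys" step, a direct lookup with exists gives the same result without the inner loop. The sample data is copied from the thread and hard-coded here only so the snippet runs on its own; the variable names follow the reply.

```perl
use strict;
use warnings;

# Sample data standing in for the two parsed files (values from the thread).
my %hash1 = (
    ATG => [ 'adsad', 'dsf' ],
    CTG => [ '23432', '2342' ],
    TTA => [ '24312', '144' ],
);
my %hash2 = (
    ATG => 'asdas',
    CCG => 'asdadd',
    TTA => '24',
);

# Keep only keys present in both hashes. exists() is a direct hash lookup,
# so there is no need to loop over the keys of %hash2.
my %hash3;
for my $key (keys %hash1) {
    next unless exists $hash2{$key};
    $hash3{$key} = [ @{ $hash1{$key} }, $hash2{$key} ];
}

# Print one tab-separated row per common key.
for my $key (sort keys %hash3) {
    print join("\t", $key, @{ $hash3{$key} }), "\n";
}
```

With the sample data, only ATG and TTA end up in %hash3, since those are the only keys found in both hashes.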


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
      Thanks for the reply! I tried to print out %hash1 to see whether it has the right thing, it lists the keys as expected but then lists the values as array (okay) but does not show them explicitly.
      ATA ARRAY(0x183d294)
      CTT ARRAY(0x183d304)
      CTG ARRAY(0x182a674)
      TTA ARRAY(0x183d464)
      ATG ARRAY(0x278eb4)

        You're printing out the stringified references. An easy way to inspect data structures is Data::Dumper.

        What you are seeing is a reference to the data held under that key. "ARRAY(0x183d294)", for example, is the location where the value associated with "ATA" is stored. To access that value you need to dereference it using the appropriate dereferencing syntax; read the links at the bottom of my previous reply.

        Since the reference in this case is to an ARRAY, something like "@{$hash1{ATA}}" would show you the values associated with "ATA". To access them one at a time you can specify indices as you would for any regular array: $hash1{ATA}[0] is the first element of the anonymous array associated with the key "ATA".
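As a small illustration of the dereferencing forms just described, the sketch below builds a one-key hash of arrays by hand (the values are the ATA row from the sample data) and reads it back in several ways. It is only meant to show the syntax, not to replace the code in the reply.

```perl
use strict;
use warnings;

# One key holding an anonymous array, as built by push @{$hash1{$key}}, ...
my %hash1 = ( ATA => [ 'rff', 'sgsh' ] );

# Printing the value directly shows the stringified reference, ARRAY(0x...).
print "$hash1{ATA}\n";

# Dereference the whole array, or pick single elements by index.
my @values = @{ $hash1{ATA} };
my $first  = $hash1{ATA}[0];
my $last   = $hash1{ATA}->[-1];    # the arrow between subscripts is optional

print "ATA: @values\n";
print "first: $first, last: $last\n";
```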

        The module Data::Dumper would show you the data structures stringified so that you could judge if they look like you expected them before proceeding any further...

        # ADD this to the previous code...
        use Data::Dumper;
        print Data::Dumper->Dump([\%hash1], ['FIRST HASH']), "\n";
        print Data::Dumper->Dump([\%hash2], ['SECOND HASH']), "\n";
        print Data::Dumper->Dump([\%hash3], ['MERGED HASH']), "\n";

        Note also that there can be more than one way to do it which would become clearer when you start dealing with more complex data structures..


Re: Hash w/ multiple values + merging
by AnomalousMonk (Archbishop) on Feb 07, 2010 at 21:34 UTC
    ... I have the following [ill-defined] task: - Read two files into separate hashes ...

    To begin with, what are the structures of the files? What are the structures of the hashes? Pondering the answers to these questions may show you a way to begin to fulfill your task. Then, write some code (as erix has suggested in the CB) to implement just this first step, and we can see where to go from there.

Re: Hash w/ multiple values + merging
by sophix (Sexton) on Feb 07, 2010 at 21:32 UTC
    Here is what I got so far:
    open(FILE1, $ARGV[0]);
    while ($line = <FILE1>) {
        chomp($line);
        @words = split(/\t/, $line);
        $hash1{$words[0]} = $words[1];
    }
    close(FILE1);

    open(FILE3, ">$ARGV[2]");
    while (($key, $value) = each(%hash1)) {
        print FILE3 $key."\t".$value."\n";
    }
    print "\n";

    open(FILE1, $ARGV[1]);
    %hash2 = map { chomp; split /\t/ } <FILE1>;
    close(FILE1);
    while (($key, $value) = each(%hash2)) {
        print FILE3 $key."\t".$value."\n";
    }
    print "\n";
    __DATA1__
    trip  value1  value2
    ATG   adsad   dsf
    CTG   23432   2342
    TTA   24312   144
    CTT   452     5fw
    ATA   rff     sgsh

    __DATA2__
    trip  value3
    ATG   asdas
    CCG   asdadd
    TTA   24
    CAT   45

    __OUTPUT / DESIRED__
    trip  value1  value2  value3
    ATG   adsad   dsf     asdas
    TTA   24312   144     24

      The following code assumes that a header line is provided for each data set, as shown in the sample data. A hash is built containing the merged data from both files, then only those records containing data for all columns are printed.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $data1 = <<DATA1;
      trip value1 value2
      ATG adsad dsf
      CTG 23432 2342
      TTA 24312 144
      CTT 452 5fw
      ATA rff sgsh
      DATA1

      my $data2 = <<DATA2;
      trip value3
      ATG asdas
      CCG asdadd
      TTA 24
      CAT 45
      DATA2

      my %data;
      my @columnNames;

      open my $in, '<', \$data1;
      push @columnNames, parseFile (\%data, $in);
      close $in;

      open $in, '<', \$data2;
      push @columnNames, parseFile (\%data, $in);
      close $in;

      my $format = (('%-9s ') x (@columnNames + 1)) . "\n";

      printf $format, '', @columnNames;

      for my $key (sort keys %data) {
          next if keys %{$data{$key}} != @columnNames;
          printf $format, $key, @{$data{$key}}{@columnNames};
      }

      sub parseFile {
          my ($dataRef, $inFile) = @_;
          my $header = <$inFile>;
          my ($keyColumn, @columns) = map {chomp; split} $header;

          while (defined (my $line = <$inFile>)) {
              chomp $line;
              my ($key, @data) = split /\s+/, $line;
              @{$dataRef->{$key}}{@columns} = @data;
          }

          return @columns;
      }

      Prints:

                value1    value2    value3
      ATG       adsad     dsf       asdas
      TTA       24312     144       24

      Note that strictures are used. Always use strictures (use strict; use warnings;). The three parameter version of open is used with lexical file handles.

      @{$data{$key}}{@columnNames} and @{$dataRef->{$key}}{@columns} are hash slices - they access a list of hash values. The first case returns the list of values to be printed for a row. The second case is used to assign the list of column values to a record.
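To make the hash-slice syntax a little more concrete, here is a minimal standalone sketch. It uses the ATG row from the sample data; %record is a name introduced only for this illustration, and @columns and @values are hard-coded stand-ins for what parseFile would read from a file.

```perl
use strict;
use warnings;

my @columns = qw(value1 value2);
my @values  = qw(adsad dsf);

# Slice of a plain hash: assign several keys in one statement.
my %record;
@record{@columns} = @values;

# Slice through a reference, the form used in the reply above.
# $data{ATG} does not exist yet; Perl creates the inner hash on assignment.
my %data;
@{ $data{ATG} }{@columns} = @values;
$data{ATG}{value3} = 'asdas';

# Reading a slice returns the values in the order the keys are listed.
my @row = @{ $data{ATG} }{qw(value1 value2 value3)};
print "@row\n";
```

The leading sigil is @ in both directions because a slice is a list of values, even though the container is a hash.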

      Note that parseFile doesn't check to see that data column names for the current file are different than any previous file nor that the key column name (assumed to be the first) is the same. Those are all things that can be fixed if you need them to be.


      True laziness is hard work
        Thank you very much for this script! I would like to ask for some possible modifications.

        - I am not familiar with the open structure in this script. I would like to convert it to a familiar one (open(FILE1, "$ARGV[0]") etc.) but I could not do it. I tried the following:
        my $data1 = $ARGV[0];
        my $data2 = $ARGV[1];
        my $data3 = $ARGV[2];
        - Second, I failed at printing out once again. I used this one: print Data::Dumper->Dump([\%data],['MERGED HASH']),"\n"; How can I print the merged hash into an output file? I thought of, again, the familiar structure, but it did not work.
        open(FILE3, ">$ARGV[2]"); {print Data::Dumper->Dump([\%data],['MERGED HASH']),"\n";}
        Is it the reference again?
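A minimal sketch of how both requests above might be handled, under some assumptions. In the script above, open my $in, '<', \$data1 takes a reference to a string and reads that string as an in-memory file, so with $data1 set to $ARGV[0] it would read the file name rather than the file; passing the name itself, without the backslash, opens the real file. In the printing attempt, print is not given a file handle, so the output goes to STDOUT rather than to FILE3. The sub name mergeFiles and the hard-coded 'trip' header are assumptions made for this illustration; parseFile follows the reply above.

```perl
use strict;
use warnings;

# parseFile as in the reply above: reads a header line, then stores each
# row's values in $dataRef->{$key}{$columnName}.
sub parseFile {
    my ($dataRef, $inFile) = @_;
    my $header = <$inFile>;
    my ($keyColumn, @columns) = split ' ', $header;

    while (defined (my $line = <$inFile>)) {
        my ($key, @data) = split ' ', $line;
        next unless defined $key;    # skip blank lines
        @{$dataRef->{$key}}{@columns} = @data;
    }

    return @columns;
}

# Hypothetical wrapper (not from the thread): merge two named files and
# write only the rows that have a value for every column to a third file.
sub mergeFiles {
    my ($file1, $file2, $outFile) = @_;
    my (%data, @columnNames);

    for my $fileName ($file1, $file2) {
        # Pass the file name itself; \$fileName would open an in-memory
        # file whose content is the name string.
        open my $in, '<', $fileName or die "Can't read $fileName: $!\n";
        push @columnNames, parseFile(\%data, $in);
        close $in;
    }

    open my $out, '>', $outFile or die "Can't write $outFile: $!\n";
    # 'trip' is the key column name in the sample data (an assumption here).
    print {$out} join("\t", 'trip', @columnNames), "\n";

    for my $key (sort keys %data) {
        next if keys %{$data{$key}} != @columnNames;
        # Naming the handle sends the row to the file instead of STDOUT.
        print {$out} join("\t", $key, @{$data{$key}}{@columnNames}), "\n";
    }

    close $out or die "Can't close $outFile: $!\n";
    return;
}

# Usage: perl merge.pl File1.txt File2.txt Output.txt
mergeFiles(@ARGV) if @ARGV == 3;
```

The same handle form works for the Data::Dumper call: print {$out} Data::Dumper->Dump([\%data], ['merged']) would write the dump to the output file.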