perl_n00b has asked for the wisdom of the Perl Monks concerning the following question:

Okay I have another one for you guys:
This code is extremely inefficient (edit: actually, I don't think it even works; something in it is hanging my system). Is there a better way to write it? (My files are very large.)
I have 2 files that look like this:
UR_CI file
G_00160 F_02571
G_00161 F_01082
G_00162 F_00034
G_00163 F_00035
G_00164 F_00036
and
CI_UR file
F_00013 G_06670
F_00034 G_00162
F_00035 G_00163
F_00036 G_00164
F_00038 G_00165

I am trying to find the matches that look like "G_00163 F_00035" & "F_00035 G_00163"
open $IN1, "<", $infile1 or die "cannot open $infile1: $!\n";
open $IN2, "<", $infile2 or die "cannot open $infile2: $!\n";
open $OUT, ">", $outfile or die "cannot open $outfile: $!\n";

my(@UR_CI) = <$IN1>;
my(@CI_UR) = <$IN2>;
my(%UR_CI, %CI_UR);

foreach my $seq (@UR_CI) {
    if ($seq =~ /^(\w+)\s+(\w+)$/) {
        $UR_CI{"$1"} = "$2";
    }
}

foreach my $seq (@CI_UR) {
    if ($seq =~ /^(\w+)\s+(\w+)$/) {
        $CI_UR{"$2"} = "$1";
    }
}

foreach my $key1 (keys %UR_CI) {
    foreach my $key2 (keys %CI_UR) {
        foreach my $value1 (values %UR_CI) {
            foreach my $value2 (values %CI_UR) {
                if (($UR_CI{$key1} eq $CI_UR{$key2}) && ($UR_CI{$value1} ne $CI_UR{$value2})) {
                    delete $UR_CI{$key1};
                }
            }
        }
    }
}

while ( my ($key, $value) = each(%UR_CI) ) {
    print $OUT "$key => $value\n";
}

Any ideas?
Any help is greatly appreciated!

Replies are listed 'Best First'.
Re: Hash Comparisions
by GrandFather (Saint) on Nov 23, 2009 at 10:29 UTC

    Generally, nested loops are a code smell. Nesting loops four deep goes beyond stinking to somewhere around putrid! For modest-size files - say, up to a few hundred megabytes for the smaller of the two (the size of the second file doesn't matter) - reading the smaller file into a hash and using that as a lookup is the preferred solution. Consider:

    use strict;
    use warnings;

    my $inData1 = <<DATA1;
    G_00160 F_02571
    G_00161 F_01082
    G_00162 F_00034
    G_00163 F_00035
    G_00164 F_00036
    DATA1

    my $inData2 = <<DATA2;
    F_00013 G_06670
    F_00034 G_00162
    F_00035 G_00163
    F_00036 G_00164
    F_00038 G_00165
    DATA2

    my $outfile;

    open my $ur_ci, "<", \$inData1;
    my %urCi = map {chomp; split} <$ur_ci>;
    close $ur_ci;

    my %matches;
    open my $ci_ur, "<", \$inData2;
    while (<$ci_ur>) {
        chomp;
        my ($ci, $ur) = split;
        $matches{$ci} = $ur if exists $urCi{$ur} && $urCi{$ur} eq $ci;
    }

    print "$_ => $matches{$_}\n" for sort keys %matches;

    Prints:

    F_00034 => G_00162
    F_00035 => G_00163
    F_00036 => G_00164

    If your input files are both rather larger than would easily fit in memory (more than about a quarter of your memory size for the smaller file), then you should really consider using a database. If this is a one-off task, SQLite may be a good choice. Consider:

    use strict;
    use warnings;

    use DBI;

    my $inData1 = <<DATA1;
    G_00160 F_02571
    G_00161 F_01082
    G_00162 F_00034
    G_00163 F_00035
    G_00164 F_00036
    DATA1

    my $inData2 = <<DATA2;
    F_00013 G_06670
    F_00034 G_00162
    F_00035 G_00163
    F_00036 G_00164
    F_00038 G_00165
    DATA2

    unlink 'db.SQLite';
    my $dbh = DBI->connect ("dbi:SQLite:dbname=db.SQLite", "", "");

    $dbh->do ('CREATE TABLE urci (ci TEXT, ur TEXT)');
    $dbh->do ('CREATE TABLE ciur (ur TEXT, ci TEXT)');

    my $sth = $dbh->prepare ('INSERT INTO urci (ur, ci) VALUES (?, ?)');
    open my $ur_ci, "<", \$inData1;
    $sth->execute (do {chomp; split}) while <$ur_ci>;
    close $ur_ci;

    $sth = $dbh->prepare ('INSERT INTO ciur (ci, ur) VALUES (?, ?)');
    open my $ci_ur, "<", \$inData2;
    $sth->execute (do {chomp; split}) while <$ci_ur>;
    close $ci_ur;

    $sth = $dbh->prepare (
        'SELECT * FROM ciur INNER JOIN urci ON ciur.ci = urci.ci AND ciur.ur = urci.ur'
    );
    $sth->execute ();
    print "$_->{ci} => $_->{ur}\n" while $_ = $sth->fetchrow_hashref ();

    Prints:

    F_00034 => G_00162
    F_00035 => G_00163
    F_00036 => G_00164

    True laziness is hard work
Re: Hash Comparisions
by Khen1950fx (Canon) on Nov 23, 2009 at 08:31 UTC
    I made a few minor adjustments:

    die "cannot open ${infile1}: $!\n" unless open $IN1, '<', $infile1;
    die "cannot open ${infile2}: $!\n" unless open $IN2, '<', $infile2;
    die "cannot open ${outfile}: $!\n" unless open $OUT, '>', $outfile;

    my (@UR_CI) = <$IN1>;
    my (@CI_UR) = <$IN2>;
    my ( %UR_CI, %CI_UR );

    foreach my $seq (@UR_CI) {
        if ( $seq =~ /^(\w+)\s+(\w+)$/ ) {
            $UR_CI{"$1"} = "$2";
        }
    }

    foreach my $seq (@CI_UR) {
        if ( $seq =~ /^(\w+)\s+(\w+)$/ ) {
            $CI_UR{"$2"} = "$1";
        }
    }

    foreach my $key1 ( keys %UR_CI ) {
        foreach my $key2 ( keys %CI_UR ) {
            foreach my $value1 ( values %UR_CI ) {
                foreach my $value2 ( values %CI_UR ) {
                    if ( $UR_CI{$key1} eq $CI_UR{$key2}
                        and $UR_CI{$value1} ne $CI_UR{$value2} )
                    {
                        delete $UR_CI{$key1};
                    }
                }
            }
        }
    }

    while ( my ( $key, $value ) = each %UR_CI ) {
        print $OUT "$key => $value\n";
    }
Re: Hash Comparisions
by grizzley (Chaplain) on Nov 23, 2009 at 08:36 UTC
    Load only first file to a hash. Read second one - line after line and process one line only (search, store in output file etc.).
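
    A minimal sketch of that approach, using in-memory data in place of the real files (the sample records are the ones from the question):

```perl
use strict;
use warnings;

# Sample data standing in for the two input files.
my $inData1 = <<'DATA1';
G_00160 F_02571
G_00161 F_01082
G_00162 F_00034
G_00163 F_00035
G_00164 F_00036
DATA1

my $inData2 = <<'DATA2';
F_00013 G_06670
F_00034 G_00162
F_00035 G_00163
F_00036 G_00164
F_00038 G_00165
DATA2

# Load only the first file into a hash: UR key => CI value.
open my $in1, '<', \$inData1 or die "cannot open first input: $!\n";
my %ur_ci;
while (<$in1>) {
    my ($ur, $ci) = split;
    $ur_ci{$ur} = $ci if defined $ci;
}
close $in1;

# Stream the second file line by line; print each match as it is seen,
# so the second file never has to fit in memory.
open my $in2, '<', \$inData2 or die "cannot open second input: $!\n";
while (<$in2>) {
    my ($ci, $ur) = split;
    next unless defined $ur;
    print "$ur => $ci\n" if exists $ur_ci{$ur} && $ur_ci{$ur} eq $ci;
}
close $in2;
```

    For the real files, replace the in-memory handles with ordinary `open ... '<', $infile` calls; only the first (smaller) file's contents end up in the hash.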
Re: Hash Comparisions
by vitoco (Hermit) on Nov 23, 2009 at 12:21 UTC

    I don't know if one UR key may appear many times in one input file, each time with the same or different CI keys, or vice versa.

    If so, the hash key must be the whole (chomped) record from one input file; then proceed just as grizzley said, but reverse the fields of each record from the second file when testing for existence in the loop.

    Also, if there are many records with exactly the same keys in the second file, you can delete the key from the hash each time you match it, just to get unique records in the output file.

    Bonus track (not tested):

    #!perl
    use strict;
    use warnings;

    # ...

    open $IN1, "<", $infile1 or die "cannot open $infile1: $!\n";
    open $IN2, "<", $infile2 or die "cannot open $infile2: $!\n";
    open $OUT, ">", $outfile or die "cannot open $outfile: $!\n";

    my %pair = ();

    while (<$IN1>) {
        chomp;
        s/^(\w+)\s+(\w+)$/$1 $2/;  # just one space between keys
        $pair{$_} = 1;
    }

    while (<$IN2>) {
        chomp;
        s/^(\w+)\s+(\w+)$/$2 $1/;  # swap keys
        if (exists $pair{$_}) {
            print $OUT "$_\n";
            delete $pair{$_};
        }
    }

    close $IN1;
    close $IN2;
    close $OUT;

    Update: Initialized and changed name of the hash and newline added on output records. Still not tested!

      Amazing! Thank you so much Vitoco and GrandFather, I really appreciate it!