perl_n00b has asked for the wisdom of the Perl Monks concerning the following question:

Okay I have another one for you guys:
This code is extremely inefficient (edit: actually, I don't think it even works; something in it is hanging my system). Is there a better way to write it? (My files are very large.)
I have 2 files that look like this:
UR_CI file
G_00160 F_02571
G_00161 F_01082
G_00162 F_00034
G_00163 F_00035
G_00164 F_00036
and
CI_UR file
F_00013 G_06670
F_00034 G_00162
F_00035 G_00163
F_00036 G_00164
F_00038 G_00165

I am trying to find the matches that look like "G_00163 F_00035" & "F_00035 G_00163"
open $IN1, "<", $infile1 or die "cannot open $infile1: $!\n";
open $IN2, "<", $infile2 or die "cannot open $infile2: $!\n";
open $OUT, ">", $outfile or die "cannot open $outfile: $!\n";

my(@UR_CI) = <$IN1>;
my(@CI_UR) = <$IN2>;
my(%UR_CI, %CI_UR);

foreach my $seq (@UR_CI) {
    if ($seq =~ /^(\w+)\s+(\w+)$/) {
        $UR_CI{"$1"} = "$2";
    }
}

foreach my $seq (@CI_UR) {
    if ($seq =~ /^(\w+)\s+(\w+)$/) {
        $CI_UR{"$2"} = "$1";
    }
}

foreach my $key1 (keys %UR_CI) {
    foreach my $key2 (keys %CI_UR) {
        foreach my $value1 (values %UR_CI) {
            foreach my $value2 (values %CI_UR) {
                if (($UR_CI{$key1} eq $CI_UR{$key2}) && ($UR_CI{$value1} ne $CI_UR{$value2})) {
                    delete $UR_CI{$key1};
                }
            }
        }
    }
}

while ( my ($key, $value) = each(%UR_CI) ) {
    print $OUT "$key => $value\n";
}

Any ideas?
Any help is greatly appreciated!

Replies are listed 'Best First'.
Re: Hash Comparisions
by GrandFather (Saint) on Nov 23, 2009 at 10:29 UTC

    Generally, nested loops are a code smell. Nesting loops four deep goes beyond stinking to somewhere around putrid! For modest-size files - say, up to a few hundred megabytes for the smaller of the two (the size of the second file doesn't matter) - reading the smaller file into a hash and using that as a lookup is the preferred solution. Consider:

    use strict;
    use warnings;

    my $inData1 = <<DATA1;
    G_00160 F_02571
    G_00161 F_01082
    G_00162 F_00034
    G_00163 F_00035
    G_00164 F_00036
    DATA1

    my $inData2 = <<DATA2;
    F_00013 G_06670
    F_00034 G_00162
    F_00035 G_00163
    F_00036 G_00164
    F_00038 G_00165
    DATA2

    my $outfile;

    open my $ur_ci, "<", \$inData1;
    my %urCi = map {chomp; split} <$ur_ci>;
    close $ur_ci;

    my %matches;
    open my $ci_ur, "<", \$inData2;
    while (<$ci_ur>) {
        chomp;
        my ($ci, $ur) = split;
        $matches{$ci} = $ur if exists $urCi{$ur} && $urCi{$ur} eq $ci;
    }

    print "$_ => $matches{$_}\n" for sort keys %matches;

    Prints:

    F_00034 => G_00162
    F_00035 => G_00163
    F_00036 => G_00164

    If your input files are both rather larger than would easily fit in memory (more than about a quarter of your memory size for the smaller file), then you should really consider using a database. If this is a one-off task, SQLite may be a good choice. Consider:

    use strict;
    use warnings;

    use DBI;

    my $inData1 = <<DATA1;
    G_00160 F_02571
    G_00161 F_01082
    G_00162 F_00034
    G_00163 F_00035
    G_00164 F_00036
    DATA1

    my $inData2 = <<DATA2;
    F_00013 G_06670
    F_00034 G_00162
    F_00035 G_00163
    F_00036 G_00164
    F_00038 G_00165
    DATA2

    unlink 'db.SQLite';
    my $dbh = DBI->connect ("dbi:SQLite:dbname=db.SQLite", "", "");

    $dbh->do ('CREATE TABLE urci (ci TEXT, ur TEXT)');
    $dbh->do ('CREATE TABLE ciur (ur TEXT, ci TEXT)');

    my $sth = $dbh->prepare ('INSERT INTO urci (ur, ci) VALUES (?, ?)');
    open my $ur_ci, "<", \$inData1;
    $sth->execute (do {chomp; split}) while <$ur_ci>;
    close $ur_ci;

    $sth = $dbh->prepare ('INSERT INTO ciur (ci, ur) VALUES (?, ?)');
    open my $ci_ur, "<", \$inData2;
    $sth->execute (do {chomp; split}) while <$ci_ur>;
    close $ci_ur;

    $sth = $dbh->prepare (
        'SELECT * FROM ciur INNER JOIN urci ON ciur.ci = urci.ci AND ciur.ur = urci.ur'
    );
    $sth->execute ();
    print "$_->{ci} => $_->{ur}\n" while $_ = $sth->fetchrow_hashref ();

    Prints:

    F_00034 => G_00162
    F_00035 => G_00163
    F_00036 => G_00164

    True laziness is hard work
Re: Hash Comparisions
by Khen1950fx (Canon) on Nov 23, 2009 at 08:31 UTC
    I made a few minor adjustments:

    die "cannot open ${infile1}: $!\n" unless open $IN1, '<', $infile1;
    die "cannot open ${infile2}: $!\n" unless open $IN2, '<', $infile2;
    die "cannot open ${outfile}: $!\n" unless open $OUT, '>', $outfile;

    my (@UR_CI) = <$IN1>;
    my (@CI_UR) = <$IN2>;
    my ( %UR_CI, %CI_UR );

    foreach my $seq (@UR_CI) {
        if ( $seq =~ /^(\w+)\s+(\w+)$/ ) {
            $UR_CI{"$1"} = "$2";
        }
    }

    foreach my $seq (@CI_UR) {
        if ( $seq =~ /^(\w+)\s+(\w+)$/ ) {
            $CI_UR{"$2"} = "$1";
        }
    }

    foreach my $key1 ( keys %UR_CI ) {
        foreach my $key2 ( keys %CI_UR ) {
            foreach my $value1 ( values %UR_CI ) {
                foreach my $value2 ( values %CI_UR ) {
                    if ( $UR_CI{$key1} eq $CI_UR{$key2}
                        and $UR_CI{$value1} ne $CI_UR{$value2} )
                    {
                        delete $UR_CI{$key1};
                    }
                }
            }
        }
    }

    while ( my ( $key, $value ) = each %UR_CI ) {
        print $OUT "$key => $value\n";
    }
Re: Hash Comparisions
by grizzley (Chaplain) on Nov 23, 2009 at 08:36 UTC
    Load only first file to a hash. Read second one - line after line and process one line only (search, store in output file etc.).
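
    A minimal sketch of that approach, using in-memory data in place of the real files (the sample records are the ones from the question):

```perl
use strict;
use warnings;

# Sample data standing in for the two input files.
my $inData1 = <<'DATA1';
G_00160 F_02571
G_00161 F_01082
G_00162 F_00034
G_00163 F_00035
G_00164 F_00036
DATA1

my $inData2 = <<'DATA2';
F_00013 G_06670
F_00034 G_00162
F_00035 G_00163
F_00036 G_00164
F_00038 G_00165
DATA2

# Load only the first file into a hash: UR key => CI value.
open my $in1, '<', \$inData1 or die "cannot open first input: $!\n";
my %ur_ci;
while (<$in1>) {
    my ($ur, $ci) = split;
    $ur_ci{$ur} = $ci if defined $ci;
}
close $in1;

# Stream the second file line by line; print each match as it is seen,
# so the second file never has to fit in memory.
open my $in2, '<', \$inData2 or die "cannot open second input: $!\n";
while (<$in2>) {
    my ($ci, $ur) = split;
    next unless defined $ur;
    print "$ur => $ci\n" if exists $ur_ci{$ur} && $ur_ci{$ur} eq $ci;
}
close $in2;
```

    For the real files, replace the in-memory handles with ordinary `open ... '<', $infile` calls; only the first (smaller) file's contents end up in the hash.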
Re: Hash Comparisions
by vitoco (Hermit) on Nov 23, 2009 at 12:21 UTC

    I don't know if one UR key may appear many times in one input file, each time with the same or different CI keys, or vice versa.

    If so, the hash key must be the whole (chomped) record from one input file; then proceed just as grizzley said, but reverse the fields of each record from the second file when testing for existence in the loop.

    Also, if there are many records with exactly the same keys in the second file, you can delete the key from the hash each time you match it, just to get unique records in the output file.

    Bonus track (not tested):

    #!perl
    use strict;
    use warnings;

    # ...

    open $IN1, "<", $infile1 or die "cannot open $infile1: $!\n";
    open $IN2, "<", $infile2 or die "cannot open $infile2: $!\n";
    open $OUT, ">", $outfile or die "cannot open $outfile: $!\n";

    my %pair = ();

    while (<$IN1>) {
        chomp;
        s/^(\w+)\s+(\w+)$/$1 $2/;  # just one space between keys
        $pair{$_} = 1;
    }

    while (<$IN2>) {
        chomp;
        s/^(\w+)\s+(\w+)$/$2 $1/;  # swap keys
        if (exists $pair{$_}) {
            print $OUT "$_\n";
            delete $pair{$_};
        }
    }

    close $IN1;
    close $IN2;
    close $OUT;

    Update: Initialized and changed name of the hash and newline added on output records. Still not tested!

      Amazing! Thank you so much Vitoco and GrandFather, I really appreciate it!