Nalababu has asked for the wisdom of the Perl Monks concerning the following question:

How do you compare two columns of data and print the results? The values in one column are repeated, so I want to compare them with the other column and print each matching value only once.
#!/usr/bin/perl -w
use strict;
use warnings;
use Getopt::Std;

my $file = "lethal_results2.txt"; # input file containing nubiscan results
my $ge   = "geneid.txt";          # file with TE name and length

open(FILE, "<", $file) or die("Unable to open file $file: $!");
open(GE,   "<", $ge)   or die("Unable to open $ge: $!");
my @file_data = <FILE>;
my @ge_data   = <GE>;             # input stored in an array
close(FILE);
close(GE);

foreach my $line (@ge_data) {
    my @line  = split(/\s+/, $line);
    my $start = $line[0];
    #print "$line[0] \n";
    foreach my $values (@file_data) {
        my @values = split(/\s+/, $values);
        my $id     = $values[0];
        #print "$values[0] \n";
        my $width = 11;
        if ($values[0] eq $line[0]) {
            #print "the gene id \n";
            print "$line[0] \n";
        }
    }
}
This code prints all the values just as they are; the split/compare does not print them the way I want. The column 1 data:
ENSMUSG00000050310
ENSMUSG00000025583
ENSMUSG00000021198
ENSMUSG00000052595
ENSMUSG00000015243
ENSMUSG00000024130
ENSMUSG00000031333
ENSMUSG00000026842
ENSMUSG00000026596
ENSMUSG00000020532
ENSMUSG00000026003
ENSMUSG00000023328
ENSMUSG00000029580
ENSMUSG00000054808
ENSMUSG00000026836
Column 2:
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000050310
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
ENSMUSG00000025583
In the second column the values are repeated, so I want to compare the values in both columns and then print those present in both only once, without repetition.

Re: Comparing two columns
by apl (Monsignor) on Jul 10, 2009 at 14:02 UTC
    If there is that much duplication, and that many columns,
    1. write a short script to split the one file (containing the two columns) into two files (A1, A2) of one column each (a sketch of this step follows below)
    2. use your native sort utility on A1 and A2 to produce two files (B1, B2) containing only unique data
    3. use your native diff utility to do the comparison between B1 and B2
    Never recreate what others have already done for you.
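    A minimal sketch of step 1, assuming the two columns sit together in a single whitespace-separated file (the names columns.txt, A1 and A2 are placeholders, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Step 1: split a two-column file into two one-column files.
    # "columns.txt", "A1" and "A2" are assumed names for illustration.
    open my $in, '<', 'columns.txt' or die "Cannot open columns.txt: $!";
    open my $a1, '>', 'A1'          or die "Cannot open A1: $!";
    open my $a2, '>', 'A2'          or die "Cannot open A2: $!";

    while (my $line = <$in>) {
        my ($col1, $col2) = split /\s+/, $line;
        print $a1 "$col1\n" if defined $col1;
        print $a2 "$col2\n" if defined $col2;
    }

    close $in;
    close $a1;
    close $a2;

    # Steps 2 and 3 are then along the lines of:
    #   sort -u A1 > B1 ; sort -u A2 > B2 ; diff B1 B2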
Re: Comparing two columns
by ig (Vicar) on Jul 10, 2009 at 15:52 UTC

    The following makes no assumption about which of your input files has duplicates. It demonstrates the use of a hash of arrays to capture multiple values for each key. You may be able to adapt it to produce the results you need.

    my %data;

    foreach my $line (@ge_data) {
        my ($id, $rest) = split(/\s+/, $line, 2);
        push(@{$data{$id}{ge_data}}, $rest);
    }
    foreach my $line (@file_data) {
        my ($id, $rest) = split(/\s+/, $line, 2);
        push(@{$data{$id}{file_data}}, $rest);
    }

    foreach my $id (sort keys %data) {
        next unless (exists $data{$id}{ge_data});
        next unless (exists $data{$id}{file_data});
        print "$id:\n";
        print "\tge_data:\n";
        print "\t\t" . join("\t\t", @{$data{$id}{ge_data}});
        print "\tfile_data:\n";
        print "\t\t" . join("\t\t", @{$data{$id}{file_data}});
    }
Re: Comparing two columns
by BioLion (Curate) on Jul 10, 2009 at 11:42 UTC

    Can you post some representative sample data please?

    It sounds like you would be better off using a hash for lookups, rather than iterating over a whole array for every line - see Tie::File::AsHash.
    If you have multiple entries for a given id, you might be better off rolling your own:

    push @{ $data{ $split_line[0] } }, $line;
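    For the simple "print each common id once" case, a minimal untested sketch along these lines, reusing the @ge_data and @file_data arrays from the original post, might do:

    # Index the ids from the gene file, then scan the results file,
    # printing each id that occurs in both, once only.
    my %in_ge = map { (split /\s+/)[0] => 1 } @ge_data;

    my %seen;
    foreach my $line (@file_data) {
        my ($id) = split /\s+/, $line;
        next unless defined $id && $in_ge{$id};   # present in both columns?
        print "$id\n" unless $seen{$id}++;        # report each match only once
    }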

    Just a something something...
      I have posted some sample data.

        The data still isn't really representative though is it!? :P

        Just store the file with the single entries as a hash:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Tie::File::AsHash;

        my $first  = "single_records.txt";
        my $second = "multiple_records.txt";

        warn "tie-ing $first...\n";
        tie my %hsh, 'Tie::File::AsHash', $first, split => qr/\s+/
            or die "Problem tying %hsh: $!";

        open(my $fh, '<', $second) or die "Failed to open $second : $!"; ## always check for success on fh

        while (<$fh>) {
            chomp(my $line = $_);
            my ($id) = split /\s+/, $line;    ## capture id
            ## compare to tied hash of single records
            if (exists $hsh{$id}) {
                print "$line matched $id : $hsh{$id}\n";
            }
        }

        ## tidy up
        close($fh) or die "Failed to close $second : $!";
        untie %hsh;

        Or you can do the other thing I suggested and hold an array ref of all the lines with a certain id, then print out all the lines when you see the id in the single record file.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $first  = "single_records.txt";
        my $second = "multiple_records.txt";

        open(my $fh, '<', $second) or die "Failed to open $second : $!"; ## always check for success on fh

        my %second_records = ();
        while (<$fh>) {
            my $line = $_;
            my ($id) = split /\s+/, $line;    ## capture id
            push @{ $second_records{$id} }, $line;
        }

        ## tidy up
        close($fh) or die "Failed to close $second : $!";

        ## open single record file and compare
        open($fh, '<', $first) or die "Failed to open $first : $!";
        while (<$fh>) {
            chomp(my $line = $_);
            my ($id) = split /\s+/, $line;    ## capture id
            if (exists $second_records{$id}) {
                print join '', "$line matches records:\n", @{ $second_records{$id} };
            }
        }
        close($fh) or die "Failed to close $first : $!";

        Hope this helps.

        Just a something something...
Re: Comparing two columns
by mzedeler (Pilgrim) on Jul 10, 2009 at 20:23 UTC

    This snippet just prints the keys common to the two files file1.txt and file2.txt:

    use strict;
    use warnings;

    sub get_file {
        open my $FILE, '<', shift or die $!;
        return map { chomp; $_ => $_ } <$FILE>;
    }

    my %a = get_file 'file1.txt';
    my %b = get_file 'file2.txt';

    {
        no warnings 'uninitialized';
        print "$_\n" for grep { $_ } @a{ keys %b };
    }

    The above can also easily be done using sort, uniq and grep.

Re: Comparing two columns
by perliff (Monk) on Jul 11, 2009 at 09:05 UTC
    You can use the native unix sort and comm commands if you have the column data in separate files.

    e.g. if file1 contains the column 1 data and file2 contains the column 2 data, type the following on the command line.

    sort file1 > file1.sorted
    sort file2 > file2.sorted
    comm -12 file1.sorted file2.sorted | sort | uniq > result.txt
    The result.txt file contains the values that are present in both files (the repeated lines are removed by the sort and uniq). Or you can use it inside the program, like this:
    system ("sort file2 > file2.sorted"); system ("sort file1 > file1.sorted"); my $result = `comm -12 file1.sorted file2.sorted | sort | uniq`;
    Now $result contains the identifiers that are common to both files, without repetition.
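    If you then want the ids as a list, a small untested follow-up might be:

    # Split the captured backtick output (in $result above) into individual ids.
    my @common_ids = split /\n/, $result;
    print scalar(@common_ids), " common identifiers\n";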

    perliff

    ----------------------

    -with perl on my side